6 datasets found
  1. US. baby names 1880 - 2022

    • kaggle.com
    Updated Jul 29, 2023
    Cite
    Saleh Zeer (2023). US. baby names 1880 - 2022 [Dataset]. https://www.kaggle.com/datasets/salehzeer/babynames
    Dataset updated
    Jul 29, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Saleh Zeer
    Description

    # National Data on the relative frequency of given names in the population of U.S. births where the individual has a Social Security Number

    For each year of birth YYYY after 1879, we created a comma-delimited file called yobYYYY.txt. Each record in the individual annual files has the format "name,sex,number," where name is 2 to 15 characters, sex is M (male) or F (female) and "number" is the number of occurrences of the name. Each file is sorted first on sex and then on number of occurrences in descending order. When there is a tie on the number of occurrences, names are listed in alphabetical order. This sorting makes it easy to determine a name's rank. The first record for each sex has rank 1, the second record for each sex has rank 2, and so forth. To safeguard privacy, we restrict our list of names to those with at least 5 occurrences.
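A minimal sketch of how the sort order described above translates into ranks, assuming each record is exactly `name,sex,number` with no header row. The function name `rank_names` and the sample counts are hypothetical, made up for illustration and not taken from any real file.

```python
import csv
import io


def rank_names(lines):
    """Return {(sex, name): rank} for one yobYYYY.txt-style file.

    Relies on the documented ordering: records are grouped by sex and
    sorted by descending count, so the n-th record for a sex has rank n.
    """
    ranks = {}
    position = {}  # running record counter per sex
    for name, sex, _number in csv.reader(lines):
        position[sex] = position.get(sex, 0) + 1
        ranks[(sex, name)] = position[sex]
    return ranks


# Hypothetical sample records (illustrative counts only).
sample = io.StringIO("Mary,F,120\nAnna,F,80\nJohn,M,150\nWill,M,150\n")
ranks = rank_names(sample)
print(ranks[("F", "Anna")], ranks[("M", "Will")])
```

Tied names simply receive consecutive ranks in file order here, which follows the "second record has rank 2" wording; a different tie policy would need the counts as well.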

    https://www.ssa.gov/oact/babynames/limits.html

  2. ‘US Health Insurance Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 29, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘US Health Insurance Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-us-health-insurance-dataset-920a/latest
    Dataset updated
    Feb 29, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘US Health Insurance Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/teertha/ushealthinsurancedataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    The venerable insurance industry is no stranger to data-driven decision making. Yet in today's rapidly transforming digital landscape, insurance is struggling to adapt to and benefit from new technologies compared to other industries, even within the BFSI sphere (compared to the banking sector, for example). Some of the unique challenges faced by the insurance business are: extremely complex underwriting rule-sets that differ radically across product lines; many non-KYC environments that lack a centralized customer information base; a complex relationship with consumers in traditional risk underwriting, where customer centricity sometimes runs counter to business profit; and the inertia of regulatory compliance.

    Despite this, emergent technologies like AI and Block Chain have brought a radical change in Insurance, and Data Analytics sits at the core of this transformation. We can identify 4 key factors behind the emergence of Analytics as a crucial part of InsurTech:

    • Big Data: The explosion of unstructured data in the form of images, videos, text, emails, social media
    • AI: The recent advances in Machine Learning and Deep Learning that can enable businesses to gain insight, do predictive analytics and build cost- and time-efficient innovative solutions
    • Real time Processing: Ability of real time information processing through various data feeds (e.g. social media, news)
    • Increased Computing Power: a complex ecosystem of new analytics vendors and solutions that enable carriers to combine data sources, external insights, and advanced modeling techniques in order to glean insights that were not possible before.

    This dataset can support a simple yet illuminating study of risk underwriting in health insurance: the interplay of various attributes of the insured and how they affect the insurance premium.

    Content

    This dataset contains 1338 rows of insured data, where the Insurance charges are given against the following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker and Region. There are no missing or undefined values in the dataset.

    Inspiration

    This relatively simple dataset should be an excellent starting point for EDA, Statistical Analysis and Hypothesis testing and training Linear Regression models for predicting Insurance Premium Charges.

    Proposed Tasks:

    • Exploratory Data Analytics
    • Statistical hypothesis testing
    • Statistical Modeling
    • Linear Regression
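As a rough illustration of the linear regression task proposed above, the sketch below fits charges against a single numeric attribute with ordinary least squares. The function name `fit_line` and the sample ages and charges are hypothetical placeholders, not values from the dataset; a real analysis would load all 1338 rows and would likely use several attributes at once.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy rows (age, charges); chosen to lie exactly on a line.
ages = [20, 30, 40, 50]
charges = [2000.0, 3000.0, 4000.0, 5000.0]
slope, intercept = fit_line(ages, charges)
print(slope, intercept)
```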

    --- Original source retains full ownership of the source dataset ---

  3. United States Baby Names Count

    • kaggle.com
    Updated Dec 4, 2023
    Cite
    The Devastator (2023). United States Baby Names Count [Dataset]. https://www.kaggle.com/datasets/thedevastator/united-states-baby-names-count/data
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    United States Baby Names Dataset

    By Amber Thomas [source]

    About this dataset

    The data is based on a complete sample of records on Social Security card applications as of March 2021 and is presented in three main files: baby-names-national.csv, baby-names-state.csv, and baby-names-territories.csv. These files contain detailed information about names given to babies at the national level (50 states and District of Columbia), state level (individual states), and territory level (including American Samoa, Guam, Northern Mariana Islands, Puerto Rico and U.S. Virgin Islands) respectively.

    Each entry in the dataset includes several key attributes such as state_abb or territory_code representing the abbreviation or code indicating the specific state or territory where the baby was born. The sex attribute denotes the gender of each baby – either male or female – while year represents the specific birth year when each baby was born.

    Another important attribute is name, which indicates the given name selected for each newborn. The count attribute gives the number of babies who received a particular name for a given state or territory, gender, and year.

    It's also worth noting that all names included have at least two characters in length to ensure high data quality standards.

    How to use the dataset

    - Understanding the Columns

    The dataset consists of multiple columns with specific information about each baby name entry. Here are the key columns in this dataset:

    • state_abb: The abbreviation of the state or territory where the baby was born.
    • sex: The gender of the baby.
    • year: The year in which the baby was born.
    • name: The given name of the baby.
    • count: The number of babies with a specific name born in a certain state, gender, and year.

    - Exploring National Data

    To analyze national trends or overall popularity across all states and years: a) Focus on baby-names-national.csv. b) Use columns like name, sex, year, and count to study trends over time.

    - Analyzing State-Level Data

    To examine specific states' data: a) Utilize baby-names-state.csv file. b) Filter data by desired states using state_abb column values. c) Combine analysis with other relevant attributes like gender, year, etc., for detailed insights.

    - Understanding Territory Data

    For insights into United States territories (American Samoa, Guam, Northern Mariana Islands, Puerto Rico, U.S. Virgin Islands): a) Access informative data from baby-names-territories.csv. b) Analyze based on similar principles as state-level data but considering unique territory factors.

    - Gender-Specific Analysis

    You can study names' popularity specifically among males or females by filtering the data using the sex column. This will allow you to explore gender-specific naming trends and preferences.

    - Identifying Regional Patterns

    To identify naming patterns in specific regions: a) Analyze state-level or territory-level data. b) Look for variations in name popularity across different states or territories.

    - Analyzing Name Popularity over Time

    Track the popularity of specific names over time using the name, year, and count columns. This can help uncover trends, fluctuations, and changes in names' usage and popularity.
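The tracking idea above can be sketched as a small aggregation over the name, year, and count columns. This assumes rows arrive as dictionaries keyed by column name (as `csv.DictReader` would yield them); the function name `yearly_totals` and the sample rows are hypothetical and only illustrate the shape of the data.

```python
from collections import defaultdict


def yearly_totals(rows, name, sex=None):
    """Sum `count` per year for one name, optionally restricted to one sex."""
    totals = defaultdict(int)
    for row in rows:
        if row["name"] != name:
            continue
        if sex is not None and row["sex"] != sex:
            continue
        totals[int(row["year"])] += int(row["count"])
    # Return a plain dict ordered by year for easy plotting or printing.
    return dict(sorted(totals.items()))


# Hypothetical rows shaped like the state-level file (values are made up).
rows = [
    {"state_abb": "CA", "sex": "F", "year": "2001", "name": "Ava", "count": "10"},
    {"state_abb": "NY", "sex": "F", "year": "2001", "name": "Ava", "count": "7"},
    {"state_abb": "CA", "sex": "F", "year": "2000", "name": "Ava", "count": "5"},
    {"state_abb": "CA", "sex": "M", "year": "2001", "name": "Noah", "count": "9"},
]
trend = yearly_totals(rows, "Ava", sex="F")
print(trend)
```

Summing across states is only one choice; for the national file, each name, sex, and year combination would presumably appear once, so the sum would be a no-op.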

    - Comparing Names and Variations

    Use this

    Research Ideas

    • Tracking Popularity Trends: This dataset can be used to analyze the popularity of baby names over time. By examining the count of babies with a specific name born in different years, trends and shifts in naming preferences can be identified.
    • Gender Analysis: The dataset includes information on the gender of each baby. It can be used to study gender patterns and differences in naming choices. For example, it would be possible to compare the frequency and popularity of certain names among males and females.
    • Regional Variations: With state abbreviations provided, it is possible to explore regional variations in baby naming trends within the United States. Researchers could examine how certain names are more popular or unique to specific states or territories, highlighting cultural or geographical factors that influence naming choices.

    Acknowledgements

    If you use this dataset in your research, please credit the original a...

  4. Datasets for research on Resilience of Blockchain Overlay Networks

    • springernature.figshare.com
    txt
    Updated Aug 2, 2023
    Cite
    Aristodemos Paphitis; Nicolas Kourtellis; Michael Sirivianos (2023). Datasets for research on Resilience of Blockchain Overlay Networks [Dataset]. http://doi.org/10.6084/m9.figshare.23522919.v1
    Dataset updated
    Aug 2, 2023
    Dataset provided by
    figshare
    Authors
    Aristodemos Paphitis; Nicolas Kourtellis; Michael Sirivianos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset ReadMe

    Overview

    • This data was collected to study the structural properties of blockchain overlay networks. It contains several snapshots of seven different networks from the period 26 Jun 2020 to 20 July 2020.
    • The networks included are (alphabetically): Bitcoin, Bitcoin Cash, Dash, Dogecoin, Ethereum, Litecoin, and ZCash.
    • Studying the graph characteristics of these networks is beneficial:

      • It helps us evaluate the system's performance, robustness, and scalability by examining the network structure, node distribution, and communication patterns.
      • This analysis helps us identify bottlenecks and find ways to optimize efficiency and throughput.

      Moreover, understanding the vulnerabilities and attack possibilities unique to these networks allows us to develop proactive defense mechanisms and mitigate potential threats.

    Data collection method: we continuously ask all reachable nodes for their known peers. In Bitcoin's parlance, we send GETADDR messages and store all ADDR replies, drawing a connection from the sending node to every IP address contained in the ADDR message.

    Data Description

    All IP addresses have been replaced by numbers (NodeID) for ethical reasons. NodeIDs are consistent across all files: the same NodeID corresponds to the same IP address in ALL files (if present). Filenames contain the timestamp and the corresponding network. The date-time format is YYYYMMDD-HHMISS.

    • File Contents: The edgelist files store information about the structure of the connectivity graph. Each file represents an edgelist of a graph at the specified timestamp. Each line in a file corresponds to the list of peers known to a node. The NodeID of the node is the first number on each line. Example: the following line

      S N1 N2 N3 N4

      means that node S knows of nodes N1..N4; their IP addresses were included in S's ADDR responses.

    To process the files in SNAP and NetworkX, the appropriate transformations have to be made. Please read the relevant documentation to find the appropriate input format.
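One possible transformation, sketched with the standard library only: read each `S N1 N2 ...` line into an adjacency map, then flatten it into directed (source, peer) pairs that most graph libraries can ingest. This assumes NodeIDs are integers separated by whitespace; the function names are hypothetical, and the timestamp helper reads "MI" in YYYYMMDD-HHMISS as minutes.

```python
from datetime import datetime


def parse_edgelist(lines):
    """Map each source NodeID to the set of peers it reported knowing."""
    adjacency = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        source = int(parts[0])
        peers = [int(p) for p in parts[1:]]
        adjacency.setdefault(source, set()).update(peers)
    return adjacency


def to_edges(adjacency):
    """Flatten the adjacency map into sorted (source, peer) pairs."""
    return sorted((s, p) for s, peers in adjacency.items() for p in peers)


def parse_timestamp(stamp):
    """Parse the YYYYMMDD-HHMISS part of a filename (assumed layout)."""
    return datetime.strptime(stamp, "%Y%m%d-%H%M%S")


# Hypothetical lines: node 1 knows nodes 2 and 3, node 2 knows node 3.
adjacency = parse_edgelist(["1 2 3", "2 3"])
edges = to_edges(adjacency)
print(edges, parse_timestamp("20200626-134501"))
```

The edges are kept directed because "S knows of N" does not imply the reverse; whether to symmetrize them depends on the analysis.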

    Research

    This dataset has been used in the following works:

    • @inproceedings{aris_ssec,
      author = {Paphitis, Aristodemos and Kourtellis, Nicolas and Sirivianos, Michael},
      title = {Graph Analysis of Blockchain {P2P} Overlays and their Security Implications},
      booktitle = {Proceedings of the 9th International Symposium on Security and Privacy in Social Networks and Big Data (SocialSec 2023)},
      series = {Lecture Notes in Computer Science},
      volume = {13983},
      publisher = {Springer Nature},
      year = {2023},
      }

    • @inproceedings{aris_nss,
      author = {Paphitis, Aristodemos and Kourtellis, Nicolas and Sirivianos, Michael},
      title = {Resilience of Blockchain Overlay Networks},
      booktitle = {Proceedings of the 17th International Conference on Network and System Security (NSS 2023)},
      series = {Lecture Notes in Computer Science},
      volume = {14097},
      publisher = {Springer Nature},
      year = {2023},
      }

    License and Attribution

    • You are free to share and adapt according to CC BY (4.0) https://creativecommons.org/licenses/by/4.0/
    • Please cite as:

      Aristodemos Paphitis, Nicolas Kourtellis, and Michael Sirivianos. A First Look into the Structural Properties of Blockchain P2P Overlays. DOI:https://doi.org/10.6084/m9.figshare.23522919

    • bibtex:

      @misc{paphitis_first_nodate,
      author = {Paphitis, Aristodemos and Kourtellis, Nicolas and Sirivianos, Michael},
      title = {A First Look into the Structural Properties of Blockchain {P2P} Overlays},
      howpublished = {Public dataset with figshare},
      doi = {10.6084/m9.figshare.23522919},
      }

    Contact Information

  5. Communities and Crime Dataset (Unnormalized Data)

    • kaggle.com
    Updated Feb 9, 2023
    Cite
    John (2023). Communities and Crime Dataset (Unnormalized Data) [Dataset]. https://www.kaggle.com/datasets/johnp47/communities-and-crime-dataset/discussion
    Dataset updated
    Feb 9, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    John
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source:

    Creator: Michael Redmond (redmond '@' lasalle.edu); Computer Science; La Salle University; Philadelphia, PA, 19141, USA -- culled from 1990 US Census, 1995 US FBI Uniform Crime Report, 1990 US Law Enforcement Management and Administrative Statistics Survey, available from ICPSR at U of Michigan. -- Donor: Michael Redmond (redmond '@' lasalle.edu); Computer Science; La Salle University; Philadelphia, PA, 19141, USA -- Date: July 2009

    Data Set Information:

    Many variables are included so that algorithms that select or learn weights for attributes could be tested. However, clearly unrelated attributes were not included; attributes were picked if there was any plausible connection to crime (N=122), plus the attribute to be predicted (Per Capita Violent Crimes). The variables included in the dataset involve the community, such as the percent of the population considered urban, and the median family income, and involving law enforcement, such as per capita number of police officers, and percent of officers assigned to drug units.

    The per capita violent crimes variable was calculated using population and the sum of crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. There was apparently some controversy in some states concerning the counting of rapes. These resulted in missing values for rape, which resulted in incorrect values for per capita violent crime. These cities are not included in the dataset. Many of these omitted communities were from the midwestern USA.

    Data is described below based on original values. All numeric data was normalized into the decimal range 0.00-1.00 using an Unsupervised, equal-interval binning method. Attributes retain their distribution and skew (hence for example the population attribute has a mean value of 0.06 because most communities are small). E.g. An attribute described as 'mean people per household' is actually the normalized (0-1) version of that value.

    The normalization preserves rough ratios of values WITHIN an attribute (e.g. double the value for double the population within the available precision - except for extreme values (all values more than 3 SD above the mean are normalized to 1.00; all values more than 3 SD below the mean are normalized to 0.00)).

    However, the normalization does not preserve relationships between values BETWEEN attributes (e.g. it would not be meaningful to compare the value for whitePerCap with the value for blackPerCap for a community)

    A limitation was that the LEMAS survey was of the police departments with at least 100 officers, plus a random sample of smaller departments. For our purposes, communities not found in both census and crime datasets were omitted. Many communities are missing LEMAS data.

    Attribute Information:

    (125 predictive, 4 non-predictive, 18 potential goal)

    • communityname: Community name - not predictive - for information only (string)
    • state: US state (by 2 letter postal abbreviation) (nominal)
    • countyCode: numeric code for county - not predictive, and many missing values (numeric)
    • communityCode: numeric code for community - not predictive and many missing values (numeric)
    • fold: fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive (numeric - integer)
    • population: population for community (numeric - expected to be integer)
    • householdsize: mean people per household (numeric - decimal)
    • racepctblack: percentage of population that is african american (numeric - decimal)
    • racePctWhite: percentage of population that is caucasian (numeric - decimal)
    • racePctAsian: percentage of population that is of asian heritage (numeric - decimal)
    • racePctHisp: percentage of population that is of hispanic heritage (numeric - decimal)
    • agePct12t21: percentage of population that is 12-21 in age (numeric - decimal)
    • agePct12t29: percentage of population that is 12-29 in age (numeric - decimal)
    • agePct16t24: percentage of population that is 16-24 in age (numeric - decimal)
    • agePct65up: percentage of population that is 65 and over in age (numeric - decimal)
    • numbUrban: number of people living in areas classified as urban (numeric - expected to be integer)
    • pctUrban: percentage of people living in areas classified as urban (numeric - decimal)
    • medIncome: median household income (numeric - may be integer)
    • pctWWage: percentage of households with wage or salary income in 1989 (numeric - decimal)
    • pctWFarmSelf: percentage of households with farm or self employment income in 1989 (numeric - decimal)
    • pctWInvInc: percentage of households with investment / rent income in 1989 (numeric - decimal)
    • pctWSocSec: percentage of households with social security income in 1989 (numeric - decimal)
    • pctWPubAsst: pe...

  6. COVID Impact Survey - Public Data

    • data.world
    csv, zip
    Updated Oct 16, 2024
    Cite
    The Associated Press (2024). COVID Impact Survey - Public Data [Dataset]. https://data.world/associatedpress/covid-impact-survey-public-data
    Dataset updated
    Oct 16, 2024
    Dataset provided by
    data.world, Inc.
    Authors
    The Associated Press
    Description

    Overview

    The Associated Press is sharing data from the COVID Impact Survey, which provides statistics about physical health, mental health, economic security and social dynamics related to the coronavirus pandemic in the United States.

    Conducted by NORC at the University of Chicago for the Data Foundation, the probability-based survey provides estimates for the United States as a whole, as well as in 10 states (California, Colorado, Florida, Louisiana, Minnesota, Missouri, Montana, New York, Oregon and Texas) and eight metropolitan areas (Atlanta, Baltimore, Birmingham, Chicago, Cleveland, Columbus, Phoenix and Pittsburgh).

    The survey is designed to allow for an ongoing gauge of public perception, health and economic status to see what is shifting during the pandemic. When multiple sets of data are available, it will allow for the tracking of how issues ranging from COVID-19 symptoms to economic status change over time.

    The survey is focused on three core areas of research:

    • Physical Health: Symptoms related to COVID-19, relevant existing conditions and health insurance coverage.
    • Economic and Financial Health: Employment, food security, and government cash assistance.
    • Social and Mental Health: Communication with friends and family, anxiety and volunteerism. (Questions based on those used on the U.S. Census Bureau’s Current Population Survey.)

    Using this Data - IMPORTANT

    This is survey data and must be properly weighted during analysis: DO NOT REPORT THIS DATA AS RAW OR AGGREGATE NUMBERS!!

    Instead, use our queries linked below or statistical software such as R or SPSS to weight the data.
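To illustrate why weighting matters, here is a minimal sketch of a weighted share computed from (response, weight) pairs. The function name `weighted_share`, the response labels, and the weights are all hypothetical; the actual weight variable and response codes have to be taken from the survey codebook, and this does not replace proper survey software for variance estimates.

```python
def weighted_share(records, answer):
    """Share of total survey weight carried by respondents giving `answer`."""
    total = sum(weight for _, weight in records)
    if total == 0:
        raise ValueError("total weight is zero")
    matching = sum(weight for response, weight in records if response == answer)
    return matching / total


# Hypothetical respondents: (response to a question, survey weight).
records = [("never", 2.0), ("often", 1.0), ("never", 1.0)]
raw = sum(1 for response, _ in records if response == "never") / len(records)
weighted = weighted_share(records, "never")
print(round(raw, 3), weighted)
```

In this toy case the raw share is two in three while the weighted share is three in four, which is the kind of gap the warning above is about.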

    Queries

    If you'd like to create a table to see how people nationally or in your state or city feel about a topic in the survey, use the survey questionnaire and codebook to match a question (the variable label) to a variable name. For instance, "How often have you felt lonely in the past 7 days?" is variable "soc5c".

    Nationally: Go to this query and enter soc5c as the variable. Hit the blue Run Query button in the upper right hand corner.

    Local or State: To find figures for that response in a specific state, go to this query and type in a state name and soc5c as the variable, and then hit the blue Run Query button in the upper right hand corner.

    The resulting sentence you could write out of these queries is: "People in some states are less likely to report loneliness than others. For example, 66% of Louisianans report feeling lonely on none of the last seven days, compared with 52% of Californians. Nationally, 60% of people said they hadn't felt lonely."

    Margin of Error

    The margin of error for the national and regional surveys is found in the attached methods statement. You will need the margin of error to determine if the comparisons are statistically significant. If the difference is:

    • At least twice the margin of error, you can report there is a clear difference.
    • At least as large as the margin of error, you can report there is a slight or apparent difference.
    • Less than or equal to the margin of error, you can report that the respondents are divided or there is no difference.

    A Note on Timing

    Survey results will generally be posted under embargo on Tuesday evenings. The data is available for release at 1 p.m. ET Thursdays.
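The margin-of-error rule of thumb listed above can be sketched as a small helper. The function name and the margin values used below are hypothetical; the real margin comes from the methods statement. The listed rules overlap when the difference exactly equals the margin, so this sketch assumes that case counts as a slight difference.

```python
def describe_difference(pct_a, pct_b, margin_of_error):
    """Classify the gap between two percentages against the margin of error."""
    diff = abs(pct_a - pct_b)
    if diff >= 2 * margin_of_error:
        return "clear difference"
    # Assumption: a difference exactly equal to the margin is treated as slight.
    if diff >= margin_of_error:
        return "slight or apparent difference"
    return "divided or no difference"


# 66% vs 52% is the example quoted earlier; the margins are made up.
print(describe_difference(66, 52, 5))
print(describe_difference(66, 52, 10))
print(describe_difference(60, 58, 5))
```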

    About the Data

    The survey data will be provided under embargo in both comma-delimited and statistical formats.

    Each set of survey data will be numbered and have the date the embargo lifts in front of it in the format of: 01_April_30_covid_impact_survey. The survey has been organized by the Data Foundation, a non-profit non-partisan think tank, and is sponsored by the Federal Reserve Bank of Minneapolis and the Packard Foundation. It is conducted by NORC at the University of Chicago, a non-partisan research organization. (NORC is not an abbreviation; it is part of the organization's formal name.)

    Data for the national estimates are collected using the AmeriSpeak Panel, NORC’s probability-based panel designed to be representative of the U.S. household population. Interviews are conducted with adults age 18 and over representing the 50 states and the District of Columbia. Panel members are randomly drawn from AmeriSpeak with a target of achieving 2,000 interviews in each survey. Invited panel members may complete the survey online or by telephone with an NORC telephone interviewer.

    Once all the study data have been made final, an iterative raking process is used to adjust for any survey nonresponse as well as any noncoverage or under and oversampling resulting from the study specific sample design. Raking variables include age, gender, census division, race/ethnicity, education, and county groupings based on county level counts of the number of COVID-19 deaths. Demographic weighting variables were obtained from the 2020 Current Population Survey. The count of COVID-19 deaths by county was obtained from USA Facts. The weighted data reflect the U.S. population of adults age 18 and over.

    Data for the regional estimates are collected using a multi-mode address-based (ABS) approach that allows residents of each area to complete the interview via web or with an NORC telephone interviewer. All sampled households are mailed a postcard inviting them to complete the survey either online using a unique PIN or via telephone by calling a toll-free number. Interviews are conducted with adults age 18 and over with a target of achieving 400 interviews in each region in each survey. Additional details on the survey methodology and the survey questionnaire are attached below or can be found at https://www.covid-impact.org.

    Attribution

    Results should be credited to the COVID Impact Survey, conducted by NORC at the University of Chicago for the Data Foundation.

    AP Data Distributions

    ​To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.
