# National Data on the relative frequency of given names in the population of U.S. births where the individual has a Social Security Number
For each year of birth YYYY after 1879, we created a comma-delimited file called yobYYYY.txt. Each record in the individual annual files has the format "name,sex,number," where name is 2 to 15 characters, sex is M (male) or F (female) and "number" is the number of occurrences of the name. Each file is sorted first on sex and then on number of occurrences in descending order. When there is a tie on the number of occurrences, names are listed in alphabetical order. This sorting makes it easy to determine a name's rank. The first record for each sex has rank 1, the second record for each sex has rank 2, and so forth. To safeguard privacy, we restrict our list of names to those with at least 5 occurrences.
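Because each file is already sorted by sex and then by descending count, a name's rank is simply its position among the records for its sex. A minimal sketch of that idea follows; the sample records are illustrative values, not real counts, and the helper name `rank_names` is hypothetical.

```python
import csv
import io

# Illustrative records only (not real counts), laid out like a yobYYYY.txt file:
# sorted by sex, then by count descending, then alphabetically on ties.
SAMPLE = """Mary,F,120
Anna,F,80
Emma,F,80
John,M,150
William,M,90
"""

def rank_names(lines):
    """Return {(name, sex): rank}, where rank is the record's position within its sex."""
    ranks = {}
    position = {}  # running record counter per sex
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        name, sex = row[0], row[1]
        position[sex] = position.get(sex, 0) + 1
        ranks[(name, sex)] = position[sex]
    return ranks

ranks = rank_names(io.StringIO(SAMPLE))
print(ranks[("Emma", "F")])  # 3
```

The same function should accept an open file object, for example one opened on a yobYYYY.txt file with `newline=""`, since `csv.reader` only needs an iterable of lines.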
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘US Health Insurance Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/teertha/ushealthinsurancedataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The venerable insurance industry is no stranger to data-driven decision making. Yet in today's rapidly transforming digital landscape, insurance is struggling to adapt to and benefit from new technologies compared with other industries, even within the BFSI sphere (compared with the banking sector, for example). Some of the unique challenges the insurance business faces are: extremely complex underwriting rule-sets that differ radically across product lines; many non-KYC environments that lack a centralized customer information base; complex relationships with consumers in traditional risk underwriting, where customer centricity sometimes runs counter to business profit; and the inertia of regulatory compliance.
Despite this, emergent technologies like AI and Block Chain have brought a radical change in Insurance, and Data Analytics sits at the core of this transformation. We can identify 4 key factors behind the emergence of Analytics as a crucial part of InsurTech:
This dataset can support a simple yet illuminating study of risk underwriting in health insurance: how the various attributes of the insured interact and how they affect the insurance premium.
This dataset contains 1338 rows of insured data, where the Insurance charges are given against the following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker and Region. There are no missing or undefined values in the dataset.
This relatively simple dataset should be an excellent starting point for EDA, Statistical Analysis and Hypothesis testing and training Linear Regression models for predicting Insurance Premium Charges.
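As a rough illustration of the linear regression task mentioned above, the sketch below fits a one-predictor least-squares line using only the standard library. The rows are toy values invented for the example, not values from the dataset, and the choice of age as the single predictor is an assumption made to keep the example short.

```python
# Toy rows for illustration only; these are NOT values from the dataset.
# Each row is (age, charges).
rows = [
    (20, 2000.0),
    (30, 3000.0),
    (40, 4000.0),
    (50, 5000.0),
]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

ages = [age for age, _ in rows]
charges = [charge for _, charge in rows]
slope, intercept = fit_line(ages, charges)
print(slope, intercept)
```

A fuller model would also encode the categorical attributes (sex, smoker, region) as indicator variables and fit all predictors jointly.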
Proposed Tasks:
- Exploratory Data Analytics
- Statistical hypothesis testing
- Statistical Modeling
- Linear Regression
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
By Amber Thomas [source]
The data is based on a complete sample of records on Social Security card applications as of March 2021 and is presented in three main files: baby-names-national.csv, baby-names-state.csv, and baby-names-territories.csv. These files contain detailed information about names given to babies at the national level (50 states and District of Columbia), state level (individual states), and territory level (including American Samoa, Guam, Northern Mariana Islands, Puerto Rico and U.S. Virgin Islands) respectively.
Each entry in the dataset includes several key attributes such as state_abb or territory_code representing the abbreviation or code indicating the specific state or territory where the baby was born. The sex attribute denotes the gender of each baby – either male or female – while year represents the specific birth year when each baby was born.
Another important attribute is name, which gives the given name chosen for each newborn. The count attribute gives the number of babies who received a particular name for a specific combination of state or territory, gender, and year.
It's also worth noting that all names included are at least two characters long.
- Understanding the Columns
The dataset consists of multiple columns with specific information about each baby name entry. Here are the key columns in this dataset:
- state_abb: The abbreviation of the state or territory where the baby was born.
- sex: The gender of the baby.
- year: The year in which the baby was born.
- name: The given name of the baby.
- count: The number of babies with a specific name born in a certain state, gender, and year.
- Exploring National Data
To analyze national trends or overall popularity across all states and years: a) Focus on baby-names-national.csv. b) Use columns like name, sex, year, and count to study trends over time.
- Analyzing State-Level Data
To examine specific states' data: a) Use the baby-names-state.csv file. b) Filter the data by the desired states using state_abb column values. c) Combine the analysis with other relevant attributes such as gender and year for more detailed insights.
- Understanding Territory Data
For insights into United States territories (American Samoa, Guam, Northern Mariana Islands, Puerto Rico, U.S. Virgin Islands): a) Use the data in baby-names-territories.csv. b) Analyze it on the same principles as the state-level data, while taking territory-specific factors into account.
- Gender-Specific Analysis
You can study names' popularity specifically among males or females by filtering the data using the sex column. This will allow you to explore gender-specific naming trends and preferences.
- Identifying Regional Patterns
To identify naming patterns in specific regions: a) Analyze state-level or territory-level data. b) Look for variations in name popularity across different states or territories.
- Analyzing Name Popularity over Time
Track the popularity of specific names over time using the name, year, and count columns. This can help uncover trends, fluctuations, and changes in names' usage and popularity.
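A minimal sketch of this kind of tracking, using only the standard library. It assumes the CSV has a header row with the column names listed above and that sex is coded as `F`/`M`; both are assumptions, and the sample rows are illustrative rather than real counts.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only (not real counts), using the columns described above.
SAMPLE = """state_abb,sex,year,name,count
CA,F,2000,Emma,100
NY,F,2000,Emma,50
CA,F,2001,Emma,120
CA,M,2001,Liam,30
"""

def yearly_counts(lines, name, sex):
    """Sum `count` per year for one name and sex, across all states in the input."""
    totals = defaultdict(int)
    for row in csv.DictReader(lines):
        if row["name"] == name and row["sex"] == sex:
            totals[int(row["year"])] += int(row["count"])
    return dict(sorted(totals.items()))

trend = yearly_counts(io.StringIO(SAMPLE), "Emma", "F")
print(trend)  # {2000: 150, 2001: 120}
```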
- Comparing Names and Variations
Use this
- Tracking Popularity Trends: This dataset can be used to analyze the popularity of baby names over time. By examining the count of babies with a specific name born in different years, trends and shifts in naming preferences can be identified.
- Gender Analysis: The dataset includes information on the gender of each baby. It can be used to study gender patterns and differences in naming choices. For example, it would be possible to compare the frequency and popularity of certain names among males and females.
- Regional Variations: With state abbreviations provided, it is possible to explore regional variations in baby naming trends within the United States. Researchers could examine how certain names are more popular in, or unique to, specific states or territories, highlighting cultural or geographical factors that influence naming choices.
If you use this dataset in your research, please credit the original a...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Studying the graph characteristics of these networks is beneficial;
Moreover, understanding the vulnerabilities and attack possibilities unique to these networks allows us to develop proactive defense mechanisms and mitigate potential threats.
Data collection method: ask all reachable nodes continuously for their known peers. In Bitcoin's parlance, we send GETADDR messages and store all ADDR replies, drawing a connection from the sending node to every IP address contained in the ADDR message.
All IP addresses have been replaced by numbers (NodeID) for ethical reasons. NodeIDs are consistent across all files: the same NodeID corresponds to the same IP address in ALL files (if present). Filenames contain the timestamp and the corresponding network. The date-time format is YYYYMMDD-HHMISS.
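The timestamp embedded in a filename can be recovered with a small helper like the one below. This is a sketch under assumptions: the description only states that filenames contain a YYYYMMDD-HHMISS timestamp, so the example filename and the helper name `timestamp_from_filename` are hypothetical.

```python
import re
from datetime import datetime

def timestamp_from_filename(filename):
    """Return the first YYYYMMDD-HHMISS timestamp found in a filename, or None."""
    match = re.search(r"\d{8}-\d{6}", filename)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%Y%m%d-%H%M%S")

# Hypothetical filename; the real naming scheme may differ.
ts = timestamp_from_filename("bitcoin-20230615-134502.edgelist")
print(ts)
```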
File Contents: The edgelist files store information about the structure of the connectivity graph. Each file represents an edgelist of a graph at the specified timestamp. Each line in a file corresponds to the list of peers known to a node. The NodeID of the node is the first number on each line. Example: the following line
S N1 N2 N3 N4
means that node S knows of nodes N1..N4; their IP addresses were included in S's ADDR responses.
To process the files in SNAP or NetworkX, the lines have to be transformed into the input format each tool expects. Please read the relevant documentation to find the appropriate input.
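One possible transformation is to expand each line into explicit (source, peer) pairs. The sketch below assumes whitespace-separated integer NodeIDs, as in the example line above; the helper name `parse_peer_lines` and the sample lines are hypothetical.

```python
def parse_peer_lines(lines):
    """Yield (source, peer) pairs from lines of the form 'S N1 N2 ...'."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        source = int(fields[0])
        for peer in fields[1:]:
            yield source, int(peer)

# Illustrative lines: node 7 knows nodes 1, 2 and 3; node 2 knows node 7.
sample = ["7 1 2 3", "2 7"]
edges = list(parse_peer_lines(sample))
print(edges)  # [(7, 1), (7, 2), (7, 3), (2, 7)]
```

If NetworkX is used, pairs like these can be passed to `DiGraph.add_edges_from`; a directed graph seems the natural fit, since "S knows N1" does not imply the reverse.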
This dataset has been used in the following works:
- @inproceedings{aris_ssec,
author = {Paphitis, Aristodemos and Kourtellis, Nicolas and Sirivianos, Michael},
title = {Graph Analysis of Blockchain {P2P} Overlays and their Security Implications},
booktitle = {Proceedings of the 9th International Symposium on Security and Privacy in Social Networks and Big Data (SocialSec 2023)},
series = {Lecture Notes in Computer Science},
volume = {13983},
publisher = {Springer Nature},
year = {2023},
}
Please cite as:
Aristodemos Paphitis, Nicolas Kourtellis, and Michael Sirivianos. A First Look into the Structural Properties of Blockchain P2P Overlays. DOI:https://doi.org/10.6084/m9.figshare.23522919
bibtex:
@misc{paphitis_first_nodate,
author = {Paphitis, Aristodemos and Kourtellis, Nicolas and Sirivianos, Michael},
title = {A First Look into the Structural Properties of Blockchain {P2P} Overlays},
howpublished = {Public dataset with figshare},
doi = {10.6084/m9.figshare.23522919},
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source:
Creator: Michael Redmond (redmond '@' lasalle.edu); Computer Science; La Salle University; Philadelphia, PA, 19141, USA -- culled from 1990 US Census, 1995 US FBI Uniform Crime Report, 1990 US Law Enforcement Management and Administrative Statistics Survey, available from ICPSR at U of Michigan. -- Donor: Michael Redmond (redmond '@' lasalle.edu); Computer Science; La Salle University; Philadelphia, PA, 19141, USA -- Date: July 2009
Data Set Information:
Many variables are included so that algorithms that select or learn weights for attributes could be tested. However, clearly unrelated attributes were not included; attributes were picked if there was any plausible connection to crime (N=122), plus the attribute to be predicted (Per Capita Violent Crimes). The variables describe the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units.
The per capita violent crimes variable was calculated using population and the sum of the crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. There was apparently some controversy in some states concerning the counting of rapes. This resulted in missing values for rape, which in turn produced incorrect values for per capita violent crime. These cities are not included in the dataset. Many of the omitted communities were from the midwestern USA.
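The calculation described above can be sketched as follows. The function and parameter names are hypothetical, and returning `None` when any component is missing is an assumption that mirrors the decision to leave such communities out; any scaling factor applied in the original data is not shown here.

```python
def violent_crimes_per_capita(population, murders, rapes, robberies, assaults):
    """Sum the four violent crime counts and divide by population.

    Returns None when population is unusable or any count is missing,
    since a missing component would make the total incorrect.
    """
    counts = (murders, rapes, robberies, assaults)
    if population is None or population <= 0:
        return None
    if any(count is None for count in counts):
        return None
    return sum(counts) / population

# Illustrative values only, not taken from the dataset.
rate = violent_crimes_per_capita(10000, 1, 4, 15, 30)
missing = violent_crimes_per_capita(10000, 1, None, 15, 30)
print(rate, missing)
```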
Data is described below based on original values. All numeric data was normalized into the decimal range 0.00-1.00 using an unsupervised, equal-interval binning method. Attributes retain their distribution and skew (hence, for example, the population attribute has a mean value of 0.06 because most communities are small). For example, an attribute described as 'mean people per household' is actually the normalized (0-1) version of that value.
The normalization preserves rough ratios of values WITHIN an attribute (e.g. double the value for double the population within the available precision - except for extreme values (all values more than 3 SD above the mean are normalized to 1.00; all values more than 3 SD below the mean are normalized to 0.00)).
However, the normalization does not preserve relationships between values BETWEEN attributes (e.g. it would not be meaningful to compare the value for whitePerCap with the value for blackPerCap for a community).
A limitation was that the LEMAS survey was of the police departments with at least 100 officers, plus a random sample of smaller departments. For our purposes, communities not found in both census and crime datasets were omitted. Many communities are missing LEMAS data.
Attribute Information:
(125 predictive, 4 non-predictive, 18 potential goal)
- communityname: Community name - not predictive - for information only (string)
- state: US state (by 2 letter postal abbreviation) (nominal)
- countyCode: numeric code for county - not predictive, and many missing values (numeric)
- communityCode: numeric code for community - not predictive and many missing values (numeric)
- fold: fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive (numeric - integer)
- population: population for community (numeric - expected to be integer)
- householdsize: mean people per household (numeric - decimal)
- racepctblack: percentage of population that is african american (numeric - decimal)
- racePctWhite: percentage of population that is caucasian (numeric - decimal)
- racePctAsian: percentage of population that is of asian heritage (numeric - decimal)
- racePctHisp: percentage of population that is of hispanic heritage (numeric - decimal)
- agePct12t21: percentage of population that is 12-21 in age (numeric - decimal)
- agePct12t29: percentage of population that is 12-29 in age (numeric - decimal)
- agePct16t24: percentage of population that is 16-24 in age (numeric - decimal)
- agePct65up: percentage of population that is 65 and over in age (numeric - decimal)
- numbUrban: number of people living in areas classified as urban (numeric - expected to be integer)
- pctUrban: percentage of people living in areas classified as urban (numeric - decimal)
- medIncome: median household income (numeric - may be integer)
- pctWWage: percentage of households with wage or salary income in 1989 (numeric - decimal)
- pctWFarmSelf: percentage of households with farm or self employment income in 1989 (numeric - decimal)
- pctWInvInc: percentage of households with investment / rent income in 1989 (numeric - decimal)
- pctWSocSec: percentage of households with social security income in 1989 (numeric - decimal)
- pctWPubAsst: pe...
The Associated Press is sharing data from the COVID Impact Survey, which provides statistics about physical health, mental health, economic security and social dynamics related to the coronavirus pandemic in the United States.
Conducted by NORC at the University of Chicago for the Data Foundation, the probability-based survey provides estimates for the United States as a whole, as well as in 10 states (California, Colorado, Florida, Louisiana, Minnesota, Missouri, Montana, New York, Oregon and Texas) and eight metropolitan areas (Atlanta, Baltimore, Birmingham, Chicago, Cleveland, Columbus, Phoenix and Pittsburgh).
The survey is designed to allow for an ongoing gauge of public perception, health and economic status to see what is shifting during the pandemic. When multiple sets of data are available, it will allow for the tracking of how issues ranging from COVID-19 symptoms to economic status change over time.
The survey is focused on three core areas of research:
Instead, use our queries linked below or statistical software such as R or SPSS to weight the data.
If you'd like to create a table to see how people nationally or in your state or city feel about a topic in the survey, use the survey questionnaire and codebook to match a question (the variable label) to a variable name. For instance, "How often have you felt lonely in the past 7 days?" is variable "soc5c".
Nationally: Go to this query and enter soc5c as the variable. Hit the blue Run Query button in the upper right hand corner.
Local or State: To find figures for that response in a specific state, go to this query and type in a state name and soc5c as the variable, and then hit the blue Run Query button in the upper right hand corner.
The resulting sentence you could write out of these queries is: "People in some states are less likely to report loneliness than others. For example, 66% of Louisianans report feeling lonely on none of the last seven days, compared with 52% of Californians. Nationally, 60% of people said they hadn't felt lonely."
The margin of error for the national and regional surveys is found in the attached methods statement. You will need the margin of error to determine if the comparisons are statistically significant. If the difference is:
The survey data will be provided under embargo in both comma-delimited and statistical formats.
Each set of survey data will be numbered and have the date the embargo lifts in front of it in the format of: 01_April_30_covid_impact_survey. The survey has been organized by the Data Foundation, a non-profit non-partisan think tank, and is sponsored by the Federal Reserve Bank of Minneapolis and the Packard Foundation. It is conducted by NORC at the University of Chicago, a non-partisan research organization. (NORC is not an abbreviation; it is part of the organization's formal name.)
Data for the national estimates are collected using the AmeriSpeak Panel, NORC’s probability-based panel designed to be representative of the U.S. household population. Interviews are conducted with adults age 18 and over representing the 50 states and the District of Columbia. Panel members are randomly drawn from AmeriSpeak with a target of achieving 2,000 interviews in each survey. Invited panel members may complete the survey online or by telephone with an NORC telephone interviewer.
Once all the study data have been made final, an iterative raking process is used to adjust for any survey nonresponse as well as any noncoverage or under and oversampling resulting from the study specific sample design. Raking variables include age, gender, census division, race/ethnicity, education, and county groupings based on county level counts of the number of COVID-19 deaths. Demographic weighting variables were obtained from the 2020 Current Population Survey. The count of COVID-19 deaths by county was obtained from USA Facts. The weighted data reflect the U.S. population of adults age 18 and over.
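To illustrate what iterative raking does, the sketch below runs iterative proportional fitting on a small two-way table until the row and column totals match their targets. It is a simplified illustration, not the survey's actual procedure: it works on cell totals rather than per-respondent weights, it uses only two raking variables, and every category and target value is hypothetical.

```python
def rake(cells, row_targets, col_targets, max_iterations=100, tol=1e-9):
    """Iterative proportional fitting on a two-way table.

    cells maps (row_category, col_category) to a weighted count.
    Each pass rescales rows to their targets, then columns to theirs.
    """
    cells = dict(cells)  # work on a copy
    for _ in range(max_iterations):
        for axis, targets in ((0, row_targets), (1, col_targets)):
            for category, target in targets.items():
                total = sum(w for key, w in cells.items() if key[axis] == category)
                if total > 0:
                    factor = target / total
                    for key in cells:
                        if key[axis] == category:
                            cells[key] *= factor
        # Columns match (up to rounding) after the second pass; stop once rows do too.
        gap = max(
            abs(sum(w for key, w in cells.items() if key[0] == row) - target)
            for row, target in row_targets.items()
        )
        if gap < tol:
            break
    return cells

# Hypothetical sample counts and population targets, for illustration only.
observed = {
    ("18-44", "F"): 30.0, ("18-44", "M"): 20.0,
    ("45+", "F"): 25.0, ("45+", "M"): 25.0,
}
raked = rake(observed, {"18-44": 60.0, "45+": 40.0}, {"F": 52.0, "M": 48.0})
print(raked)
```

Both sets of targets must sum to the same total for the procedure to settle; here they each sum to 100.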
Data for the regional estimates are collected using a multi-mode address-based (ABS) approach that allows residents of each area to complete the interview via web or with an NORC telephone interviewer. All sampled households are mailed a postcard inviting them to complete the survey either online using a unique PIN or via telephone by calling a toll-free number. Interviews are conducted with adults age 18 and over with a target of achieving 400 interviews in each region in each survey. Additional details on the survey methodology and the survey questionnaire are attached below or can be found at https://www.covid-impact.org.
Results should be credited to the COVID Impact Survey, conducted by NORC at the University of Chicago for the Data Foundation.
To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.