The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
US Social Security applications are a great way to track trends in how babies born in the US are named.
Data.gov releases two datasets that are helplful for this: one at the national level and another at the state level. Note that only names with at least 5 babies born in the same year (/ state) are included in this dataset for privacy.
I've taken the raw files here and combined/normalized them into two CSV files (one for each dataset) as well as a SQLite database with two equivalently-defined tables. The code that did these transformations is available here.
New to data exploration in R? Take the free, interactive DataCamp course, "Data Exploration With Kaggle Scripts," to learn the basics of visualizing data with ggplot. You'll also create your first Kaggle Scripts along the way.
The Bureau of the Census has released Census 2000 Summary File 1 (SF1) 100-Percent data. The file includes the following population items: sex, age, race, Hispanic or Latino origin, household relationship, and household and family characteristics. Housing items include occupancy status and tenure (whether the unit is owner or renter occupied). SF1 does not include information on incomes, poverty status, overcrowded housing or age of housing. These topics will be covered in Summary File 3. Data are available for states, counties, county subdivisions, places, census tracts, block groups, and, where applicable, American Indian and Alaskan Native Areas and Hawaiian Home Lands. The SF1 data are available on the Bureau's web site and may be retrieved from American FactFinder as tables, lists, or maps. Users may also download a set of compressed ASCII files for each state via the Bureau's FTP server. There are over 8000 data items available for each geographic area. The full listing of these data items is available here as a downloadable compressed data base file named TABLES.ZIP. The uncompressed is in FoxPro data base file (dbf) format and may be imported to ACCESS, EXCEL, and other software formats. While all of this information is useful, the Office of Community Planning and Development has downloaded selected information for all states and areas and is making this information available on the CPD web pages. The tables and data items selected are those items used in the CDBG and HOME allocation formulas plus topics most pertinent to the Comprehensive Housing Affordability Strategy (CHAS), the Consolidated Plan, and similar overall economic and community development plans. The information is contained in five compressed (zipped) dbf tables for each state. When uncompressed the tables are ready for use with FoxPro and they can be imported into ACCESS, EXCEL, and other spreadsheet, GIS and database software. The data are at the block group summary level. The first two characters of the file name are the state abbreviation. The next two letters are BG for block group. Each record is labeled with the code and name of the city and county in which it is located so that the data can be summarized to higher-level geography. The last part of the file name describes the contents . The GEO file contains standard Census Bureau geographic identifiers for each block group, such as the metropolitan area code and congressional district code. The only data included in this table is total population and total housing units. POP1 and POP2 contain selected population variables and selected housing items are in the HU file. The MA05 table data is only for use by State CDBG grantees for the reporting of the racial composition of beneficiaries of Area Benefit activities. The complete package for a state consists of the dictionary file named TABLES, and the five data files for the state. The logical record number (LOGRECNO) links the records across tables.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This is a very small but useful dataset if you are ever looking to get jobs for a certain US city in LinkedIn. It contains a list of US cities and states and it's corresponding LinkedIn ID (which is usually externally hidden).
The cities list was retreived from here: https://github.com/kelvins/US-Cities-Database and the names of the ciiadjusted to match the name used in LinkedIn (which could differ in subtle ways).
Some cities do not have an ID, this is because the city is either too small or because there was a difference in the name on LinkedIn which I did not detect (human error). If you ever run in to one of these feel free to enhance this dataset.
NOTE: A more current version of the Protected Areas Database of the United States (PAD-US) is available: PAD-US 3.0 https://doi.org/10.5066/P9Q9LQ4B. The USGS Protected Areas Database of the United States (PAD-US) is the nation's inventory of protected areas, including public land and voluntarily provided private protected areas, identified as an A-16 National Geospatial Data Asset in the Cadastre Theme (https://communities.geoplatform.gov/ngda-cadastre/). The PAD-US is an ongoing project with several published versions of a spatial database including areas dedicated to the preservation of biological diversity, and other natural (including extraction), recreational, or cultural uses, managed for these purposes through legal or other effective means. The database was originally designed to support biodiversity assessments; however, its scope expanded in recent years to include all public and nonprofit lands and waters. Most are public lands owned in fee (the owner of the property has full and irrevocable ownership of the land); however, long-term easements, leases, agreements, Congressional (e.g. 'Wilderness Area'), Executive (e.g. 'National Monument'), and administrative designations (e.g. 'Area of Critical Environmental Concern') documented in agency management plans are also included. The PAD-US strives to be a complete inventory of public land and other protected areas, compiling “best available” data provided by managing agencies and organizations. The PAD-US geodatabase maps and describes areas using over twenty-five attributes and five feature classes representing the U.S. protected areas network in separate feature classes: Fee (ownership parcels), Designation, Easement, Marine, Proclamation and Other Planning Boundaries. Five additional feature classes include various combinations of the primary layers (for example, Combined_Fee_Easement) to support data management, queries, web mapping services, and analyses. This PAD-US Version 2.1 dataset includes a variety of updates and new data from the previous Version 2.0 dataset (USGS, 2018 https://doi.org/10.5066/P955KPLE ), achieving the primary goal to "Complete the PAD-US Inventory by 2020" (https://www.usgs.gov/core-science-systems/science-analytics-and-synthesis/gap/science/pad-us-vision) by addressing known data gaps with newly available data. The following list summarizes the integration of "best available" spatial data to ensure public lands and other protected areas from all jurisdictions are represented in PAD-US, along with continued improvements and regular maintenance of the federal theme. Completing the PAD-US Inventory: 1) Integration of over 75,000 city parks in all 50 States (and the District of Columbia) from The Trust for Public Land's (TPL) ParkServe data development initiative (https://parkserve.tpl.org/) added nearly 2.7 million acres of protected area and significantly reduced the primary known data gap in previous PAD-US versions (local government lands). 2) First-time integration of the Census American Indian/Alaskan Native Areas (AIA) dataset (https://www2.census.gov/geo/tiger/TIGER2019/AIANNH) representing the boundaries for federally recognized American Indian reservations and off-reservation trust lands across the nation (as of January 1, 2020, as reported by the federally recognized tribal governments through the Census Bureau's Boundary and Annexation Survey) addressed another major PAD-US data gap. 3) Aggregation of nearly 5,000 protected areas owned by local land trusts in 13 states, aggregated by Ducks Unlimited through data calls for easements to update the National Conservation Easement Database (https://www.conservationeasement.us/), increased PAD-US protected areas by over 350,000 acres. Maintaining regular Federal updates: 1) Major update of the Federal estate (fee ownership parcels, easement interest, and management designations), including authoritative data from 8 agencies: Bureau of Land Management (BLM), U.S. Census Bureau (Census), Department of Defense (DOD), U.S. Fish and Wildlife Service (FWS), National Park Service (NPS), Natural Resources Conservation Service (NRCS), U.S. Forest Service (USFS), National Oceanic and Atmospheric Administration (NOAA). The federal theme in PAD-US is developed in close collaboration with the Federal Geographic Data Committee (FGDC) Federal Lands Working Group (FLWG, https://communities.geoplatform.gov/ngda-govunits/federal-lands-workgroup/); 2) Complete National Marine Protected Areas (MPA) update: from the National Oceanic and Atmospheric Administration (NOAA) MPA Inventory, including conservation measure ('GAP Status Code', 'IUCN Category') review by NOAA; Other changes: 1) PAD-US field name change - The "Public Access" field name changed from 'Access' to 'Pub_Access' to avoid unintended scripting errors associated with the script command 'access'. 2) Additional field - The "Feature Class" (FeatClass) field was added to all layers within PAD-US 2.1 (only included in the "Combined" layers of PAD-US 2.0 to describe which feature class data originated from). 3) Categorical GAP Status Code default changes - National Monuments are categorically assigned GAP Status Code = 2 (previously GAP 3), in the absence of other information, to better represent biodiversity protection restrictions associated with the designation. The Bureau of Land Management Areas of Environmental Concern (ACECs) are categorically assigned GAP Status Code = 3 (previously GAP 2) as the areas are administratively protected, not permanent. More information is available upon request. 4) Agency Name (FWS) geodatabase domain description changed to U.S. Fish and Wildlife Service (previously U.S. Fish & Wildlife Service). 5) Select areas in the provisional PAD-US 2.1 Proclamation feature class were removed following a consultation with the data-steward (Census Bureau). Tribal designated statistical areas are purely a geographic area for providing Census statistics with no land base. Most affected areas are relatively small; however, 4,341,120 acres and 37 records were removed in total. Contact Mason Croft (masoncroft@boisestate) for more information about how to identify these records. For more information regarding the PAD-US dataset please visit, https://usgs.gov/gapanalysis/PAD-US/. For more information about data aggregation please review the Online PAD-US Data Manual available at https://www.usgs.gov/core-science-systems/science-analytics-and-synthesis/gap/pad-us-data-manual .
This is an updated version of the national baby names database, containing records from 1880-2017.
https://en.wikipedia.org/wiki/Public_domainhttps://en.wikipedia.org/wiki/Public_domain
This dataset is part of the Geographical repository maintained by Opendatasoft. This dataset contains data for counties and equivalent entities in United States of America. The primary legal divisions of most states are termed counties. In Louisiana, these divisions are known as parishes. In Alaska, which has no counties, the equivalent entities are the organized boroughs, city and boroughs, municipalities, and for the unorganized area, census areas. The latter are delineated cooperatively for statistical purposes by the State of Alaska and the Census Bureau. In four states (Maryland, Missouri, Nevada, and Virginia), there are one or more incorporated places that are independent of any county organization and thus constitute primary divisions of their states. These incorporated places are known as independent cities and are treated as equivalent entities for purposes of data presentation. The District of Columbia and Guam have no primary divisions, and each area is considered an equivalent entity for purposes of data presentation. The Census Bureau treats the following entities as equivalents of counties for purposes of data presentation: Municipios in Puerto Rico, Districts and Islands in American Samoa, Municipalities in the Commonwealth of the Northern Mariana Islands, and Islands in the U.S. Virgin Islands. The entire area of the United States, Puerto Rico, and the Island Areas is covered by counties or equivalent entities.Processors and tools are using this data. Enhancements Add ISO 3166-3 codes. Simplify geometries to provide better performance across the services. Add administrative hierarchy.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A text database of named places in the United States. The US Board on Geographic Names controls the database of official names of places in the US, and the US Geological Survey (USGS) maintains the database. This is a copy of the 2017-06-01 database, which I am using to create an R package for textual analyses of geographic content, to ensure this version remains. The original source was from: https://geonames.usgs.gov/domestic/download_data.htm (which is a very slow server).I have chosen CC0 for the license because, as a creation of the US government, I don't think the database can be copyrighted (and CC0 is the closest match).
https://leadsdeposit.com/restaurant-database/https://leadsdeposit.com/restaurant-database/
Dataset of 700,000 restaurants in the United States complete with detailed contact and geolocation data. The database includes multiple data points such as restaurant name, address, phone number, website, email, opening hours, latitude, longitude, and cuisine.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
List of each U.S. state Code, Name, Abbreviation and Alpha code, useful to convert one to the other.
The Alesco Phone ID Database data ties together a consumer's true identity, and with linkage to the Alesco Power Identity Graph, we are perfectly positioned to help customers solve today's most challenging marketing, analytics, and identity resolution problems.
Our proprietary Phone ID database combines public and private sources and validates phone numbers against current and historical data 24 hours a day, 365 days a year.
With over 650 million unique phone numbers, device and service information, our one-of-a-kind solutions are now available for your marketing and identity resolution challenges in both B2C and B2B applications!
• Alesco Phone ID provides more than 860 million phone numbers monthly linked to a consumer or business name and includes landline, mobile phone number, VoIP, private and business phone numbers — all permissibly obtained and privacy-compliant and linked to other Alesco data sets
• How we do it: Alesco Phone ID is multi-sourced with daily information and delivered monthly or quarterly to clients. Our proprietary machine learning and advanced analytics processes ensure quality levels far above industry standards. Alesco processes over 100 million phone signals per day, compiling, normalizing, and standardizing phone information from 37 input sources.
• Accuracy: Each of Alesco’s phone data sources are vetted to ensure they are authoritative, giving you confidence in the accuracy of the information. Every record is validated, verified and processed to ensure the widest, most reliable coverage combined with stunning precision.
Ease of use: Alesco’s Phone ID Database is available as an on-premise phone database license, giving you full control to host and access this powerful resource on-site. Ongoing updates are provided on a monthly basis ensure your data is up to date.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Chinese name components, accompanied by accurate pinyin readings, gender codes, and flags denoting whether name is a given name, surname, or both.
https://en.wikipedia.org/wiki/Public_domainhttps://en.wikipedia.org/wiki/Public_domain
This dataset is part of the Geographical repository maintained by Opendatasoft. This dataset contains data for places and equivalent entities in United States of America.This layer both incorporated places (legal entities) and census designated places or CDPs (statistical entities). An incorporated place is established to provide governmental functions for a concentration of people as opposed to a minor civil division (MCD), which generally is created to provide services or administer an area without regard, necessarily, to population. Places always nest within a state, but may extend across county and county subdivision boundaries. An incorporated place usually is a city, town, village, or borough, but can have other legal descriptions. CDPs are delineated for the decennial census as the statistical counterparts of incorporated places. CDPs are delineated to provide data for settled concentrations of population that are identifiable by name, but are not legally incorporated under the laws of the state in which they are located. The boundaries for CDPs often are defined in partnership with state, local, and/or tribal officials and usually coincide with visible features or the boundary of an adjacent incorporated place or another legal entity. CDP boundaries often change from one decennial census to the next with changes in the settlement pattern and development; a CDP with the same name as in an earlier census does not necessarily have the same boundary. The only population/housing size requirement for CDPs is that they must contain some housing and population. Processors and tools are using this data. Enhancements Add ISO 3166-3 codes. Simplify geometries to provide better performance across the services. Add administrative hierarchy.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
This resource covers four million Japanese names and their romanized variants, and includes gender codes, classification codes, and frequency rankings.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
French First Names from Death Records (1970-2024)
This dataset contains French first names extracted from death records provided by INSEE (French National Institute of Statistics and Economic Studies) covering the period from 1970 to September 2024.
Dataset Description
Data Source
The data is sourced from INSEE's death records database. It includes first names of deceased individuals in France, providing valuable insights into naming patterns across different… See the full description on the dataset page: https://huggingface.co/datasets/eltorio/french_first_names_insee_2024.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A list of USA states and two-letter abbreviations for coding.
URL from idinfo/citation in CSDGM metadata.
This layer shows race and ethnicity data from the 2020 Census Demographic and Housing Characteristics. This is shown by Nation, Consolidated City, Census Designated Place, Incorporated Place boundaries. Each geography layer contains a common set of Census counts based on available attributes from the U.S. Census Bureau. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. To see the full list of attributes available in this service, go to the "Data" tab above, and then choose "Fields" at the top right. Each attribute contains definitions, additional details, and the formula for calculated fields in the field description.Vintage of boundaries and attributes: 2020 Demographic and Housing Characteristics Table(s): P1, H1, H3, P5, P9 Data downloaded from: U.S. Census Bureau’s data.census.gov siteDate the Data was Downloaded: May 25, 2023Geography Levels included: Nation, Consolidated City, Census Designated Place, Incorporated PlaceNational Figures: included in Nation layer The United States Census Bureau Demographic and Housing Characteristics: 2020 Census Results 2020 Census Data Quality Geography & 2020 Census Technical Documentation Data Table Guide: includes the final list of tables, lowest level of geography by table and table shells for the Demographic Profile and Demographic and Housing Characteristics.News & Updates This layer is ready to be used in ArcGIS Pro, ArcGIS Online and its configurable apps, Story Maps, dashboards, Notebooks, Python, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the U.S. Census Bureau when using this data. Data Processing Notes: These 2020 Census boundaries come from the US Census TIGER geodatabases. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For Census tracts and block groups, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract and block group boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2020 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are unchanged and available as attributes within the data table (units are square meters). The layer contains all US states, Washington D.C., and Puerto Rico. Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99). Block groups that fall within the same criteria (Block Group denoted as 0 with no area land) have also been removed.Percentages and derived counts, are calculated values (that can be identified by the "_calc_" stub in the field name). Field alias names were created based on the Table Shells file available from the Data Table Guide for the Demographic Profile and Demographic and Housing Characteristics. Not all lines of all tables listed above are included in this layer. Duplicative counts were dropped. For example, P0030001 was dropped, as it is duplicative of P0010001.To protect the privacy and confidentiality of respondents, their data has been protected using differential privacy techniques by the U.S. Census Bureau.
Official Street Names in the City of Los Angeles created and maintained by the Bureau of Engineering.
USA Census Block Groups (CBG) for Urban Search and Rescue. This layer can be used for search segment planning. Block groups generally contain between 600 and 5,000 people and the boundaries generally follow existing roads and waterways. The field segment_designation is the last 6 digits of the unique identifier and matches the field in the SARCOP Segment layer.Data download date: August 12, 2021Census tables: P1, P2, P3, P4, H1, P5, HeaderDownloaded from: Census FTP siteProcessing Notes:Data was downloaded from the U.S. Census Bureau FTP site, imported into SAS format and joined to the 2020 TIGER boundaries. Boundaries are sourced from the 2020 TIGER/Line Geodatabases. Boundaries have been projected into Web Mercator and each attribute has been given a clear descriptive alias name. No alterations have been made to the vertices of the data.Each attribute maintains it's specified name from Census, but also has a descriptive alias name and long description derived from the technical documentation provided by the Census. For a detailed list of the attributes contained in this layer, view the Data tab and select "Fields". The following alterations have been made to the tabular data:Joined all tables to create one wide attribute table:P1 - RaceP2 - Hispanic or Latino, and not Hispanic or Latino by RaceP3 - Race for the Population 18 Years and OverP4 - Hispanic or Latino, and not Hispanic or Latino by Race for the Population 18 Years and OverH1 - Occupancy Status (Housing)P5 - Group Quarters Population by Group Quarters Type (correctional institutions, juvenile facilities, nursing facilities/skilled nursing, college/university student housing, military quarters, etc.)HeaderAfter joining, dropped fields: FILEID, STUSAB, CHARITER, CIFSN, LOGRECNO, GEOVAR, GEOCOMP, LSADC, and BLOCK.GEOCOMP was renamed to GEOID and moved be the first column in the table, the original GEOID was dropped.Placeholder fields for future legislative districts have been dropped: CD118, CD119, CD120, CD121, SLDU22, SLDU24, SLDU26, SLDU28, SLDL22, SLDL24 SLDL26, SLDL28.P0020001 was dropped, as it is duplicative of P0010001. Similarly, P0040001 was dropped, as it is duplicative of P0030001.In addition to calculated fields, County_Name and State_Name were added.The following calculated fields have been added (see long field descriptions in the Data tab for formulas used): PCT_P0030001: Percent of Population 18 Years and OverPCT_P0020002: Percent Hispanic or LatinoPCT_P0020005: Percent White alone, not Hispanic or LatinoPCT_P0020006: Percent Black or African American alone, not Hispanic or LatinoPCT_P0020007: Percent American Indian and Alaska Native alone, not Hispanic or LatinoPCT_P0020008: Percent Asian alone, Not Hispanic or LatinoPCT_P0020009: Percent Native Hawaiian and Other Pacific Islander alone, not Hispanic or LatinoPCT_P0020010: Percent Some Other Race alone, not Hispanic or LatinoPCT_P0020011: Percent Population of Two or More Races, not Hispanic or LatinoPCT_H0010002: Percent of Housing Units that are OccupiedPCT_H0010003: Percent of Housing Units that are VacantPlease note these percentages might look strange at the individual block group level, since this data has been protected using differential privacy.* *To protect the privacy and confidentiality of respondents, data has been protected using differential privacy techniques by the U.S. Census Bureau. This means that some individual block groups will have values that are inconsistent or improbable. However, when aggregated up, these issues become minimized. The pop-up on this layer uses Arcade to display aggregated values for the surrounding area rather than values for the block group itself.Download Census redistricting data in this layer as a file geodatabase.Additional links:U.S. Census BureauU.S. Census Bureau Decennial CensusAbout the 2020 Census2020 Census2020 Census data qualityDecennial Census P.L. 94-171 Redistricting Data Program
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.