https://creativecommons.org/publicdomain/zero/1.0/
Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States
This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.
All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names
https://cloud.google.com/bigquery/public-data/usa-names
Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @dcp from Unsplash.
What are the most common names?
What are the most common female names?
Are there more female or male names?
Female names by a wide margin?
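The questions above can be explored with a simple tally. The following is a minimal sketch, assuming each record carries a name, a gender code, and a count; the tuple layout and the sample values are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical sample rows: (name, gender, number). Values are made up.
rows = [
    ("Mary", "F", 100),
    ("Anna", "F", 40),
    ("John", "M", 90),
    ("William", "M", 80),
    ("Mary", "F", 60),
    ("John", "M", 50),
]

def top_names(rows, gender=None, n=3):
    """Sum counts per name, optionally restricted to one gender."""
    totals = Counter()
    for name, g, number in rows:
        if gender is None or g == gender:
            totals[name] += number
    return totals.most_common(n)

def distinct_names_by_gender(rows):
    """Count how many distinct names appear for each gender."""
    names = {}
    for name, g, _ in rows:
        names.setdefault(g, set()).add(name)
    return {g: len(s) for g, s in names.items()}

overall = top_names(rows)
female = top_names(rows, gender="F")
distinct = distinct_names_by_gender(rows)
```

On the real data the same aggregation would run over every row rather than this handful of invented ones.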
https://academictorrents.com/nolicensespecified
171 million names (100 million unique). This torrent contains:
- The URL of every searchable Facebook user's profile
- The name of every searchable Facebook user, both unique and by count (perfect for post-processing, datamining, etc.)
- Processed lists, including first names with count, last names with count, potential usernames with count, etc.
- The programs I used to generate everything
So, there you have it: lots of awesome data from Facebook. Now, I just have to find one more problem with Facebook so I can write "Revenge of the Facebook Snatchers" and complete the trilogy. Any suggestions? >:-)
Limitations
So far, I have only indexed the searchable users, not their friends. Getting their friends will be significantly more data to process, and I don't have those capabilities right now. I'd like to tackle that in the future, though, so if anybody has any bandwidth they'd like to donate, all I need is an ssh account and Nmap installed. An additional limitation is that these are on
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 on.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains US baby names from the Social Security Administration dating back to 1879. With over 140 years of data, this is one of the most comprehensive datasets on baby names in the US. The data includes the name, year of birth, sex, and number of babies with that name for each year. This dataset is a great resource for anyone interested in studying baby naming trends over time.
This dataset is a compilation of over 140 years of data from the Social Security Administration. It includes data on baby names, year of birth, and sex. There are also columns for the number of babies with that name born in that year.
This dataset can be used to track changes in baby naming trends over time, or to study how popular names have changed in popularity. It can also be used to study how naming trends differ between sexes, or between different years.
This dataset could be used for a number of things, including:
1. Determining baby name trends over time
2. Finding out what the most popular baby names are in the US
3. Analyzing how baby name popularity has changed over the years
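One way to look at a naming trend is the share of each year's recorded births that went to a given name. The sketch below assumes rows of (year, name, number); the layout and the sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (year, name, number). Values are made up.
rows = [
    (2000, "Ava", 10),
    (2000, "Liam", 30),
    (2001, "Ava", 30),
    (2001, "Liam", 30),
]

def yearly_share(rows, name):
    """Return {year: share of that year's total count held by `name`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, n, number in rows:
        totals[year] += number
        if n == name:
            hits[year] += number
    return {year: hits[year] / totals[year] for year in sorted(totals)}

ava_trend = yearly_share(rows, "Ava")
```

Using a share rather than a raw count keeps years with different numbers of recorded births comparable.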
If you use this dataset in your research, please credit @nickgott, @rflprr and the Social Security Administration via Data.gov
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks, containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention to have mentions linked to the entities of the Entity dataset. The Entities and News are human-labeled, resolving the mentions of the entities.
Methods
Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities with a name similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who passed a preliminary trial task. The only accepted tags are those assigned in agreement by at least 5 annotators and then passed through reconciliation with an experienced reconciliator.
The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).
Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention to have mentions linked to the entities of the Entity dataset. The backlinks were filtered to leave only mentions in good-quality text; each text was cut 1000 characters after the last mention.
Usage Notes
Entities:
File: Namesakes_entities.jsonl
The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity), or "Other" (meaning that the mention is of some other entity, just having the same or similar name). The Entities dataset is a jsonl list, each item is a dictionary with the following keys and values:
- Key 'pagename': page name of the Wikipedia page.
- Key 'pageid': page id of the Wikipedia page.
- Key 'title': title of the Wikipedia page.
- Key 'url': URL of the Wikipedia page.
- Key 'text': the text chunk from the Wikipedia page.
- Key 'entities': list of the mentions in the page text, each entity is represented by a dictionary with the keys:
  - Key 'text': the mention as a string from the page text.
  - Key 'start': start character position of the entity in the text.
  - Key 'end': end (one-past-last) character position of the entity in the text.
  - Key 'tag': annotation tag given as a string, either 'Same' or 'Other'.
News:
File: Namesakes_news.jsonl
The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list, each item is a dictionary with the following keys and values:
- Key 'id_text': id of the sample.
- Key 'text': the text chunk.
- Key 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
- Key 'entity': a dictionary describing the annotated entity mention in the text:
  - Key 'text': the mention as a string found by an NER model in the text.
  - Key 'start': start character position of the mention in the text.
  - Key 'end': end (one-past-last) character position of the mention in the text.
  - Key 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset. If so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
    - Key 'pageid': Wikipedia page id.
    - Key 'pagetitle': page title.
    - Key 'url': page URL.
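Because 'end' is described as one-past-last, a mention should be recoverable with an ordinary Python slice, text[start:end]. The sketch below checks this on a single hand-made record; the record content is invented, and only the key names follow the description of the Entities file.

```python
import json

# One hypothetical jsonl line shaped like an Entities item (content is made up).
line = json.dumps({
    "pagename": "Example_Person",
    "text": "Example Person met another Example Person in 1900.",
    "entities": [
        {"text": "Example Person", "start": 0, "end": 14, "tag": "Same"},
        {"text": "Example Person", "start": 27, "end": 41, "tag": "Other"},
    ],
})

def check_spans(record):
    """Return (mention, tag, ok) where ok says the offsets reproduce the mention."""
    text = record["text"]
    results = []
    for ent in record["entities"]:
        sliced = text[ent["start"]:ent["end"]]  # half-open interval
        results.append((ent["text"], ent["tag"], sliced == ent["text"]))
    return results

record = json.loads(line)
span_report = check_spans(record)
```

For the real file, the same function could be applied to each line of Namesakes_entities.jsonl after json.loads.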
Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.
Each mention is identified by surrounded double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
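The bracket convention described above can be parsed with a regular expression. This is a minimal sketch, assuming mentions never nest and that the part left of '|' is the exact entity name; the function name extract_mentions is introduced here for illustration.

```python
import re

# [[Entity]] or [[Entity | mention string]]; brackets are assumed not to nest.
MENTION = re.compile(r"\[\[([^\[\]|]+?)(?:\s*\|\s*([^\[\]]+?))?\]\]")

def extract_mentions(content):
    """Return (entity_name, mention_string) pairs in order of appearance."""
    pairs = []
    for m in MENTION.finditer(content):
        entity = m.group(1).strip()
        # When no '|' part is present, the mention equals the entity name.
        mention = (m.group(2) or entity).strip()
        pairs.append((entity, mention))
    return pairs

simple = extract_mentions("Muir built a small cabin along [[Yosemite Creek]].")
piped = extract_mentions(
    "Muir also spent time with photographer "
    "[[Carleton E. Watkins | Carleton Watkins]] and studied his photographs."
)
```

The two inputs are the examples quoted in the description above.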
The Entity-to-Backlinks is a jsonl with 1527 items.
File: Namesakes_backlinks_entities.jsonl
Each item is a tuple:
- Entity name.
- Entity Wikipedia page id.
- Backlinks ids: a list of pageids of backlink documents.
The Backlinks documents is a jsonl with 26903 items.
File: Namesakes_backlinks_texts.jsonl
Each item is a dictionary:
- Key 'pageid': id of the Wikipedia page.
- Key 'title': title of the Wikipedia page.
- Key 'content': text chunk from the Wikipedia page, with all mentions in the double brackets; the text is cut 1000 characters after the last mention, the cut is denoted as '...[CUT]'.
- Key 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple:
  - Entity name.
  - Entity Wikipedia page id.
  - Sorted list of all character indexes at which the mention occurrences start in the text.
We provide datasets that estimate the racial distributions associated with first, middle, and last names in the United States. The datasets cover five racial categories: White, Black, Hispanic, Asian, and Other. The provided data are computed from the voter files of six Southern states (Alabama, Florida, Georgia, Louisiana, North Carolina, and South Carolina) that collect race and ethnicity data upon registration. We include seven voter files per state, sourced between 2018 and 2021 from L2, Inc. Together, these states have approximately 36MM individuals who provide self-reported race and ethnicity. The last name dataset includes 338K surnames, the middle name dataset contains 126K middle names, and the first name dataset includes 136K first names. For each type of name, we provide a dataset of P(race | name) probabilities and P(name | race) probabilities. We include only names that appear at least 25 times across the 42 (= 7 voter files * 6 states) voter files in our dataset. These data are closely related to the dataset "Name Dictionaries for "wru" R Package", https://doi.org/10.7910/DVN/7TRYAC. These are the probabilities used in the latest iteration of the "WRU" package (Khanna et al., 2022) to make probabilistic predictions about the race of individuals, given their names and geolocations.
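To make the two conditional tables concrete, the sketch below derives P(race | name) and P(name | race) from a small table of counts by plain normalization. The names and counts are invented, and the published dictionaries may have been estimated with additional steps that are not described here.

```python
# Hypothetical counts of people by name and race (values are made up).
counts = {
    "name_a": {"White": 60, "Black": 30, "Hispanic": 5, "Asian": 3, "Other": 2},
    "name_b": {"White": 20, "Black": 10, "Hispanic": 55, "Asian": 10, "Other": 5},
}

def p_race_given_name(counts):
    """Normalize each name's row so its race shares sum to 1."""
    out = {}
    for name, by_race in counts.items():
        total = sum(by_race.values())
        out[name] = {race: c / total for race, c in by_race.items()}
    return out

def p_name_given_race(counts):
    """Normalize each race's column so its name shares sum to 1."""
    race_totals = {}
    for by_race in counts.values():
        for race, c in by_race.items():
            race_totals[race] = race_totals.get(race, 0) + c
    return {
        name: {race: c / race_totals[race] for race, c in by_race.items()}
        for name, by_race in counts.items()
    }

race_given_name = p_race_given_name(counts)
name_given_race = p_name_given_race(counts)
```

The two tables answer different questions: the first conditions on a name, the second on a racial category, so their entries generally differ for the same (name, race) pair.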
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains statistics on names (first names of women, first names of men, and last names) by country of birth. In total, there are 231,505 names by 202 countries. The data comes from Statistics Sweden's population statistics (name register) and refers to persons registered in Sweden on December 31st, 2020. However, some names are excluded due to confidentiality, such as names with fewer than five carriers. The data is licensed with Creative Commons Attribution 4.0 International (CC BY 4.0) and may be used as long as Statistics Sweden is stated as the source. In this dataset, you will also find (in addition to the original data from Statistics Sweden) tidied data where the ISO code for each country has been added, as well as data in so-called wide format and long format to facilitate easier data processing.
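The difference between long and wide format can be shown on a few rows. In the sketch below, the field names name, country, and count, along with the sample values, are assumptions for illustration; the actual column headers are documented in the README.

```python
# Long format: one row per (name, country) pair. Values are made up.
long_rows = [
    {"name": "Anna", "country": "FI", "count": 12},
    {"name": "Anna", "country": "NO", "count": 7},
    {"name": "Erik", "country": "FI", "count": 5},
]

def long_to_wide(rows):
    """Wide format: one entry per name, one column per country."""
    wide = {}
    for r in rows:
        wide.setdefault(r["name"], {})[r["country"]] = r["count"]
    return wide

def wide_to_long(wide):
    """Inverse transformation back to one row per (name, country)."""
    return [
        {"name": name, "country": country, "count": count}
        for name, by_country in wide.items()
        for country, count in by_country.items()
    ]

wide = long_to_wide(long_rows)
round_trip = wide_to_long(wide)
```

Long format tends to suit grouping and filtering, while wide format is convenient for side-by-side comparison of countries.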
Please see the Swedish version of the post and the README file for more information about the data.
By Derek Howard [source]
This dataset provides an essential tool for generating gender-specific datasets from names alone. It contains information on the probability of a person's name belonging to a certain gender, based off of US Social Security records from the last century. This makes it easy to assign genders to datasets that do not natively include this data. All probability values were culled from records with 5 or more people associated with each name - so those individuals with less common monikers can still have their genders correctly predicted! With this resource, users can generate gender-aware data in no time, making gender identification in data sets more accurate and easier than ever
This dataset provides a helpful resource when you need to accurately identify gender from names. With this dataset, you’ll be able to quickly and accurately assign genders to datasets that contain names but no other information about the person.
To get started, you will need a csv file with two columns: name and probability. The name column should contain the first names of the people in your dataset. The probability column should contain numbers between 0 and 1 indicating the likelihood that each name is associated with one specific gender (0 for male, 1 for female).
In addition to simply assigning genders from these probabilities alone, users of this dataset also have more control over their classifications - they can use it as either a baseline or as an absolute measure of accuracy depending on their exact needs/preferences. Experimentation is highly encouraged here!
Good luck!
Create gender-specific applications - tailor different apps to different genders based on the probability of a particular name belonging to a certain gender.
Generate gender neutral names - use this data to generate random names with no gender bias.
Automate record lookup - quickly and accurately assign genders based on the probability associated with their name
If you use this dataset in your research, please credit the original authors.
License
Unknown License - Please check the dataset description for more information.
File: name_gender.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------|
| name        | The name of the person. (String) |
| gender      | The gender of the person. (String) |
| probability | The probability of the gender being assigned to the person. (Float) |
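A lookup based on the three columns above might look like the following minimal sketch. It assumes that probability refers to the gender listed on the same row; the threshold of 0.9, the gender codes, and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical excerpt in the shape of name_gender.csv (values are made up).
SAMPLE = "name,gender,probability\nAlex,M,0.62\nMaria,F,0.99\n"

def load_lookup(fh):
    """Map lowercased name -> (gender, probability)."""
    return {
        row["name"].lower(): (row["gender"], float(row["probability"]))
        for row in csv.DictReader(fh)
    }

def assign_gender(name, lookup, threshold=0.9):
    """Return the listed gender only when its probability clears the threshold."""
    entry = lookup.get(name.lower())
    if entry is None:
        return "unknown"
    gender, probability = entry
    return gender if probability >= threshold else "unknown"

lookup = load_lookup(io.StringIO(SAMPLE))
```

Raising the threshold trades coverage for confidence: more names fall back to "unknown", but fewer assignments rest on an ambiguous name.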
If you use this dataset in your research, please credit Derek Howard.
https://creativecommons.org/publicdomain/zero/1.0/
This data contains popular baby names in New York.
Dataset: 1 file (popular-baby-names.csv)
Columns:
- Year of Birth: Year of the baby's birth.
- Gender: Gender of the baby.
- Ethnicity: Ethnicity the baby belongs to.
- Child's First Name: The first name of the child.
- Count: How many babies were given that name.
- Ranking: Ranking of that name.
This dataset contains ranks and counts for the top 25 baby names by sex for live births that occurred in California (by occurrence) based on information entered on birth certificates.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 2 rows and is filtered where the book is Investing in People : The Economics of Population Quality. It features 7 columns including author, publication date, language, and book publisher.
By Andy Kriebel [source]
The file contains data on births in the United States from 1994 to 2014. The data includes the following columns:
- year: The year of the observation. (Integer)
- month: The month of the observation. (Integer)
- date_of_month: The date of the observation. (Integer)
- day_of_week: The day of the week of the observation. (Integer)
- births: The number of births on the given day. (Integer)
The US Births dataset on Kaggle contains data on births in the United States from 1994 to 2014. The data is broken down by year, month, date of month, day of week, and births.
This dataset can be used to answer questions about when people are born, how common certain birthdays are, and any trends over time. For example, you could use this dataset to find out which day of the week has the most births or which month has the most births
- Determining which days of the year and days of the week have the most births, to help with staffing levels in maternity wards
- Identifying trends in the number of births over time
- Predicting the number of births on a given day
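The day-of-week question mentioned above reduces to a grouped average. The sketch below uses the listed column names with a few invented rows; how the day_of_week integers map to weekday names is not stated here, so the code reports the integer code only.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt using the listed columns (birth counts are made up).
SAMPLE = """year,month,date_of_month,day_of_week,births
1994,1,1,6,8000
1994,1,2,7,7800
1994,1,3,1,10000
1994,1,8,6,8200
"""

def mean_births_by_weekday(fh):
    """Average the daily birth count for each day_of_week code."""
    total = defaultdict(int)
    days = defaultdict(int)
    for row in csv.DictReader(fh):
        code = int(row["day_of_week"])
        total[code] += int(row["births"])
        days[code] += 1
    return {code: total[code] / days[code] for code in sorted(total)}

means = mean_births_by_weekday(io.StringIO(SAMPLE))
busiest_code = max(means, key=means.get)
```

Averaging rather than summing avoids a bias toward weekday codes that happen to occur more often in the covered period.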
This data set is a combined effort of the U.S. National Center for Health Statistics and the U.S. Social Security Administration, provided by FiveThirtyEight. It contains data on births in the United States from 1994 to 2014, with the following columns: year, month, date_of_month, day_of_week, births
Thank you to FiveThirtyEight for providing this dataset!
License
License: Dataset copyright by authors
- You are free to:
  - Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt - remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit - Provide a link to the license, and indicate if changes were made.
  - ShareAlike - You must distribute your contributions under the same license as the original.
  - Keep intact - all notices that refer to this license, including copyright notices.

File: US_births_1994-2014.csv

| Column name   | Description |
|:--------------|:---------------------------------------------|
| year          | Year of the data. (Integer) |
| month         | Month of the data. (Integer) |
| date_of_month | Day of the month of the data. (Integer) |
| day_of_week   | Day of the week of the data. (Integer) |
| births        | Number of births on the given day. (Integer) |
If you use this dataset in your research, please credit Andy Kriebel.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about politicians. It has 53 rows and is filtered where the political party is People's National Congress (Papua New Guinea). It features 7 columns including birth date, death date, country, and gender.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of male and female baby names in South Australia from 1944 to 2024. The annual data for baby names is published January/February each year.
The Bureau of the Census has released Census 2000 Summary File 1 (SF1) 100-Percent data. The file includes the following population items: sex, age, race, Hispanic or Latino origin, household relationship, and household and family characteristics. Housing items include occupancy status and tenure (whether the unit is owner or renter occupied). SF1 does not include information on incomes, poverty status, overcrowded housing or age of housing. These topics will be covered in Summary File 3. Data are available for states, counties, county subdivisions, places, census tracts, block groups, and, where applicable, American Indian and Alaskan Native Areas and Hawaiian Home Lands. The SF1 data are available on the Bureau's web site and may be retrieved from American FactFinder as tables, lists, or maps. Users may also download a set of compressed ASCII files for each state via the Bureau's FTP server. There are over 8000 data items available for each geographic area. The full listing of these data items is available here as a downloadable compressed database file named TABLES.ZIP. The uncompressed file is in FoxPro database file (dbf) format and may be imported to ACCESS, EXCEL, and other software formats.
While all of this information is useful, the Office of Community Planning and Development has downloaded selected information for all states and areas and is making this information available on the CPD web pages. The tables and data items selected are those items used in the CDBG and HOME allocation formulas plus topics most pertinent to the Comprehensive Housing Affordability Strategy (CHAS), the Consolidated Plan, and similar overall economic and community development plans. The information is contained in five compressed (zipped) dbf tables for each state. When uncompressed, the tables are ready for use with FoxPro and they can be imported into ACCESS, EXCEL, and other spreadsheet, GIS and database software. The data are at the block group summary level.
The first two characters of the file name are the state abbreviation. The next two letters are BG for block group. Each record is labeled with the code and name of the city and county in which it is located so that the data can be summarized to higher-level geography. The last part of the file name describes the contents. The GEO file contains standard Census Bureau geographic identifiers for each block group, such as the metropolitan area code and congressional district code. The only data included in this table are total population and total housing units. POP1 and POP2 contain selected population variables, and selected housing items are in the HU file. The MA05 table data is only for use by State CDBG grantees for the reporting of the racial composition of beneficiaries of Area Benefit activities. The complete package for a state consists of the dictionary file named TABLES, and the five data files for the state. The logical record number (LOGRECNO) links the records across tables.
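The naming convention and the LOGRECNO link described above can be sketched as follows. The example file name ALBGPOP1.dbf, the function names, and the record fields other than LOGRECNO are hypothetical; the text does not give a literal file name, so the exact spelling is an assumption.

```python
from pathlib import Path

# Table suffixes named in the description.
KNOWN_TABLES = {"GEO", "POP1", "POP2", "HU", "MA05"}

def parse_sf1_name(filename):
    """Split a name like 'ALBGPOP1.dbf' (assumed form) into state, level, table."""
    stem = Path(filename).stem.upper()
    state, level, table = stem[:2], stem[2:4], stem[4:]
    if level != "BG" or table not in KNOWN_TABLES:
        raise ValueError(f"unexpected file name: {filename}")
    return {"state": state, "level": level, "table": table}

def join_on_logrecno(left, right):
    """Merge records from two tables that share the same LOGRECNO."""
    right_by_key = {r["LOGRECNO"]: r for r in right}
    return [
        {**l, **right_by_key[l["LOGRECNO"]]}
        for l in left
        if l["LOGRECNO"] in right_by_key
    ]

parsed = parse_sf1_name("ALBGPOP1.dbf")
# Made-up records; only LOGRECNO is a field named in the description.
geo = [{"LOGRECNO": 1, "total_pop": 100}, {"LOGRECNO": 2, "total_pop": 250}]
hu = [{"LOGRECNO": 2, "occupied": 90}]
joined = join_on_logrecno(geo, hu)
```

Reading the dbf files themselves would need a dbf reader, which is outside this sketch.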
https://digital.nhs.uk/about-nhs-digital/terms-and-conditions
This release presents experimental statistics from the Mental Health Services Data Set (MHSDS), using final submissions for April 2016 and provisional submissions for May 2016. This is the fifth monthly release from the dataset, which replaces the Mental Health and Learning Disabilities Dataset (MHLDDS). As well as analysis of waiting times, first published in March 2016, this release includes elements of the reports that were previously included in monthly reports produced from final MHLDDS submissions. In this publication a new data file has been produced to present the data for people identified as having learning disabilities and/or autistic spectrum disorder (LDA) characteristics. Because of the scope of the changes to the dataset (resulting in the name change to MHSDS and the new name for these monthly reports) it will take time to re-introduce all possible measures that were previously part of the MHLDDS Monthly Reports. Additional measures will be added to this report in the coming months. Further details about these changes and the consultation that informed them were announced in November. From January 2016 the release includes information on people in children and young people's mental health services, including CAMHS, for the first time. Learning disabilities and autism services have been included since September 2014. This release of final data for April 2016 comprises:
- An Executive Summary, which presents national-level analysis across the whole dataset and also for some specific service areas and age groups.
- Data tables about access and waiting times in mental health services, based on provisional data for the period 1 March 2016 to 31 May 2016.
- A monthly data file which presents 92 measures for mental health, learning disability and autism services at National, Provider and Clinical Commissioning Group (CCG) level.
- A Currency and Payments (CAP) data file, containing three measures relating to people assigned to Adult Mental Health Care Clusters. Further measures will be added in future releases.
- A data file containing the measures relating to people with learning disabilities and/or autism.
- Exploratory analysis of the coverage and completeness of access and waiting times statistics for people entering the Early Intervention in Psychosis pathway.
- A set of provider-level data quality measures for both months. The report comprises validity measures for various data items at National and Provider level. From the publication of April data, a coverage report is included showing the number of providers submitting each month and the number of records submitted.
- A metadata file, which provides contextual information for each measure, including a full description, current uses, method used for analysis and some notes on usage.
We will release the reports as experimental statistics until the characteristics of data flowed using the new data standard are understood. A correction was made to this publication on 10 September 2018. This amendment relates to statistics in the monthly CSV data file; the specific measures affected are listed in the "Corrected Measures" CSV. All listed measures have now been corrected. NHS Digital apologises for any inconvenience caused.
The information presented in this data set is based on records of dockets, petitions, tower share requests, and notices of exempt modifications received and processed by the Council. This database is not an exhaustive listing of all wireless telecommunications sites in the state in that it does not include all information about sites not under the jurisdiction of the Siting Council. The dataset includes a row for each Council action on any given facility. Although the Connecticut Siting Council makes every effort to keep this spreadsheet current and accurate, the Council makes no representation or warranty as to the accuracy of the data presented herein. The public is advised that the records upon which the information in this database is based are kept in the Siting Council's offices at Ten Franklin Square, New Britain and are open for public inspection during normal working hours from 8:30 a.m. to 4:30 p.m. Monday through Friday.
Note to Users: Over the years, some of the wireless companies have had several different corporate identities. In the database, they are identified by the name they had at the time of their application to the Siting Council. To help database users follow the name changes, the list below shows the different names by which the companies have been known. Recent mergers in the telecommunications industry have joined companies listed as separate entities. AT&T Wireless merged with Cingular to do business as New Cingular. Sprint and Nextel have merged to form Sprint/Nextel Corporation.
- Cingular: SNET, SCLP, and New Cingular after merger with AT&T
- T-Mobile: Omni (Omnipoint), VoiceStream
- Verizon: BAM, Cellco
- AT&T: AT&T Wireless, New Cingular after merger with Cingular, then Cingular rebranded as AT&T
- Nextel: Smart SMR
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical open data on LAU regions of Slovakia, Czech Republic, Poland, Hungary (and other countries in the future). LAU1 regions are called counties, okres, okresy, powiat, járás, járási, NUTS4, LAU, Local Administrative Units, ... and there are 733 of them in this V4 dataset. Overall, we cover 733 regions which are described by 137.828 observations (panel data rows) and more than 1.760.229 data points.
This LAU dataset contains panel data on population, on the age structure of inhabitants, and on the number and structure of registered unemployed. Dataset prepared by Michal Páleník. Output files are in JSON, shapefile, XLS, ODS, TopoJSON or CSV formats. Downloadable at zenodo.org.
This dataset consists of:
data on unemployment (by gender, education and duration of unemployment),
data on vacancies,
open data on population in Visegrad counties (by age and gender),
data on unemployment share.
Combined latest dataset
dataset of the latest available data on unemployment, vacancies and population
dataset includes map contours (shp, topojson or geojson format), relation id in OpenStreetMap, wikidata entry code,
it also includes NUTS4 code, LAU1 code used by national statistical office and abbreviation of the region (usually license plate),
source of map contours is OpenStreetMap, licensed under ODbL
no time series, only most recent data on population and unemployment combined in one output file
columns: period, lau, name, registered_unemployed, registered_unemployed_females, disponible_unemployed, low_educated, long_term, unemployment_inflow, unemployment_outflow, below_25, over_55, vacancies, pop_period, TOTAL, Y15-64, Y15-64-females, local_lau, osm_id, abbr, wikidata, population_density, area_square_km, way
Slovakia – SK: 79 LAU1 regions, data for 2024-10-01, 1.659 data,
Czech Republic – CZ: 77 LAU1 regions, data for 2024-10-01, 1.617 data,
Poland – PL: 380 LAU1 regions, data for 2024-09-01, 6.840 data,
Hungary – HU: 197 LAU1 regions, data for 2024-10-01, 2.955 data,
13.071 data in total.
Number of observations by column and country:

| column | description | SK | CZ | PL | HU |
|:--|:--|--:|--:|--:|--:|
| period | period (month and year) the data is for | 79 | 77 | 380 | 197 |
| lau | LAU code of the region | 79 | 77 | 380 | 197 |
| name | name of the region in local language | 79 | 77 | 380 | 197 |
| registered_unemployed | number of unemployed registered at labour offices | 79 | 77 | 380 | 197 |
| registered_unemployed_females | number of unemployed women | 79 | 77 | 380 | 197 |
| disponible_unemployed | unemployed able to accept job offer | 79 | 77 | 0 | 0 |
| low_educated | unemployed without secondary school (ISCED 0 and 1) | 79 | 77 | 380 | 197 |
| long_term | unemployed for longer than 1 year | 79 | 77 | 380 | 0 |
| unemployment_inflow | inflow into unemployment | 79 | 77 | 0 | 0 |
| unemployment_outflow | outflow from unemployment | 79 | 77 | 0 | 0 |
| below_25 | number of unemployed below 25 years of age | 79 | 77 | 380 | 197 |
| over_55 | unemployed older than 55 years | 79 | 77 | 380 | 197 |
| vacancies | number of vacancies reported by labour offices | 79 | 77 | 380 | 0 |
| pop_period | date of population data | 79 | 77 | 380 | 197 |
| TOTAL | total population | 79 | 77 | 380 | 197 |
| Y15-64 | number of people between 15 and 64 years of age, population in economically active age | 79 | 77 | 380 | 197 |
| Y15-64-females | number of women between 15 and 64 years of age | 79 | 77 | 380 | 197 |
| local_lau | region's code used by local labour offices | 79 | 77 | 380 | 197 |
| osm_id | relation id in OpenStreetMap database | 79 | 77 | 380 | 197 |
| abbr | abbreviation used for this region | 79 | 77 | 380 | 0 |
| wikidata | wikidata identification code | 79 | 77 | 380 | 197 |
| population_density | population density | 79 | 77 | 380 | 197 |
| area_square_km | area of the region in square kilometres | 79 | 77 | 380 | 197 |
| way | geometry, polygon of given region | 79 | 77 | 380 | 197 |
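The per-country "data" totals quoted for the combined dataset (1.659, 1.617, 6.840, 2.955) can be reproduced from the observation counts if the three identifying columns period, lau, and name are left out of the tally. That exclusion is an inference from the numbers, not something the description states; the sketch below simply checks the arithmetic.

```python
# Regions per country, from the description.
REGIONS = {"SK": 79, "CZ": 77, "PL": 380, "HU": 197}

# Columns with 0 observations per country, read from the table.
EMPTY_COLUMNS = {
    "SK": [],
    "CZ": [],
    "PL": ["disponible_unemployed", "unemployment_inflow", "unemployment_outflow"],
    "HU": [
        "disponible_unemployed", "long_term", "unemployment_inflow",
        "unemployment_outflow", "vacancies", "abbr",
    ],
}

TOTAL_COLUMNS = 24  # columns listed for the combined dataset
KEY_COLUMNS = 3     # period, lau, name (assumed not to count as data points)

def data_points(country):
    """Regions multiplied by the number of populated non-key columns."""
    populated = TOTAL_COLUMNS - KEY_COLUMNS - len(EMPTY_COLUMNS[country])
    return REGIONS[country] * populated

per_country = {c: data_points(c) for c in REGIONS}
grand_total = sum(per_country.values())
```

Under that assumption the result matches the stated total of 13.071 data points.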
Unemployment dataset
time series of unemployment data in Visegrad regions
by gender, duration of unemployment, education level, age groups, vacancies,
columns: period, lau, name, registered_unemployed, registered_unemployed_females, disponible_unemployed, low_educated, long_term, unemployment_inflow, unemployment_outflow, below_25, over_55, vacancies
Slovakia – SK: 79 LAU1 regions, data for 334 periods (1997-01-01 ... 2024-10-01), 202,082 data points
Czech Republic – CZ: 77 LAU1 regions, data for 244 periods (2004-07-01 ... 2024-10-01), 147,528 data points
Poland – PL: 380 LAU1 regions, data for 189 periods (2005-03-01 ... 2024-09-01), 314,100 data points
Hungary – HU: 197 LAU1 regions, data for 106 periods (2016-01-01 ... 2024-10-01), 104,408 data points
768,118 data points in total.
Number of observations per column and country:
column | description | SK | CZ | PL | HU
period | period (month and year) the data is for | 26 386 | 18 788 | 71 772 | 20 882
lau | LAU code of the region | 26 386 | 18 788 | 71 772 | 20 882
name | name of the region in local language | 26 386 | 18 788 | 71 772 | 20 882
registered_unemployed | number of unemployed registered at labour offices | 26 386 | 18 788 | 71 772 | 20 882
registered_unemployed_females | number of unemployed women | 26 386 | 18 788 | 62 676 | 20 882
disponible_unemployed | unemployed able to accept job offer | 25 438 | 18 788 | 0 | 0
low_educated | unemployed without secondary school (ISCED 0 and 1) | 11 771 | 9855 | 41 388 | 20 881
long_term | unemployed for longer than 1 year | 24 253 | 9855 | 41 388 | 0
unemployment_inflow | inflow into unemployment | 26 149 | 16 478 | 0 | 0
unemployment_outflow | outflow from unemployment | 26 149 | 16 478 | 0 | 0
below_25 | number of unemployed below 25 years of age | 11 929 | 9855 | 17 100 | 20 881
over_55 | unemployed older than 55 years | 11 929 | 9855 | 17 100 | 20 882
vacancies | number of vacancies reported by labour offices | 11 692 | 18 788 | 62 676 | 0
Population dataset
time series on population by gender and 5 year age groups in V4 counties
columns: period, lau, name, gender, TOTAL, Y00-04, Y05-09, Y10-14, Y15-19, Y20-24, Y25-29, Y30-34, Y35-39, Y40-44, Y45-49, Y50-54, Y55-59, Y60-64, Y65-69, Y70-74, Y75-79, Y80-84, Y85-89, Y90-94, Y_GE95, Y15-64
Slovakia – SK: 79 LAU1 regions, data for 28 annual periods (1996 ... 2023), 152,628 data points
Czech Republic – CZ: 78 LAU1 regions, data for 24 annual periods (2000 ... 2023), 125,862 data points
Poland – PL: 382 LAU1 regions, data for 29 annual periods (1995 ... 2023), 626,941 data points
Hungary – HU: 197 LAU1 regions, data for 11 annual periods (2013 ... 2023), 86,680 data points
992,111 data points in total.
Number of observations per column and country:
column | description | SK | CZ | PL | HU
period | period (month and year) the data is for | 6636 | 5574 | 32 883 | 4334
lau | LAU code of the region | 6636 | 5574 | 32 883 | 4334
name | name of the region in local language | 6636 | 5574 | 32 883 | 4334
gender | gender (male or female) | 6636 | 5574 | 32 883 | 4334
TOTAL | total population | 6636 | 5574 | 32 503 | 4334
Y00-04 | number of inhabitants aged 0 to 4 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y05-09 | number of inhabitants aged 5 to 9 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y10-14 | number of inhabitants aged 10 to 14 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y15-19 | number of inhabitants aged 15 to 19 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y20-24 | number of inhabitants aged 20 to 24 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y25-29 | number of inhabitants aged 25 to 29 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y30-34 | number of inhabitants aged 30 to 34 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y35-39 | number of inhabitants aged 35 to 39 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y40-44 | number of inhabitants aged 40 to 44 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y45-49 | number of inhabitants aged 45 to 49 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y50-54 | number of inhabitants aged 50 to 54 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y55-59 | number of inhabitants aged 55 to 59 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y60-64 | number of inhabitants aged 60 to 64 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y65-69 | number of inhabitants aged 65 to 69 years, inclusive | 6636 | 5574 | 32 503 | 4334
Y70-74 | number of inhabitants aged 70 to 74 years, inclusive | 6636 | 5574 | 24 670 | 4334
Y75-79 | number of inhabitants aged 75 to 79 years, inclusive | 6636 | 5574 | 24 670 | 4334
Y80-84 | number of inhabitants aged 80 to 84 years, inclusive | 6636 | 5574 | 24 670 | 4334
Y85-89 | number of inhabitants aged 85 to 89 years, inclusive | 6636 | 5574 | 0 | 0
Y90-94 | number of inhabitants aged 90 to 94 years, inclusive | 6636 | 5574 | 0 | 0
Y_GE95 | number of people 95 years or older | 6636 | 3234 | 0 | 0
Y15-64 | number of people between 15 and 64 years of age, population in economically active age | 6636 | 5574 | 32 503 | 4334
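As a rough illustration of how the unemployment and population columns can be combined, the sketch below divides registered_unemployed by Y15-64 for one region. This ratio is an assumed definition of "unemployment share", not one stated in this description; the helper name unemployment_share, the region code and all numbers are made up. Only the column names come from the tables above.

```python
from typing import Optional


def unemployment_share(unemp_row: dict, pop_row: dict) -> Optional[float]:
    """Registered unemployed as a fraction of the population aged 15-64.

    Assumed definition for illustration; returns None when either value
    is missing or the working-age population is zero.
    """
    unemployed = unemp_row.get("registered_unemployed")
    working_age = pop_row.get("Y15-64")
    if unemployed is None or not working_age:
        return None
    return unemployed / working_age


# Made-up rows for a hypothetical region code "XX001".
unemp = {"period": "2024-10-01", "lau": "XX001", "registered_unemployed": 1500}
pop = {"period": "2023", "lau": "XX001", "TOTAL": 50000, "Y15-64": 30000}

share = unemployment_share(unemp, pop)
print(f"{share:.1%}")  # prints 5.0%
```

Because the population series is annual and the unemployment series is monthly, a real join would need a rule for matching each month to a population year; that choice is left open here.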
Notes
more examples at www.iz.sk
NUTS4 / LAU1 / LAU codes for HU and PL are created by me, so they can (and will) change in the future; CZ and SK NUTS4 codes are used by local statistical offices, so they should be more stable
NUTS4 codes are consistent with NUTS3 codes used by Eurostat
local_lau variable is an identifier used by local statistical office
abbr is abbreviation of region's name, used for map purposes (usually cars' license plate code; except for Hungary)
wikidata is code used by wikidata
osm_id is region's relation number in the OpenStreetMap database
Example outputs
you can download data in CSV, xml, ods, xlsx, shp, SQL, postgis, topojson, geojson or json format at 📥 doi:10.5281/zenodo.6165135
Counties of Slovakia – unemployment rate in Slovak LAU1 regions
Regions of the Slovak Republic
Unemployment of Czechia and Slovakia – unemployment share in LAU1 regions of Slovakia and Czechia
interactive map on unemployment in Slovakia
Slovakia – SK, Czech Republic – CZ, Hungary – HU, Poland – PL, NUTS3 regions of Slovakia
download at 📥 doi:10.5281/zenodo.6165135
suggested citation: Páleník, M. (2024). LAU1 dataset [Data set]. IZ Bratislava. https://doi.org/10.5281/zenodo.6165135
WARNING: This is a pre-release dataset and its field names and data structures are subject to change. It should be considered pre-release until the end of 2024. Expected changes:
Metadata is missing or incomplete for some layers at this time and will be continuously improved.
We expect to update this layer roughly in line with CDTFA at some point, but will increase the update cadence over time as we are able to automate the final pieces of the process.
This dataset is continuously updated as the source data from CDTFA is updated, as often as many times a month. If you require unchanging point-in-time data, export a copy for your own use rather than using the service directly in your applications.
Purpose
County and incorporated place (city) boundaries along with third party identifiers used to join in external data. Boundaries are from the authoritative source the California Department of Tax and Fee Administration (CDTFA), altered to show the counties as one polygon. This layer displays the city polygons on top of the County polygons so the area isn't interrupted. The GEOID attribute information is added from the US Census. GEOID is based on merged State and County FIPS codes for the Counties. Abbreviations for Counties and Cities were added from Caltrans Division of Local Assistance (DLA) data. Place Type was populated with information extracted from the Census. Names and IDs from the US Board on Geographic Names (BGN), the authoritative source of place names as published in the Geographic Name Information System (GNIS), are attached as well. Finally, coastal buffers are removed, leaving the land-based portions of jurisdictions. This feature layer is for public use.
Related Layers
This dataset is part of a grouping of many datasets:
Cities: Only the city boundaries and attributes, without any unincorporated areas
  With Coastal Buffers
  Without Coastal Buffers
Counties: Full county boundaries and attributes, including all cities within as a single polygon
  With Coastal Buffers
  Without Coastal Buffers
Cities and Full Counties: A merge of the other two layers, so polygons overlap within city boundaries. Some customers require this behavior, so we provide it as a separate service.
  With Coastal Buffers
  Without Coastal Buffers (this dataset)
Place Abbreviations
Unincorporated Areas (Coming Soon)
Census Designated Places (Coming Soon)
Cartographic Coastline
  Polygon
  Line source (Coming Soon)
Working with Coastal Buffers
The dataset you are currently viewing includes the coastal buffers for cities and counties that have them in the authoritative source data from CDTFA. In the versions where they are included, they remain as a second polygon on cities or counties that have them, with all the same identifiers, and a value in the COASTAL field indicating if it's an ocean or a bay buffer. If you wish to have a single polygon per jurisdiction that includes the coastal buffers, you can run a Dissolve on the version that has the coastal buffers on all the fields except COASTAL, Area_SqMi, Shape_Area, and Shape_Length to get a version with the correct identifiers.
Point of Contact
California Department of Technology, Office of Digital Services, odsdataservices@state.ca.gov
Field and Abbreviation Definitions
COPRI: county number followed by the 3-digit city primary number used in the Board of Equalization's 6-digit tax rate area numbering system
Place Name: CDTFA incorporated (city) or county name
County: CDTFA county name. For counties, this will be the name of the polygon itself. For cities, it is the name of the county the city polygon is within.
Legal Place Name: Board on Geographic Names authorized nomenclature for area names published in the Geographic Name Information System
GNIS_ID: The numeric identifier from the Board on Geographic Names that can be used to join these boundaries to other datasets utilizing this identifier.
GEOID: numeric geographic identifiers from the US Census Bureau
Place Type: Board on Geographic Names authorized nomenclature for boundary type published in the Geographic Name Information System
Place Abbr: CalTrans Division of Local Assistance abbreviations of incorporated area names
CNTY Abbr: CalTrans Division of Local Assistance abbreviations of county names
Area_SqMi: The area of the administrative unit (city or county) in square miles, calculated in EPSG 3310 California Teale Albers.
COASTAL: Indicates if the polygon is a coastal buffer. Null for land polygons. Additional values include "ocean" and "bay".
GlobalID: While all of the layers we provide in this dataset include a GlobalID field with unique values, we do not recommend you make any use of it. The GlobalID field exists to support offline sync, but is not persistent, so data keyed to it will be orphaned at our next update. Use one of the other persistent identifiers, such as GNIS_ID or GEOID instead.
Accuracy
CDTFA's source data notes the following about accuracy:
City boundary changes and county boundary line adjustments filed with the Board of Equalization per Government Code 54900. This GIS layer contains the boundaries of the unincorporated county and incorporated cities within the state of California. The initial dataset was created in March of 2015 and was based on the State Board of Equalization tax rate area boundaries. As of April 1, 2024, the maintenance of this dataset is provided by the California Department of Tax and Fee Administration for the purpose of determining sales and use tax rates. The boundaries are continuously being revised to align with aerial imagery when areas of conflict are discovered between the original boundary provided by the California State Board of Equalization and the boundary made publicly available by local, state, and federal government. Some differences may occur between actual recorded boundaries and the boundaries used for sales and use tax purposes. The boundaries in this map are representations of taxing jurisdictions for the purpose of determining sales and use tax rates and should not be used to determine precise city or county boundary line locations. COUNTY = county name; CITY = city name or unincorporated territory; COPRI = county number followed by the 3-digit city primary number used in the California State Board of Equalization's 6-digit tax rate area numbering system (for the purpose of this map, unincorporated areas are assigned 000 to indicate that the area is not within a city).
Boundary Processing
These data make a structural change from the source data. While the full boundaries provided by CDTFA include coastal buffers of varying sizes, many users need boundaries to end at the shoreline of the ocean or a bay. As a result, after examining existing city and county boundary layers, these datasets provide a coastline cut generally along the ocean facing coastline. For county boundaries in northern California, the cut runs near the Golden Gate Bridge, while for cities, we cut along the bay shoreline and into the edge of the Delta at the boundaries of Solano, Contra Costa, and Sacramento counties.
In the services linked above, the versions that include the coastal buffers contain them as a second (or third) polygon for the city or county, with the value in the COASTAL field set to whether it's a bay or ocean polygon. These can be processed back into a single polygon by dissolving on all the fields you wish to keep, since the attributes, other than the COASTAL field and geometry attributes (like areas) remain the same between the polygons for this purpose.
Slivers
In cases where a city or county's boundary ends near a coastline, our coastline data may cross back and forth many times while roughly paralleling the jurisdiction's boundary, resulting in many polygon slivers. We post-process the data to remove these slivers using a city/county boundary priority algorithm. That is, when the data run parallel to each other, we discard the coastline cut and keep the CDTFA-provided boundary, even if it extends into the ocean a small amount. This processing supports consistent boundaries for Fort Bragg, Point Arena, San Francisco, Pacifica, Half Moon Bay, and Capitola, in addition to others. More information on this algorithm will be provided soon.
Coastline Caveats
Some cities have buffers extending into water bodies that we do not cut at the shoreline. These include South Lake Tahoe and Folsom, which extend into neighboring lakes, and San Diego and surrounding cities that extend into San Diego Bay, which our shoreline encloses. If you have feedback on the exclusion of these items, or others, from the shoreline cuts, please reach out using the contact information above.
Offline Use
This service is fully enabled for sync and export using Esri Field Maps or other similar tools. Importantly, the GlobalID field exists only to support that use case and should not be used for any other purpose (see note in field descriptions).
Updates and Date of Processing
Concurrent with CDTFA updates, approximately every two weeks. Last Processed: 12/17/2024 by Nick Santos using code path at https://github.com/CDT-ODS-DevSecOps/cdt-ods-gis-city-county/ at commit 0bf269d24464c14c9cf4f7dea876aa562984db63. It incorporates updates from CDTFA as of 12/12/2024. Future updates will include improvements to metadata and update frequency.
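The Dissolve step described under Working with Coastal Buffers amounts to grouping polygons that share every attribute except COASTAL, Area_SqMi, Shape_Area, and Shape_Length. The sketch below shows only that grouping logic in plain Python. The record layout, the dissolve helper, and the sample values are hypothetical, and geometries are collected into a list rather than unioned as a real GIS Dissolve tool would do.

```python
from collections import defaultdict

# Fields the description says to leave out of the dissolve key.
EXCLUDED = {"COASTAL", "Area_SqMi", "Shape_Area", "Shape_Length"}


def dissolve(features):
    """Group features that agree on every non-excluded attribute.

    Each feature is a dict with "attributes" (a flat dict of hashable
    values) and "geometry" (any placeholder object). Geometries of a
    group are collected in input order, not merged.
    """
    groups = defaultdict(list)
    for feat in features:
        key = tuple(sorted(
            (name, value)
            for name, value in feat["attributes"].items()
            if name not in EXCLUDED
        ))
        groups[key].append(feat["geometry"])
    return [
        {"attributes": dict(key), "geometries": geoms}
        for key, geoms in groups.items()
    ]


# Made-up records: one city with a land polygon plus an ocean buffer,
# and a second city with a land polygon only.
features = [
    {"attributes": {"Place Name": "Example City", "GNIS_ID": "0000001",
                    "COASTAL": None, "Area_SqMi": 10.0},
     "geometry": "land-polygon"},
    {"attributes": {"Place Name": "Example City", "GNIS_ID": "0000001",
                    "COASTAL": "ocean", "Area_SqMi": 2.5},
     "geometry": "ocean-buffer-polygon"},
    {"attributes": {"Place Name": "Other City", "GNIS_ID": "0000002",
                    "COASTAL": None, "Area_SqMi": 4.0},
     "geometry": "other-land-polygon"},
]

merged = dissolve(features)
print(len(merged))  # prints 2
```

In a GIS tool the grouped geometries would be unioned into one polygon and the area fields recomputed afterwards, which is why those fields are excluded from the key.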
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MCGD_Data_V2.2 contains all the data that we have collected on locations in modern China, plus a number of locations outside of China that we encounter frequently in historical sources on China. All further updates will appear under the name "MCGD_Data" with a time stamp (e.g., MCGD_Data2023-06-21)
You can also access this dataset, and all the other datasets that ENP-China makes available, on GitLab: https://gitlab.com/enpchina/IndexesEnp
Altogether there are 464,970 entries. The data include the name of locations and their variants in Chinese, pinyin, and any recorded transliteration; the name of the province in Chinese and in pinyin; Province ID; the latitude and longitude; the Name ID and Location ID, and NameID_Legacy. The Name IDs all start with H followed by seven digits. This is the internal ID system of MCGD (the NameID_Legacy column records the Name IDs in their original format depending on the source). Location IDs that start with "DH" are data points extracted from China Historical GIS (Harvard University); those that start with "D" are locations extracted from the data points in Geonames; those that consist of digits only (8 digits) are data points we have added from various map sources.
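The ID conventions above can be checked mechanically. The following is a minimal sketch, assuming IDs are stored as plain strings; the function names and the sample IDs are invented for illustration and are not taken from the dataset.

```python
import re


def is_name_id(name_id: str) -> bool:
    """True if the ID is 'H' followed by exactly seven digits."""
    return re.fullmatch(r"H\d{7}", name_id) is not None


def location_id_source(location_id: str) -> str:
    """Classify a Location ID by the prefix rules in the description."""
    # "DH" must be tested before "D", since both start with the same letter.
    if location_id.startswith("DH"):
        return "China Historical GIS"
    if location_id.startswith("D"):
        return "Geonames"
    if re.fullmatch(r"\d{8}", location_id):
        return "MCGD map sources"
    return "unknown"


# Invented sample IDs, for illustration only.
print(is_name_id("H0000123"))           # prints True
print(location_id_source("DH000123"))   # prints China Historical GIS
print(location_id_source("D0001234"))   # prints Geonames
print(location_id_source("12345678"))   # prints MCGD map sources
```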
One of the main features of the MCGD Main Dataset is the systematic collection and compilation of place names from non-Chinese language historical sources. Locations were designated in transliteration systems that are hardly comprehensible today, which makes it very difficult to find the actual locations they correspond to. This dataset allows for the conversion from these obsolete transliterations to the current names and geocoordinates.
From June 2021 onward, we have adopted a different file naming system to keep track of versions. From MCGD_Data_V1 we have moved to MCGD_Data_V2. In June 2022, we introduced time stamps, which result in the following naming convention: MCGD_Data_YYYY.MM.DD.
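A time-stamped file name can be turned into a date with a small parser. The sketch below is an assumption-laden example: it accepts the stated MCGD_Data_YYYY.MM.DD pattern and also tolerates a missing underscore or hyphen separators, since names quoted elsewhere in this description vary slightly. The function name and the .csv extension in the sample are hypothetical.

```python
import re
from datetime import date
from typing import Optional

# Underscore before the date is optional; "." or "-" may separate the parts.
STAMP = re.compile(r"MCGD_Data_?(\d{4})[.-](\d{2})[.-](\d{2})")


def version_date(filename: str) -> Optional[date]:
    """Return the date embedded in a time-stamped MCGD file name, if any."""
    match = STAMP.search(filename)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


print(version_date("MCGD_Data_2022.06.30.csv"))  # prints 2022-06-30
print(version_date("MCGD_Data2023.12.22"))       # prints 2023-12-22
print(version_date("MCGD_Data_V2.2"))            # prints None
```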
UPDATES
MCGD_Data2023.12.22 contains all the data that we have collected on locations in China, whatever the period. Altogether there are 465,603 entries, of which 187 are place names without geocoordinates (labelled "Unknown" in the Lat and Long columns). The dataset also includes locations outside of China, so that such locations can be matched to place names extracted from historical sources. For example, one may need to locate individuals born outside of China. Rather than maintaining two separate files, we decided to incorporate all the place names found in historical sources into the gazetteer. Such place names can easily be removed by selecting all the entries where the 'Province' data is missing.
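The two filters mentioned above (a missing Province for locations outside China, and "Unknown" coordinates) can be sketched as follows. The column headers Province, Lat and Long, the CSV layout, and the sample rows are assumptions for illustration; the actual file may use different headers.

```python
import csv
import io

# Made-up sample in an assumed CSV layout.
SAMPLE = """Name,Province,Lat,Long
Place A,Province X,30.0,120.0
Place B,,48.0,2.0
Place C,Province Y,Unknown,Unknown
"""


def has_province(row: dict) -> bool:
    """True when the Province field is present and not blank."""
    return bool((row.get("Province") or "").strip())


def has_coordinates(row: dict) -> bool:
    """True when neither coordinate is labelled 'Unknown'."""
    return row.get("Lat") != "Unknown" and row.get("Long") != "Unknown"


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
in_china = [r for r in rows if has_province(r)]
mappable = [r for r in in_china if has_coordinates(r)]

print([r["Name"] for r in in_china])  # prints ['Place A', 'Place C']
print([r["Name"] for r in mappable])  # prints ['Place A']
```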