Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset includes all personal names listed in the Wikipedia category “American people by ethnic or national origin” and all subcategories fitting the pattern “American People of [ ] descent”, in total more than 25,000 individuals. Each individual is represented by a row, with columns indicating binary membership (0/1) in each ethnic/national category.
Ethnicity inference is an essential tool for identifying disparities in public health and social sciences. Existing datasets linking personal names to ethnic or national origin often neglect to recognize multi-ethnic or multi-national identities. Furthermore, existing datasets use coarse classification schemes (e.g. classifying both Indian and Japanese people as “Asian”) that may not be suitable for many research questions. This dataset remedies these problems by including both very fine-grained ethnic/national categories (e.g. Afghan-Jewish) and broader ones (e.g. European). Users can choose the categories that are relevant to their research. Since many Americans on Wikipedia are associated with multiple overlapping or distinct ethnicities/nationalities, these multi-ethnic associations are also reflected in the data.
Data were obtained from the Wikipedia API and reviewed manually to remove stage names, pen names, mononyms, first initials (when full names are available on Wikipedia), nicknames, honorific titles, and pages that correspond to a group or event rather than an individual.
This dataset was designed for use in training classification algorithms, but may also be independently interesting inasmuch as it is a representative sample of Americans who are famous enough to have their own Wikipedia page, along with detailed information on their ethnic/national origins.
DISCLAIMER: Due to the incomplete nature of Wikipedia, the data may not reflect all ethnic/national associations for any given individual. For example, there is no guarantee that a given Cuban Jewish person will be listed in both the “American People of Cuban descent” and the “American People of Jewish descent” categories.
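A minimal sketch of working with the row-per-person, 0/1-per-category layout described above. The file name and column headers here are invented placeholders; the source only specifies that each row is an individual and each ethnic/national category is a binary membership column.

```python
import csv
import io

# Hypothetical excerpt of the dataset; column names are assumptions,
# not the actual headers of the published file.
sample = io.StringIO(
    "name,Cuban,Jewish,European\n"
    "Alice Example,1,1,0\n"
    "Bob Example,0,0,1\n"
)

rows = list(csv.DictReader(sample))

# Multi-ethnic associations appear as multiple 1s in a single row.
cuban_jewish = [r["name"] for r in rows
                if r["Cuban"] == "1" and r["Jewish"] == "1"]
print(cuban_jewish)  # ['Alice Example']
```

The same pattern extends to any subset of category columns a researcher selects.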
Popular Baby Names by Sex and Ethnic Group. Data were collected through civil birth registration. Each record represents a baby name's rank by frequency, so the data can be used to gauge a name's popularity. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary from year to year.
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Motivation: to create a challenging dataset for testing named-entity linking. The Namesakes dataset consists of three closely related datasets: Entities, News, and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.
Methods
Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities whose names are similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were presented to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who passed a preliminary trial task. Only tags assigned in agreement by at least 5 annotators, and then passed through reconciliation with an experienced reconciliator, were accepted.
The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can be easily confused with them. In each News text chunk, one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai) after they passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using LightTag (lighttag.io).
Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to leave only mentions in good-quality text; each text was cut 1000 characters after the last mention.
Usage Notes
Entities:
File: Namesakes_entities.jsonl
The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either "Same" (the mention refers to this Wikipedia page's entity) or "Other" (the mention refers to some other entity that merely has the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key ‘pagename’: page name of the Wikipedia page.
Key ‘pageid’: page id of the Wikipedia page.
Key ‘title’: title of the Wikipedia page.
Key ‘url’: URL of the Wikipedia page.
Key ‘text’: the text chunk from the Wikipedia page.
Key ‘entities’: list of the mentions in the page text; each entity is represented by a dictionary with the keys:
Key 'text': the mention as a string from the page text.
Key ‘start’: start character position of the entity in the text.
Key ‘end’: end (one-past-last) character position of the entity in the text.
Key ‘tag’: annotation tag given as a string, either ‘Same’ or ‘Other’.
News:
File: Namesakes_news.jsonl
The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity) or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key ‘id_text’: id of the sample.
Key ‘text’: the text chunk.
Key ‘urls’: list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
Key ‘entity’: a dictionary describing the annotated entity mention in the text:
Key 'text': the mention as a string found by an NER model in the text.
Key ‘start’: start character position of the mention in the text.
Key ‘end’: end (one-past-last) character position of the mention in the text.
Key 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
Key ‘pageid’: Wikipedia page id.
Key ‘pagetitle’: page title.
Key 'url': page URL.
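A minimal sketch of consuming the jsonl schema described above: reading items and recovering each "Same"-tagged mention from the chunk text via its character offsets. The record below is a toy example in the documented shape, not real data.

```python
import json

# Collect the surface strings of mentions tagged "Same" from jsonl lines.
# Field names follow the Entities schema described above.
def same_mentions(jsonl_lines):
    results = []
    for line in jsonl_lines:
        item = json.loads(line)
        for ent in item["entities"]:
            if ent["tag"] == "Same":
                # 'end' is one-past-last, so a plain slice recovers the span.
                results.append(item["text"][ent["start"]:ent["end"]])
    return results

# Invented record in the documented format (not from the dataset).
record = json.dumps({
    "pagename": "Example Page", "pageid": 1, "title": "Example Page",
    "url": "https://en.wikipedia.org/wiki/Example_Page",
    "text": "Example Page was born in 1900.",
    "entities": [{"text": "Example Page", "start": 0, "end": 12, "tag": "Same"}],
})
print(same_mentions([record]))  # ['Example Page']
```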
Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.
Each mention is identified by surrounded double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
The Entity-to-Backlinks is a jsonl with 1527 items.
File: Namesakes_backlinks_entities.jsonl
Each item is a tuple:
Entity name.
Entity Wikipedia page id.
Backlinks ids: a list of pageids of backlink documents.
The Backlinks documents is a jsonl with 26903 items.
File: Namesakes_backlinks_texts.jsonl
Each item is a dictionary:
Key ‘pageid’: id of the Wikipedia page.
Key ‘title’: title of the Wikipedia page.
Key 'content': text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
Key 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple:
Entity name.
Entity Wikipedia page id.
Sorted list of all character indexes at which the mention's occurrences start in the text.
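The double-square-bracket convention described above ([[Name]] for an exact match, [[Name | mention]] when the surface string differs) can be parsed with a short regular expression. This is a sketch under the stated convention, not code shipped with the dataset.

```python
import re

# One group for the exact entity name, an optional second group for a
# '|'-separated surface mention; surrounding whitespace is tolerated.
MENTION = re.compile(r"\[\[([^\]|]+?)(?:\s*\|\s*([^\]]+?))?\s*\]\]")

def extract_mentions(text):
    """Return (entity_name, surface_mention) pairs from a Backlinks chunk."""
    out = []
    for m in MENTION.finditer(text):
        name = m.group(1).strip()
        surface = (m.group(2) or name).strip()
        out.append((name, surface))
    return out

text = ("Muir also spent time with photographer "
        "[[Carleton E. Watkins | Carleton Watkins]] and built a small "
        "cabin along [[Yosemite Creek]].")
print(extract_mentions(text))
# [('Carleton E. Watkins', 'Carleton Watkins'), ('Yosemite Creek', 'Yosemite Creek')]
```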
This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data. All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences. This public dataset is hosted in Google BigQuery and is included in BigQuery's free tier: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
https://creativecommons.org/publicdomain/zero/1.0/
A dataset of first names can be helpful in a variety of applications and analyses. It can be used in demographic studies to analyze naming trends and patterns over time, providing insight into cultural and societal changes. It can also be used in market research and targeted advertising, allowing businesses to personalize their marketing strategies based on customers' names, and in language-processing tasks such as named-entity recognition, sentiment analysis, or gender prediction. Moreover, it can serve as a resource for generating test data, creating fictional characters, or enhancing natural language generation models.
This dataset contains common Indian names; the data can be used for NLP problems such as name generation. An upvote would be appreciated :)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains the datasets created as part of a master's thesis. The collection consists of two datasets in two forms, as well as the corresponding entity descriptions for each dataset.
The experiment_doc_labels_clean documents contain the data used for the experiments. The JSON file consists of a list of JSON objects with the following fields:
id: Document id.
ner_tags: List of IOB tags indicating mention boundaries based on the majority label assigned using crowdsourcing.
el_tags: List of entity ids based on the majority label assigned using crowdsourcing.
all_ner_tags: List of lists of IOB tags assigned by each of the users.
all_el_tags: List of lists of entity IDs assigned by each of the users annotating the data.
tokens: List of tokens from the text.
The experiment_doc_labels_clean-U.tsv file contains the dataset used for the experiments, but in a format similar to the CoNLL-U format. The first line for each document contains the document ID, and documents are separated by a blank line. Each word in a document is on its own line, consisting of the word, the IOB tag, and the entity id, separated by tabs.
While the experiments were being completed, the annotation system was left open until all the documents had been annotated by three users. This resulted in the all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv datasets, which take the same form as the experiment_doc_labels_clean.json and experiment_doc_labels_clean-U.tsv files.
Each of the documents described above contains entity ids. The IDs match the entities stored in the entity_descriptions CSV files. Each row in these files corresponds to a mention of an entity and takes the form:
{ID}${Mention}${Context}
Three sets of entity descriptions are available:
1. entity_descriptions_experiments.csv: This file contains all the mentions from the subset of the data used for the experiments as described above. However, the data has not been cleaned, so there are multiple entity IDs which actually refer to the same entity.
2. entity_descriptions_experiments_clean.csv: These entities also cover the data used for the experiments, but duplicate entities have been merged. These entities correspond to the labels for the documents in the experiment_doc_labels_clean files.
3. entity_descriptions_all.csv: The entities in this file correspond to the data in all_docs_complete_labels_clean. Please note that the entities have not been cleaned, so there may be duplicate or incorrect entities.
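A minimal sketch of parsing one row in the '{ID}${Mention}${Context}' layout described above, splitting on '$' at most twice so the context may itself contain '$'. The example values are invented; the source does not specify the ID scheme or quoting rules.

```python
# Parse a '$'-delimited entity description row into its three fields.
# The layout is taken from the description above; example values are
# hypothetical placeholders.
def parse_entity_row(row):
    entity_id, mention, context = row.split("$", 2)
    return {"id": entity_id, "mention": mention, "context": context}

row = "E17$Douglas Adams$... the author Douglas Adams wrote ..."
print(parse_entity_row(row))
# {'id': 'E17', 'mention': 'Douglas Adams', 'context': '... the author Douglas Adams wrote ...'}
```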
https://creativecommons.org/publicdomain/zero/1.0/
Cultural diversity in the U.S. has led to great variation in names and naming traditions; names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States
This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.
All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names
https://cloud.google.com/bigquery/public-data/usa-names
Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @dcp from Unsplash.
What are the most common names?
What are the most common female names?
Are there more female or male names? (Female names, by a wide margin.)
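The questions above reduce to aggregating counts per name. A minimal sketch using the documented record shape (name, sex, number); the rows below are invented placeholders, and the real data live in the BigQuery table referenced above.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt in the documented (name, sex, number) shape.
sample = io.StringIO(
    "name,sex,number\n"
    "Mary,F,7065\n"
    "John,M,9655\n"
    "Mary,F,6919\n"
)

totals = Counter()   # counts across all records
female = Counter()   # counts for female names only
for row in csv.DictReader(sample):
    n = int(row["number"])
    totals[row["name"]] += n
    if row["sex"] == "F":
        female[row["name"]] += n

print(totals.most_common(1))  # [('Mary', 13984)]
print(female.most_common(1))  # [('Mary', 13984)]
```

On the full dataset the same aggregation would run as a GROUP BY query in BigQuery rather than in-memory.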
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of male and female baby names in South Australia from 1944 to 2024. The annual baby names data are published in January or February each year.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Historic lists of top 100 names for baby boys and girls for 1904 to 2024 at 10-yearly intervals.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Distribution of first and last name frequencies of academic authors by country.
Spreadsheet 1 contains 50 countries, with names based on affiliations in Scopus journal articles 2001-2021.
Spreadsheet 2 contains 200 countries, with names based on affiliations in Scopus journal articles 2001-2021, using a marginally updated last-name extraction algorithm (identical except for its handling of Dutch/Flemish names).
From the paper: Can national researcher mobility be tracked by first or last name uniqueness?
For example, the distribution for the UK shows a single peak for international names with no national names; Belgium has both a national peak and an international peak; and China has mainly a national peak. The 50 countries are:
No Code Country
1 SB Serbia
2 IE Ireland
3 HU Hungary
4 CL Chile
5 CO Colombia
6 NG Nigeria
7 HK Hong Kong
8 AR Argentina
9 SG Singapore
10 NZ New Zealand
11 PK Pakistan
12 TH Thailand
13 UA Ukraine
14 SA Saudi Arabia
15 RO Romania
16 ID Indonesia
17 IL Israel
18 MY Malaysia
19 DK Denmark
20 CZ Czech Republic
21 ZA South Africa
22 AT Austria
23 FI Finland
24 PT Portugal
25 GR Greece
26 NO Norway
27 EG Egypt
28 MX Mexico
29 BE Belgium
30 CH Switzerland
31 SW Sweden
32 PL Poland
33 TW Taiwan
34 NL Netherlands
35 TK Turkey
36 IR Iran
37 RU Russia
38 AU Australia
39 BR Brazil
40 KR South Korea
41 ES Spain
42 CA Canada
43 IT Italy
44 FR France
45 IN India
46 DE Germany
47 US USA
48 UK UK
49 JP Japan
50 CN China
This dataset represents the 5000 most popular last names in the United States. The data are split by race, showing the percentage for each group.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Queensland Top 100 Baby Names
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Rank and count of the top names for baby girls, changes in rank since the previous year and breakdown by country, region, mother's age and month of birth.
https://brightdata.com/license
Use our Facebook Profiles dataset to explore public profile details such as names, profile and cover photos, work history, education, and photo galleries. Common use cases include people and company research, influencer discovery, and academic studies of career and education signals on Facebook.
Over 31M records available.
Price starts at $250/100K records.
Data formats available: JSON, NDJSON, CSV, XLSX, and Parquet.
100% ethical and compliant data collection.
Included datapoints:
Profile URL
Profile Name
Facebook Profile ID
Profile Photo
Cover Photo
Work History (Title, Company, Company ID, Company URL, Start/End Dates)
College Education (Name, ID, URL)
High School Education (Name, ID, URL)
Photo Galleries
And much more
This dataset represents the popular last names in the United States for White Americans.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Rank and count of the top names for baby boys, changes in rank since the previous year and breakdown by country, region, mother's age and month of birth.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open-access database of englacial temperature measurements compiled from data submissions and published literature. It is developed on GitHub and published to Zenodo. The dataset is described in the following publication:
Mylène Jacquemart, Ethan Welty, Marcus Gastaldello, and Guillem Carcanade (2025). glenglat: A database of global englacial temperatures. Earth System Science Data Discussions. https://doi.org/10.5194/essd-2024-249
Dataset structure
The dataset adheres to the Frictionless Data Tabular Data Package specification. The metadata in datapackage.json describes, in detail, the contents of the tabular data files in the data folder:
source.csv: Description of each data source (either a personal communication or the reference to a published study).
borehole.csv: Description of each borehole (location, elevation, etc), linked to source.csv via source_id and less formally via source identifiers in notes.
profile.csv: Description of each profile (date, etc), linked to borehole.csv via borehole_id and to source.csv via source_id and less formally via source identifiers in notes.
measurement.csv: Description of each measurement (depth and temperature), linked to profile.csv via borehole_id and profile_id.
For boreholes with many profiles (e.g. from automated loggers), pairs of profile.csv and measurement.csv are stored separately in subfolders of data named {source.id}-{glacier}, where glacier is a simplified and kebab-cased version of the glacier name (e.g. flowers2022-little-kluane).
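The keys described above (measurement links to a profile via borehole_id and profile_id; profiles and boreholes link via borehole_id) make the tables joinable with plain dictionaries. A minimal sketch with invented placeholder rows, not actual glenglat data:

```python
import csv
import io

# Toy stand-ins for borehole.csv and measurement.csv; the column names
# follow the schema described above, the values are invented.
boreholes = list(csv.DictReader(io.StringIO(
    "id,glacier_name,latitude,longitude,elevation\n"
    "1,Example Glacier,46.0,8.0,3000\n"
)))
measurements = list(csv.DictReader(io.StringIO(
    "borehole_id,profile_id,depth,temperature\n"
    "1,1,10,-2.5\n"
    "1,1,20,-1.8\n"
)))

# Index boreholes by id, then resolve each measurement to its glacier.
by_id = {b["id"]: b for b in boreholes}
for m in measurements:
    glacier = by_id[m["borehole_id"]]["glacier_name"]
    print(glacier, m["depth"], m["temperature"])
```

With the real files, the same join extends naturally through profile.csv to attach dates to each temperature profile.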
Supporting information
The folder sources, available on GitHub but omitted from dataset releases on Zenodo, contains subfolders (with names matching column source.id) with files that document how and from where the data was extracted.
Tables
Jump to: source · borehole · profile · measurement
source
Sources of information considered in the compilation of this database. Column names and categorical values closely follow the Citation Style Language (CSL) 1.0.2 specification. Names of people in non-Latin scripts are followed by a latinization in square brackets (e.g. В. С. Загороднов [V. S. Zagorodnov]) and non-English titles are followed by a translation in square brackets. The family name of Latin-script names is wrapped in curly braces when it is not the last word of the name (e.g. Emmanuel {Le Meur}, e.g. {Duan} Keqin) or the name ends in two or more unabbreviated words (e.g. Jon Ove {Hagen}). The family name of a Chinese name (and of the latinization) is wrapped in curly braces when it is not the first character.
name type description
id (required) string Unique identifier constructed from the first author's lowercase, latinized, family name and the publication year, followed as needed by a lowercase letter to ensure uniqueness (e.g. Загороднов 1981 → zagorodnov1981a).
author string Author names (optionally followed by their ORCID or contact email in parentheses) as a pipe-delimited list.
year (required) year Year of publication.
type (required) string Item type.
- article-journal: Journal article
- book: Book (if the entire book is relevant)
- chapter: Book section
- document: Document not fitting into any other category
- dataset: Collection of data
- map: Geographic map
- paper-conference: Paper published in conference proceedings
- personal-communication: Personal communication between individuals
- speech: Presentation (talk, poster) at a conference
- report: Report distributed by an institution
- thesis-phd: Doctor of Philosophy (PhD) thesis
- thesis-msc: Master of Science (MSc) thesis
- webpage: Website or page on a website
title (required) string Item title.
url string URL (DOI if available).
language (required) string Language as ISO 639-1 two-letter language code.
- da: Danish
- de: German
- en: English
- es: Spanish
- fr: French
- ja: Japanese
- ko: Korean
- ru: Russian
- sv: Swedish
- zh: Chinese
container_title string Title of the container (e.g. journal, book).
volume integer Volume number of the item or container.
issue string Issue number (e.g. 1) or range (e.g. 1-2) of the item or container, with an optional letter prefix (e.g. F1) or part number (e.g. 75pt2).
page string Page number (e.g. 1) or range (e.g. 1-2) of the item in the container, with an optional letter prefix (e.g. S1).
version string Version number (e.g. 1.0) of the item.
editor string Editor names (e.g. of the containing book) as a pipe-delimited list.
collection_title string Title of the collection (e.g. book series).
collection_number string Number (e.g. 1) or range (e.g. 1-2) in the collection (e.g. book series volume).
publisher string Publisher name.
borehole
Metadata about each borehole.
name type description
id (required) integer Unique identifier.
source_id (required) string Identifier of the source of the earliest temperature measurements. This is also the source of the borehole attributes unless otherwise stated in notes.
glacier_name (required) string Glacier or ice cap name (as reported).
glims_id string Global Land Ice Measurements from Space (GLIMS) glacier identifier.
location_origin (required) string Origin of location (latitude, longitude).
- submitted: Provided in data submission
- published: Reported as coordinates in original publication
- digitized: Digitized from published map with complete axes
- estimated: Estimated from published plot by comparing to a map (e.g. Google Maps, CalTopo)
- guessed: Estimated with difficulty, for example by comparing elevation to a map (e.g. Google Maps, CalTopo)
latitude (required) number [degree] Latitude (EPSG 4326).
longitude (required) number [degree] Longitude (EPSG 4326).
elevation_origin (required) string Origin of elevation (elevation).
- submitted: Provided in data submission
- published: Reported as number in original publication
- digitized: Digitized from published plot with complete axes
- estimated: Estimated from elevation contours in published map
- guessed: Estimated with difficulty, for example by comparing location (latitude, longitude) to a map of contemporary elevations (e.g. CalTopo, Google Maps)
elevation (required) number [m] Elevation above sea level.
mass_balance_area string Mass balance area.
- ablation: Ablation area
- equilibrium: Near the equilibrium line
- accumulation: Accumulation area
label string Borehole name (e.g. as labeled on a plot).
date_min date (%Y-%m-%d) Begin date of drilling, or if not known precisely, the first possible date (e.g. 2019 → 2019-01-01).
date_max date (%Y-%m-%d) End date of drilling, or if not known precisely, the last possible date (e.g. 2019 → 2019-12-31).
drill_method string Drilling method.
- mechanical: Push, percussion, rotary
- thermal: Hot point, electrothermal, steam
- combined: Mechanical and thermal
ice_depth number [m] Starting depth of continuous ice. Infinity (INF) indicates that only snow, firn, or intermittent ice was reached.
depth number [m] Total borehole depth (not including drilling in the underlying bed).
to_bed boolean Whether the borehole reached the glacier bed.
temperature_uncertainty number [°C] Estimated temperature uncertainty (as reported).
notes string Additional remarks about the study site, the borehole, or the measurements therein as a pipe-delimited list. Sources are referenced by source.id. Quality concerns are prefixed with '[flag]'.
curator string Names of people who added the data to the database, as a pipe-delimited list.
investigators string Names of people and/or agencies who performed the work, as a pipe-delimited list. Each entry is in the format 'person (agency; ...) {notes}', where only person or one agency is required. Person and agency may contain a latinized form in square brackets.
funding string Funding sources as a pipe-delimited list. Each entry is in the format 'funder [rorid] > award [number] url', where only funder is required and rorid is the funder's ROR (https://ror.org) ID (e.g. 01jtrvx49).
profile
Date and time of each measurement profile.
name type description
borehole_id (required) integer Borehole identifier.
id (required) integer Borehole profile identifier (starting from 1 for each borehole).
source_id (required) string Source identifier.
measurement_origin (required) string Origin of measurements (measurement.depth, measurement.temperature).
- submitted: Provided as numbers in data submission
- published: Numbers read from original publication
- digitized-discrete: Digitized with Plot Digitizer from discrete points of depth versus temperature
- digitized-continuous: Digitized with Plot Digitizer from a continuous data source (e.g. line plot of depth versus temperature)
date_min date (%Y-%m-%d) Measurement date, or if not known precisely, the first possible date (e.g. 2019 → 2019-01-01).
date_max (required) date (%Y-%m-%d) Measurement date, or if not known precisely, the last possible date (e.g. 2019 → 2019-12-31).
time time (%H:%M:%S) Measurement time.
utc_offset number [h] Time offset relative to Coordinated Universal Time (UTC).
equilibrium string Whether and how reported temperatures equilibrated following drilling.
- true: Equilibrium was measured
- estimated: Equilibrium was estimated (typically by extrapolation)
- false: Equilibrium was not reached
notes string Additional remarks about the profile or the measurements therein as a pipe-delimited list. Sources are referenced by source.id. Quality concerns are prefixed with '[flag]'.
measurement
Temperature measurements with depth.
name type description
borehole_id (required) integer Borehole identifier.
profile_id (required) integer Borehole profile identifier.
depth (required) number [m] Depth below the glacier surface.
temperature (required) number [°C] Temperature.
These data comprise Census records relating to Alaskan population demographics for the State of Alaska Salmon and People (SASAP) Project. Decennial census data were originally extracted from the IPUMS National Historical Geographic Information System website: https://data2.nhgis.org/main (Citation: Steven Manson, Jonathan Schroeder, David Van Riper, and Steven Ruggles. IPUMS National Historical Geographic Information System: Version 12.0 [Database]. Minneapolis: University of Minnesota. 2017. http://doi.org/10.18128/D050.V12.0). A number of relevant tables of basic demographics on age and race, household income and poverty levels, and labor force participation were extracted.
These particular variables were selected as part of an effort to understand and potentially quantify various dimensions of well-being in Alaskan communities.
The file "censusdata_master.csv" is a consolidation of all 21 other data files in the package. For detailed information on how the datasets vary over different years, view the file "readme.docx" available in this data package.
The included .Rmd file is a script which combines the 21 files by year into a single file (censusdata_master.csv). It also cleans up place names (including typographical errors) and uses the USGS place names dataset and the SASAP regions dataset to assign latitude, longitude, and region values to each place in the dataset. Note that some places were not assigned a region or location because they do not fit well into the regional framework.
Considerable heterogeneity exists between census surveys each year. While we have attempted to combine these datasets in a way that makes sense, there may be some discrepancies or unexpected values.
Please send a description of any unusual values to the dataset contact.