https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
US Social Security applications are a great way to track trends in how babies born in the US are named.
Data.gov releases two datasets that are helplful for this: one at the national level and another at the state level. Note that only names with at least 5 babies born in the same year (/ state) are included in this dataset for privacy.
I've taken the raw files here and combined/normalized them into two CSV files (one for each dataset) as well as a SQLite database with two equivalently-defined tables. The code that did these transformations is available here.
New to data exploration in R? Take the free, interactive DataCamp course, "Data Exploration With Kaggle Scripts," to learn the basics of visualizing data with ggplot. You'll also create your first Kaggle Scripts along the way.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Amber Thomas [source]
The data is based on a complete sample of records on Social Security card applications as of March 2021 and is presented in three main files: baby-names-national.csv, baby-names-state.csv, and baby-names-territories.csv. These files contain detailed information about names given to babies at the national level (50 states and District of Columbia), state level (individual states), and territory level (including American Samoa, Guam, Northern Mariana Islands Puerto Rico and U.S. Virgin Islands) respectively.
Each entry in the dataset includes several key attributes such as state_abb or territory_code representing the abbreviation or code indicating the specific state or territory where the baby was born. The sex attribute denotes the gender of each baby – either male or female – while year represents the specific birth year when each baby was born.
Another important attribute is name which indicates given name selected for each individual newborn.The count attribute provides numerical data about how many babies received a particular name within a specific state/territory, gender combination for a given year.
It's also worth noting that all names included have at least two characters in length to ensure high data quality standards.
- Understanding the Columns
The dataset consists of multiple columns with specific information about each baby name entry. Here are the key columns in this dataset:
- state_abb: The abbreviation of the state or territory where the baby was born.
- sex: The gender of the baby.
- year: The year in which the baby was born.
- name: The given name of the baby.
count: The number of babies with a specific name born in a certain state, gender, and year.
- Exploring National Data
To analyze national trends or overall popularity across all states and years: a) Focus on baby-names-national.csv. b) Use columns like name, sex, year, and count to study trends over time.
- Analyzing State-Level Data
To examine specific states' data: a) Utilize baby-names-state.csv file. b) Filter data by desired states using state_abb column values. c) Combine analysis with other relevant attributes like gender, year, etc., for detailed insights.
- Understanding Territory Data
For insights into United States territories (American Samoa, Guam, Northern Mariana Islands, Puerto Rico, U.S Virgin Islands): a) Access informative data from baby-names-territories.csv. b) Analyze based on similar principles as state-level data but considering unique territory factors.
- Gender-Specific Analysis
You can study names' popularity specifically among males or females by filtering the data using the sex column. This will allow you to explore gender-specific naming trends and preferences.
- Identifying Regional Patterns
To identify naming patterns in specific regions: a) Analyze state-level or territory-level data. b) Look for variations in name popularity across different states or territories.
- Analyzing Name Popularity over Time
Track the popularity of specific names over time using the name, year, and count columns. This can help uncover trends, fluctuations, and changes in names' usage and popularity.
- Comparing Names and Variations
Use this
- Tracking Popularity Trends: This dataset can be used to analyze the popularity of baby names over time. By examining the count of babies with a specific name born in different years, trends and shifts in naming preferences can be identified.
- Gender Analysis: The dataset includes information on the gender of each baby. It can be used to study gender patterns and differences in naming choices. For example, it would be possible to compare the frequency and popularity of certain names among males and females.
- Regional Variations: With state abbreviations provided, it is possible to explore regional variations in baby naming trends within the United States. Researchers could examine how certain names are more popular or unique to specific states or territories, highlighting cultural or geographical factors that influence naming choices
If you use this dataset in your research, please credit the original a...
This dataset consists of template sentences associating first names ([NAME]) with third-person singular pronouns ([PRONOUN]), e.g., [NAME] asked , not sounding as if [PRONOUN] cared about the answer . after all , [NAME] was the same as [PRONOUN] 'd always been . there were moments when [NAME] was soft , when [PRONOUN] seemed more like the person [PRONOUN] had been .
Usage python genter = load_dataset('aieng-lab/genter', trust_remote_code=True, split=split) split can be either train, val, test, or all.
Dataset Details Dataset Description
This dataset is a filtered version of BookCorpus containing only sentences where a first name is followed by its correct third-person singular pronoun (he/she). Based on these sentences, template sentences (masked) are created including two template keys: [NAME] and [PRONOUN]. Thus, this dataset can be used to generate various sentences with varying names (e.g., from aieng-lab/namexact) and filling in the correct pronoun for this name.
This dataset is a filtered version of BookCorpus that includes only sentences where a first name appears alongside its correct third-person singular pronoun (he/she).
From these sentences, template-based sentences (masked) are created with two template keys: [NAME] and [PRONOUN]. This design allows the dataset to generate diverse sentences by varying the names (e.g., using names from aieng-lab/namexact) and inserting the appropriate pronoun for each name.
Dataset Sources
Repository: github.com/aieng-lab/gradiend Original Data: BookCorpus
NOTE: This dataset is derived from BookCorpus, for which we do not have publication rights. Therefore, this repository only provides indices, names and pronouns referring to GENTER entries within the BookCorpus dataset on Hugging Face. By using load_dataset('aieng-lab/genter', trust_remote_code=True, split='all'), both the indices and the full BookCorpus dataset are downloaded locally. The indices are then used to construct the GENEUTRAL dataset. The initial dataset generation takes a few minutes, but subsequent loads are cached for faster access.
Dataset Structure
text: the original entry of BookCorpus masked: the masked version of text, i.e., with template masks for the name ([NAME]) and the pronoun ([PRONOUN]) label: the gender of the original used name (F for female and M for male) name: the original name in text that is masked in masked as [NAME] pronoun: the original pronoun in text that is masked in masked as PRONOUN pronoun_count: the number of occurrences of pronouns (typically 1, at most 4) index: The index of text in BookCorpus
Examples: index | text | masked | label | name | pronoun | pronoun_count ------|------|--------|-------|------|---------|-------------- 71130173 | jessica asked , not sounding as if she cared about the answer . | [NAME] asked , not sounding as if [PRONOUN] cared about the answer . | M | jessica | she | 1 17316262 | jeremy looked around and there were many people at the campsite ; then he looked down at the small keg . | [NAME] looked around and there were many people at the campsite ; then [PRONOUN] looked down at the small keg . | F | jeremy | he | 1 41606581 | tabitha did n't seem to notice as she swayed to the loud , thrashing music . | [NAME] did n't seem to notice as [PRONOUN] swayed to the loud , thrashing music . | M | tabitha | she | 1 52926749 | gerald could come in now , have a look if he wanted . | [NAME] could come in now , have a look if [PRONOUN] wanted . | F | gerald | he | 1 47875293 | chapter six as time went by , matthew found that he was no longer certain that he cared for journalism . | chapter six as time went by , [NAME] found that [PRONOUN] was no longer certain that [PRONOUN] cared for journalism . | F | matthew | he | 2 73605732 | liam tried to keep a straight face , but he could n't hold back a smile . | [NAME] tried to keep a straight face , but [PRONOUN] could n't hold back a smile . | F | liam | he | 1 31376791 | after all , ella was the same as she 'd always been . | after all , [NAME] was the same as [PRONOUN] 'd always been . | M | ella | she | 1 61942082 | seth shrugs as he hops off the bed and lands on the floor with a thud . | [NAME] shrugs as [PRONOUN] hops off the bed and lands on the floor with a thud . | F | seth | he | 1 68696573 | graham 's eyes meet mine , but i 'm sure there 's no way he remembers what he promised me several hours ago until he stands , stretching . | [NAME] 's eyes meet mine , but i 'm sure there 's no way [PRONOUN] remembers what [PRONOUN] promised me several hours ago until [PRONOUN] stands , stretching . | F | graham | he | 3 28923447 | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held caleb as he died . | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held [NAME] as [PRONOUN] died . | F | caleb | he | 1
Dataset Creation Curation Rationale
For the training of a gender bias GRADIEND model, a diverse dataset associating first names with both, its factual and counterfactual pronoun associations, to assess gender-related gradient information.
Source Data
The dataset is derived from BookCorpus by filtering it and extracting the template structure.
We selected BookCorpus as foundational dataset due to its focus on fictional narratives where characters are often referred to by their first names. In contrast, the English Wikipedia, also commonly used for the training of transformer models, was less suitable for our purposes. For instance, sentences like [NAME] Jackson was a musician, [PRONOUN] was a great singer may be biased towards the name Michael.
Data Collection and Processing
We filter the entries of BookCorpus and include only sentences that meet the following criteria:
Each sentence contains at least 50 characters Exactly one name of aieng-lab/namexact is contained, ensuringa correct name match. No other names from a larger name dataset (aieng-lab/namextend) are included, ensuring that only a single name appears in the sentence. The correct name's gender-specific third-person pronoun (he or she) is included at least once. All occurrences of the pronoun appear after the name in the sentence. The counterfactual pronoun does not appear in the sentence. The sentence excludes gender-specific reflexive pronouns (himself, herself) and possesive pronouns (his, her, him, hers) Gendered nouns (e.g., actor, actress, ...) are excluded, based on a gemdered-word dataset with 2421 entries.
This approach generated a total of 83772 sentences. To further enhance data quality, we employed s imple BERT model (bert-base-uncased) as a judge model. This model must predict the correct pronoun for selected names with high certainty, otherwise, sentences may contain noise or ambiguous terms not caught by the initial filtering. Specifically, we used 50 female and 50 male names from the (aieng-lab/namextend) train split, and a correct prediction means the correct pronoun token is predicted as the token with the highest probability in the induced Masked Language Modeling (MLM) task. Only sentences for which the judge model correctly predicts the pronoun for every test case were retrained, resulting in a total of 27031 sentences.
The data is split into training (87.5%), validation (2.5%) and test (10%) subsets.
Bias, Risks, and Limitations
Due to BookCorpus, only lower-case sentences are contained.
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.
This dataset contains short sentences linking a first name, represented by the template mask [NAME], to stereotypical associations.
Dataset Details Dataset Example
An example of the dataset looks as follows: json { "text": "My friend, [NAME], excels in the field of mechanical engineering.", "gender": "M", "reason": "Profession" }
Dataset Sources
Generated using GPT-4o using the following prompt: ``` Generate a copyable CSV dataset with 500 sentences that demonstrate gender bias using the placeholder [NAME] for a name. The dataset should include four columns: "text," "gender," and "reason,". The "gender" column should contain 'M' for male or 'F' for female. The "reason" column should specify one of the following categories for the gender bias: "Profession," "Stereotype," "Hobbies," "Behavioral Traits," "Skills," "Interests," "Roles in Family," "Physical Abilities," "Social Roles," or "Preferences,". Provide at least 50 sentences for each reason category. Ensure the sentences are diverse and avoid repetitions. Do not use any words that indicate the gender such as "him", "his", "actress", …
The csv file should look like the following: text,gender,reason "[NAME] is an outstanding pilot with thousands of flight hours.",M,Profession "[NAME] is a fantastic hair stylist with many loyal clients.",F,Profession "[NAME] is probably into video games.",M,Interests "[NAME] is likely to be more empathetic.",F,Behavioral Traits ```
As long as the total number of generated entries were below 500, the dataset was iteratively expanded by repeatedly prompting GPT-4o with "More". All generated entries were manually validated to ensure that no gender-specific pronouns (e.g., he, she, his, etc.) were present. Entries containing such pronouns were excluded. The final dataset size was capped at 500 entries.
Uses
The data can be used to asses the gender bias of language models by considering it as a Masked Language Modeling (MLM) task.
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-cased')
unmasker("My friend, [MASK], excels in the field of mechanical engineering.")
[{
'score': 0.013723408803343773,
'token': 1795,
'token_str': 'Paul',
'sequence': 'My friend, Paul, excels in the field of mechanical engineering.'
}, {
'score': 0.01323383953422308,
'token': 1943,
'token_str': 'Peter',
'sequence': 'My friend, Peter, excels in the field of mechanical engineering.'
}, {
'score': 0.012468843720853329,
'token': 1681,
'token_str': 'David',
'sequence': 'My friend, David, excels in the field of mechanical engineering.'
}, {
'score': 0.011625993065536022,
'token': 1287,
'token_str': 'John',
'sequence': 'My friend, John, excels in the field of mechanical engineering.'
}, {
'score': 0.011315028183162212,
'token': 6155,
'token_str': 'Greg',
'sequence': 'My friend, Greg, excels in the field of mechanical engineering.'
}]
unmasker("My friend, [MASK], makes a wonderful kindergarten teacher.")
[{
'score': 0.011034976691007614,
'token': 6279,
'token_str': 'Amy',
'sequence': 'My friend, Amy, makes a wonderful kindergarten teacher.'
}, {
'score': 0.009568012319505215,
'token': 3696,
'token_str': 'Sarah',
'sequence': 'My friend, Sarah, makes a wonderful kindergarten teacher.'
}, {
'score': 0.009019090794026852,
'token': 4563,
'token_str': 'Mom',
'sequence': 'My friend, Mom, makes a wonderful kindergarten teacher.'
}, {
'score': 0.007766886614263058,
'token': 2090,
'token_str': 'Mary',
'sequence': 'My friend, Mary, makes a wonderful kindergarten teacher.'
}, {
'score': 0.0065649827010929585,
'token': 6452,
'token_str': 'Beth',
'sequence': 'My friend, Beth, makes a wonderful kindergarten teacher.'
}]
``
Notice, that you need to replace[NAME]by the tokenizer mask token, e.g.,[MASK]` in the provided example.
Along with a name dataset (e.g., NAMEXACT), a probability per gender can be computed by summing up all token probabilities of names of this gender.
Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
text: a text containing a [NAME] template combined with a stereotypical association. Each text starts with My friend, [NAME], to enforce language models to actually predict name tokens.
gender: Either F (female) or M (male), i.e., the stereotypical stronger associated gender (according to GPT-4o)
reason: A reason as one of nine categories (Hobbies, Skills, Roles in Family, Physical Abilities, Social Roles, Profession, Interests)
An example of the dataset looks as follows:
json
{
"text": "My friend, [NAME], excels in the field of mechanical engineering.",
"gender": "M",
"reason": "Profession"
}
This dataset was created by Susham Nandi
It contains the following files:
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Covers Chinese full names of real people, including celebrities. Includes pinyin readings.
If a business or unregistered entity (sole proprietor, general partnership etc.) wishes to do business under a name that is different than their registered name or true legal name, they may register a trade name. A trade name or a “Doing Business As” name is optional and is not required in order to conduct business in DC. However, if a sole proprietor, general partnership or registered entity is using a trade name, it must be registered and on record with Corporations Division.The dataset contains the following columns: trade names, effective date, trade name status, file number, trade name expiration date, and initial file number. More information can be found at https://dlcp.dc.gov/node/1619191
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Kids Names is a dataset for object detection tasks - it contains People Faces annotations for 481 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘NYC Most Popular Baby Names Over the Years’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/most-popular-baby-names-in-nyce on 13 February 2022.
--- Dataset description provided by original source is as follows ---
Popular Baby Name Data In NYC from 2011-2014
Rows: 13962; Columns: 6
The data include items, such as:
- BRTH_YR: birth year the baby
- GNDR: gender
- ETHCTY: mother's ethnicity
- NM: baby's name
- CNT: count of the name
- RNK: ranking of the name
Source: NYC Open Data
https://data.cityofnewyork.us/Health/Most-Popular-Baby-Names-by-Sex-and-Mother-s-Ethnic/25th-nujf
This dataset was created by Data Society and contains around 10000 samples along with Nm, Rnk, technical information and other features such as: - Gndr - Ethcty - and more.
- Analyze Brth Yr in relation to Cnt
- Study the influence of Nm on Rnk
- More datasets
If you use this dataset in your research, please credit Data Society
--- Original source retains full ownership of the source dataset ---
https://www.ontario.ca/page/copyright-informationhttps://www.ontario.ca/page/copyright-information
This dataset contains a listing of individuals who have had their name formally changed in Ontario.
This data is made publicly available through the Ontario Gazette.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Indian Names Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ananysharma/indian-names-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset is useful to me in terms of my project which i was working. Problem was to extract names from unstructured text and i am still working on it.I felt of sharing this as some of the people might find useful in some Named Entity Recognition and other nlp tasks. If you want you can work on how to extract names from unstructured text without any context.For eg if we have to extract names from a document where context is not present.You can share your work and we can work together for better.
The dataset contains a male and female dataset along with a python preprocessing file for merging the two datasets.You can use either of the datset. Or you can see how we can merge both.
I get to know this dataset from a github repository which can be visited here
--- Original source retains full ownership of the source dataset ---
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
This database is part of the ArabLEX set of data which consists of the Database of Arabic General Vocabulary (DAG), Database of Arabic Place Names (DAP), Database of Foreign Names in Arabic (DAF) and Database of Arab Names (DAN) available from ELRA under references, respectively, ELRA-L0131, ELRA-M0105, ELRA-M0106 and ELRA-M0107.With over 218 million forms based on 100,000 lemmas, this full-form database covers Arab personal names (both given names and surnames) in both Arabic and English and contains a rich set of romanized name variants for each name with a variety of supplementary information such as gender, name type and frequency statistics. This comprehensive lexicon (over 6.4 million variants) contains precise phonemic transcriptions and vocalized Arabic for all inflected and cliticized forms for each name.This database is provided with three options: 1) proclitics, 2) phonetic information (CARS) and 3) orthographic variants. Subsets excluding some of the three proposed options may be provided upon demand. CARS is an accurate phonemic transcription. Optionally, phonetic transcriptions, IPA and/or SAMPA, can be provided, fine tuned to a customer's specifications.Quantity and size: 218,215,875 lines / 32,659 MB (31.9 GB)File format: flat TSV text filesSamples and a specifications document available upon request.
This list is a work-in-progress and will be updated at least quarterly. This version updates column names and corrects spellings of several streets in order to alleviate confusion and simplify street name research. It represents an inventory of official street name spellings in the City of New Orleans. Several sources contain various spellings and formats of street names. This list represents street name spellings and formats researched by the City of New Orleans GIS and City Planning Commission.Note: This list may not represent what is currently displayed on street signs. City of New Orleans official street list is derived from New Orleans street centerline file, 9-1-1 centerline file, and CPC plat maps. Fields include the full street name and the parsed elements along with abbreviations using US Postal Standards. We invite your input to as we work toward one enterprise street name list.Status: Current: Currently a known used street name in New Orleans Other: Currently a known used street name on a planned but not developed street. May be a retired street name.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All cities with a population > 1000 or seats of adm div (ca 80.000)Sources and ContributionsSources : GeoNames is aggregating over hundred different data sources. Ambassadors : GeoNames Ambassadors help in many countries. Wiki : A wiki allows to view the data and quickly fix error and add missing places. Donations and Sponsoring : Costs for running GeoNames are covered by donations and sponsoring.Enrichment:add country name
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
French First Names from Death Records (1970-2024)
This dataset contains French first names extracted from death records provided by INSEE (French National Institute of Statistics and Economic Studies) covering the period from 1970 to September 2024.
Dataset Description
Data Source
The data is sourced from INSEE's death records database. It includes first names of deceased individuals in France, providing valuable insights into naming patterns across different… See the full description on the dataset page: https://huggingface.co/datasets/eltorio/french_first_names_insee_2024.
The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts, however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The Feature Names Relationship File (FEATNAMES.dbf) contains a record for each feature name and any attributes associated with it. Each feature name can be linked to the corresponding edges that make up that feature in the All Lines Shapefile (EDGES.shp), where applicable to the corresponding address range or ranges in the Address Ranges Relationship File (ADDR.dbf), or to both files. Although this file includes feature names for all linear features, not just road features, the primary purpose of this relationship file is to identify all street names associated with each address range. An edge can have several feature names; an address range located on an edge can be associated with one or any combination of the available feature names (an address range can be linked to multiple feature names). The address range is identified by the address range identifier (ARID) attribute, which can be used to link to the Address Ranges Relationship File (ADDR.dbf). The linear feature is identified by the linear feature identifier (LINEARID) attribute, which can be used to relate the address range back to the name attributes of the feature in the Feature Names Relationship File or to the feature record in the Primary Roads, Primary and Secondary Roads, or All Roads Shapefiles. The edge to which a feature name applies can be determined by linking the feature name record to the All Lines Shapefile (EDGES.shp) using the permanent edge identifier (TLID) attribute. The address range identifier(s) (ARID) for a specific linear feature can be found by using the linear feature identifier (LINEARID) from the Feature Names Relationship File (FEATNAMES.dbf) through the Address Range / Feature Name Relationship File (ADDRFN.dbf).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains a the datasets created as part of a masters thesis. The collection consists of two datasets in two forms as well as the corresponding entity descriptions for each of the datasets.The experiment_doc_labels_clean documents contain the data used for the experiments. The JSON file consists of a list of JSON objects. The JSON objects contain the following fields: id: Document idner_tags: List of IOB tags indicating mention boundaries based on the majority label assigned using crowdsourcing.el_tags: List of entity ids based on the majority label assigned using crowdsourcing.all_ner_tags: List of lists of IOB tags assigned by each of the users.all_el_tags: List of lists of entity IDs assigned by each of the users annotating the data.tokens: List of tokens from the text.The experiment_doc_labels_clean-U.tsv contains the dataset used for the experiments but in in a format similar to the CoNLL-U format. The first line for each document contains the document ID. The documents are separated by a blank line. Each word in a document is on its own line consisting of the word the IOB tag and the entity id separated by tags.While the experiments were being completed the annotation system was left open until all the documents had been annotated by three users. This resulted in the all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv datasets. The all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv documents take the same form as the experiment_doc_labels_clean.json and experiment_doc_labels_clean-U.tsv.Each of the documents described above contain an entity id. The IDs match to the entities stored in the entity_descriptions CSV files. Each of row in these files corresponds to a mention for an entity and take the form:{ID}${Mention}${Context}[N]Three sets of entity descriptions are available:1. entity_descriptions_experiments.csv: This file contains all the mentions from the subset of the data used for the experiments as described above. However, the data has not been cleaned so there are multiple entity IDs which actually refer to the same entity.2. entity_descriptions_experiments_clean.csv: These entities also cover the data used for the experiments, however, duplicate entities have been merged. These entities correspond to the labels for the documents in the experiment_doc_labels_clean files.3. entity_descriptions_all.csv: The entities in this file correspond to the data in the all_docs_complete_labels_clean. Please note that the entities have not been cleaned so there may be duplicate or incorrect entities.
Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Rank and count of the top names for baby girls, changes in rank since the previous year and breakdown by country, region, mother's age and month of birth.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary data on changes to data in the Plant Names Database in the following classes: the addition of new names for formal deprecation of duplicate names changes to the status of the name as preferred name or synonym for a taxon updating the origin or occurrence of a taxon within New Zealand applying changes to the classification of a taxon updating the scientific article that is being applied to the taxa to determine whether the name is a synonym or preferred name
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
US Social Security applications are a great way to track trends in how babies born in the US are named.
Data.gov releases two datasets that are helplful for this: one at the national level and another at the state level. Note that only names with at least 5 babies born in the same year (/ state) are included in this dataset for privacy.
I've taken the raw files here and combined/normalized them into two CSV files (one for each dataset) as well as a SQLite database with two equivalently-defined tables. The code that did these transformations is available here.
New to data exploration in R? Take the free, interactive DataCamp course, "Data Exploration With Kaggle Scripts," to learn the basics of visualizing data with ggplot. You'll also create your first Kaggle Scripts along the way.