https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
This resource covers four million Japanese names and their romanized variants, and includes gender codes, classification codes, and frequency rankings.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Provides comprehensive coverage for the major Chinese romanization systems and their variants, and if needed can be expanded considerably with dialectal variants (Cantonese, Hakka, Hokkien, etc.).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary data on changes to data in the Plant Names Database in the following classes:
- the addition of new names
- formal deprecation of duplicate names
- changes to the status of the name as preferred name or synonym for a taxon
- updating the origin or occurrence of a taxon within New Zealand
- applying changes to the classification of a taxon
- updating the scientific article that is being applied to the taxa to determine whether the name is a synonym or preferred name
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Very comprehensive database of Arabic personal names and name variants mapped to the original Arabic script with a large variety of supplementary information. The database consists of 6,500,000 terms.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
A resource of Arab personal names and variants in the original Arabic script. This database covers several hundred thousand Arabic script variants, along with common spelling mistakes. Every Arabic name is normalized and vocalized.
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
This database is part of the ArabLEX set of data which consists of the Database of Arabic General Vocabulary (DAG), Database of Arabic Place Names (DAP), Database of Foreign Names in Arabic (DAF) and Database of Arab Names (DAN) available from ELRA under references, respectively, ELRA-L0131, ELRA-M0105, ELRA-M0106 and ELRA-M0107.
With over 218 million forms based on 100,000 lemmas, this full-form database covers Arab personal names (both given names and surnames) in both Arabic and English and contains a rich set of romanized name variants for each name with a variety of supplementary information such as gender, name type and frequency statistics. This comprehensive lexicon (over 6.4 million variants) contains precise phonemic transcriptions and vocalized Arabic for all inflected and cliticized forms for each name.
This database is provided with three options: 1) proclitics, 2) phonetic information (CARS) and 3) orthographic variants. Subsets excluding some of the three proposed options may be provided upon demand. CARS is an accurate phonemic transcription. Optionally, phonetic transcriptions, IPA and/or SAMPA, can be provided, fine tuned to a customer's specifications.
Quantity and size: 218,215,875 lines / 32,659 MB (31.9 GB)
File format: flat TSV text files
Samples and a specifications document available upon request.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This database covers non-Arabic names, their Arabic equivalents, and Arabic script variants for each name (with the most important variant given first).
https://creativecommons.org/publicdomain/zero/1.0/
Cultural diversity in the U.S. has led to great variations in names and naming traditions and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States
This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.
All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names
https://cloud.google.com/bigquery/public-data/usa-names
Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @dcp from Unsplash.
What are the most common names?
What are the most common female names?
Are there more female or male names?
Do female names outnumber male names by a wide margin?
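The questions above can be explored with a few lines of Python. The following is a minimal sketch on made-up rows; the `(name, gender, count)` layout and the helper names are assumptions for illustration, not the schema of the published table.

```python
from collections import Counter

# Made-up sample rows for illustration only: (name, gender, count).
rows = [
    ("Mary", "F", 120),
    ("James", "M", 150),
    ("Mary", "F", 80),
    ("Linda", "F", 90),
    ("Robert", "M", 70),
]


def most_common_names(rows, gender=None, top=3):
    """Sum counts per name, optionally restricted to one gender."""
    totals = Counter()
    for name, g, count in rows:
        if gender is None or g == gender:
            totals[name] += count
    return totals.most_common(top)


def distinct_names_by_gender(rows):
    """Count how many distinct names appear for each gender."""
    names = {}
    for name, g, _ in rows:
        names.setdefault(g, set()).add(name)
    return {g: len(s) for g, s in names.items()}
```

Calling `most_common_names(rows, gender="F")` restricts the ranking to female names, and `distinct_names_by_gender(rows)` gives the per-gender count of distinct names needed to compare the two groups.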
https://www.ontario.ca/page/copyright-information
This dataset contains a listing of individuals who have had their name formally changed in Ontario.
This data is made publicly available through the Ontario Gazette.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of male and female baby names in South Australia from 1944 to 2024. The annual data for baby names is published January/February each year.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
By Amber Thomas
This dataset contains data on foundation products from Sephora and Ulta, including brand, product, shade name, and color information. It was used in a visual essay on The Pudding entitled The Naked Truth to better understand how beauty brands name their foundation products.
Data were collected from the US versions of Sephora and Ulta’s websites using Microsoft Playwright. Shades that were transparent or untinted were removed, resulting in 6,816 swatches from 107 brands and 328 products. Shade names were determined by scraping the alt text of the swatch images and using RegEx to extract the name from the description listed, with 10% of names manually extracted from the alt text. Hex values and lightness for each shade of each product on the website were also determined, using R packages.
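To illustrate how a lightness value relates to a hex value, here is a minimal Python sketch that uses the standard `colorsys` module. It is not the author's R workflow, and the function name `hex_to_lightness` is an assumption; the dataset's own lightness column may have been computed with a different color model.

```python
import colorsys


def hex_to_lightness(hex_value):
    """Return HLS lightness in [0, 1] for a hex color such as '#F3CFB3'."""
    digits = hex_value.lstrip("#")
    # Convert each two-digit hex channel to a float in [0, 1].
    r, g, b = (int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
    _, lightness, _ = colorsys.rgb_to_hls(r, g, b)
    return lightness
```

HLS lightness is the midpoint of the largest and smallest RGB channels, so white maps to 1.0 and black to 0.0.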
Categories for shade names were assigned manually based on interpretation and may differ based on context clues provided by entire product lines. For example, Estée Lauder has a shade called dawn and another called dusk, presumably referring to time of day, while EXA has a similarly named shade that presumably refers to a person's name. Categories include Animal; Color; Compliment; Descriptor; Drink; Food; Gem; Location; Metal; Misc.; Name; Plant; Rock; Skin; Textile; Wood; etc.
Raw files may be different than clean files as additional manual extraction steps have been taken into account when compiling allShades.csv. Get ready to uncover The Naked Truth with this data set!
This dataset contains information about brands, products, URLs, descriptions, image sources, image alt texts and shade names for different foundation products. With this data you can gain insight into the variety of color-naming conventions used by beauty brands. The organization of this dataset includes raw files (Sephora & Ulta) and a clean file (AllShades).
## Accessing the Dataset
This dataset is available as multiple .csv files on kaggle.com under “Brands’ Foundation Color Names”. All data were collected from the US versions of Sephora and Ulta’s websites on January 11th and January 18th, respectively, using Microsoft Playwright. The name column may vary between these datasets because additional manual extraction was incorporated into the AllShades file for names that couldn't be programmatically extracted, in order to make them consistent across platforms. Furthermore, each label has a category that was manually assigned based on common words or phrases that appear in many product shade names. These include animals, colors, compliments, descriptors, etc., which you can use to analyze trends in how brands name their makeup products, from natural elements to non-traditional vocabulary!
## Columns Overview
Column names in this dataset include Brand Name; URL; Description; Image Source; Image Alt Text; Name (shade name); specific shade information, i.e. colorspace/Hex Value/Hue/Saturation/Lightness; and a Category describing each shade label based on its content (animal/color/compliment/descriptor etc.). The Name column also includes an estimated 10% of names per brand that had to be manually extracted from the image alt text due to lists such as 001. By creating visualizations such as lightness distributions or bar graphs of different categories (like type or country) with filter functions, you can look closely at trends between foundation shades separated by gender norms over time! There are lots of possibilities here, so get creative!
## Analyzing Trends
By leveraging R packages such as imager and magick alongside Structured Query Language (SQL) queries, it’s possible to analyze trends in lightness values and hex codes across multiple brands, while plots such as histograms and bar charts let one compare items that have been categorized by topic (plant versus animal, etc.). Finally depending upon what type visual your goal is
- Analyzing the prevalence of certain categories (e.g. colors, foods, animals) in foundation shade names across different brands to better understand potential target markets and marketing strategies
- Using trend analyses on the dataset to track changing trends in foundation shade names over time
- Creating visualizations to compare lightness, hue and saturation measurements for each swatc...
This dataset describes names of girls and boys, including the meaning of their names.
The columns are: Gender, Name, Meaning, Origin
This data can help to enrich other data sets.
During the processing of the Thomas La Fargue Papers collection, Suzanne James-Bacon created a spreadsheet to help identify and track the name variations in Romanization, forms, nicknames, and spellings found within La Fargue’s papers, Edward J. M. Rhoads’ book, Stepping Forth into the World: The Chinese Educational Mission to the United States, 1872-81, and other sources, such as the Chinese Educational Mission Connections 1872-1881 website, the Library of Congress Name Authority File, and Wikipedia. Because this spreadsheet may be of use to researchers, it is being made available.
https://creativecommons.org/publicdomain/zero/1.0/
(latest update v15.0)
The full emoji database in CSV format. All 4,159 of them!
🍕🏈👍🔥😍🏁😃🤣and more...
All emoji, with variations, and skin tones together with names, code points, as well as groups and sub-groups.
Unicode.org
They are very expressive, and efficient, and we use them every day...
For text mining and social media analytics, it's useful to be able to extract the emoji used and develop an understanding of people's thoughts and feelings. Emoji are most useful once we have their names, because then they become short phrases and can be analyzed just like other words and phrases.
Another important attribute is the groups and sub-groups that they belong to. This helps categorize the different emoji as faces, flags, events, face-smiling, face-concerned, etc.
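As a rough illustration of turning emoji into analyzable names, the sketch below uses Python's built-in `unicodedata` module instead of the CSV file. The function name `emoji_names` and the use of the 'So' (Symbol, other) category as an emoji test are assumptions: that category also contains non-emoji symbols, and multi-code-point sequences such as skin-tone variants and flags are not reassembled, which is where the full database is more reliable.

```python
import unicodedata


def emoji_names(text):
    """Return Unicode names of characters in the 'Symbol, other' category.

    Heuristic only: 'So' is broader than emoji, and sequences made of
    several code points are treated one code point at a time.
    """
    names = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, None)
            if name is not None:
                names.append(name)
    return names
```

The returned names can then be tokenized and counted like any other phrase in a text-mining pipeline.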
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
List of first names in Brazil, collected by IBGE.
This dataset could be used in many applications, including helping de-identify text.
The files contain one name per line, including variations of the same names.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data were generated for an investigation of research data repository (RDR) mentions in biomedical research articles.
Supplementary Table 1 is a discrete subset of SciCrunch RDRs used to study RDR mentions in biomedical literature. We generated this list by starting with the top 1000 entries in the SciCrunch database, measured by citations, then removing entries for organizations (such as universities without a corresponding RDR) or non-relevant tools (such as reference managers), updating links, and consolidating duplicates resulting from RDR mergers and name variations. The resulting list of 737 RDRs is based on a source list of RDRs in the SciCrunch database. The file includes the Research Resource Identifier (RRID), the RDR name, and a link to the RDR record in the SciCrunch database.
Supplementary Table 2 shows the RDRs, associated journals, and article-mention pairs (records) with text snippets extracted from mined Methods text in 2020 PubMed articles. The dataset has 4 components. The first shows the list of repositories with RDR mentions, and includes the Research Resource Identifier (RRID), the RDR name, the number of articles that mention the RDR, and a link to the record in the SciCrunch database. The second shows the list of journals in the study set with at least 1 RDR mention, and includes the Journal ID, name, ESSN/ISSN, the total count of publications in 2020, the number of articles that had text available to mine, the number of article-mention pairs (records), the number of articles with RDR mentions, the number of unique RDRs mentioned, and the % of articles with minable text. The third shows the top 200 journals by RDR mention, normalized by the proportion of articles with available text to mine, with the same metadata as the second table. The fourth shows text snippets for each RDR mention, and includes the RRID, RDR name, PubMedID (PMID), DOI, article publication date, journal name, journal ID, ESSN/ISSN, article title, and snippet.
JRC-Names is a highly multilingual named entity resource for person and organisation names (called 'entities') developed by the European Commission's Joint Research Centre (JRC). JRC-Names consists of large lists of names and their many spelling variants (up to hundreds for a single person), including across scripts (Latin, Greek, Arabic, Cyrillic, Japanese, Chinese, etc.).
The resource is the by-product of the Europe Media Monitor (EMM, see http://emm.newsbrief.eu/overview.html ) family of applications, which has been analysing up to 220,000 news reports per day, since 2004. EMM recognises names mentioned in the news in over twenty languages and decides automatically for each newly found name whether it belongs to a new entity or whether it is a spelling variant of a previously known entity. This resource allows EMM users to display news about people or organisations even if their names are spelt differently or if the news articles are written in different languages and scripts.
JRC-Names has been available for download since September 2011, consisting of name variant lists and accompanying software. The new linked data edition, accessible through the European Union's Open Data Portal, offers more information compared to the previously released resource and tool, including: titles and function names that have been historically found next to the person mentions; information about the time period during which name variants and their titles were found; various frequency counts; as well as links to other linked datasets such as DBPedia.
INFORMATION ON JRC-NAMES DATA MODEL
The JRC-Names RDF representation is based on lemon (Lexicon Model for Ontologies, see http://lemon-model.net/ ), a model which allows the expression of lexical information relative to ontologies.
JRC entities are modeled as instances of DBpedia classes (dbpedia:Person and dbpedia:Organisation) and the multilingual lexicalizations of their names and function names are represented as Lexical Entries of lemon Lexicons. Various other types of (linguistic) information and metadata are expressed using standardized vocabularies (LexInfo, OLiA, ISOCat, Lexvo, DCTerms, etc.). For cases where no already existing vocabulary could appropriately answer the needs, in-house classes and properties were defined ( see resource JRC data model for JRC names). The 'JRC-Names schema' gives an overview of how JRC-Names data is modeled.
JRC-Names has links towards the following datasets: DBpedia ( http://dbpedia.org/ ), New York Times Open Data ( http://data.nytimes.com/ ) and Talk of Europe ( http://linkedpolitics.ops.few.vu.nl ).
For further information on JRC-Names, see: https://ec.europa.eu/jrc/en/language-technologies/jrc-names .
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data contains approximately 36,000 personal names derived from medieval Russian documentation. More precisely, the names are collected from an edited version of the census book of Vodskaja pjatina, which was one of the five administrative areas of Novgorod in the late 15th century.
Editions were compiled in parts, and the first two, which cover the northernmost region, are called Переписная окладная книга по Новугороду Вотьской пятины (1851, 1852) (POKV I‒II). The third part of the book series Новгородские писцовые книги (1868) (NPK III) covers the southern and western parts of the study area.
The personal names were obtained from the source as follows: First, editions of the census book were obtained as scanned PDF files. These were converted into editable copies using the OCR (Optical Character Recognition) software ABBYY. The program read the original mid-19th century Russian text adequately with its old Russian alphabet package.
After the initial corrections, a Python script was written to harvest the personal names. This was based on exploiting the systematic formalities in how most of the names were presented in the census book. The script looked for abbreviations “дв.” and “д.” and extracted all following capitalized words until section end markers “.”, “;” or “:”. As an output, a name to pogost matrix was produced, which held the raw frequencies of each word in each pogost.
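The harvesting step described above could look roughly like the following Python sketch. The authors' actual script is not reproduced in the source, so the regular expression, the function names `harvest_names` and `name_by_pogost`, and the idea of passing one text per pogost are all assumptions made for illustration.

```python
import re
from collections import Counter

# A marker abbreviation ("дв." or "д.") that is not part of a longer word,
# followed by everything up to the next end marker (".", ";" or ":").
ENTRY = re.compile(r"(?<!\w)(?:дв\.|д\.)\s*([^.;:]*)[.;:]")


def harvest_names(text):
    """Collect capitalized words that follow a marker abbreviation."""
    names = []
    for match in ENTRY.finditer(text):
        for word in match.group(1).split():
            word = word.strip(",")
            if word[:1].isupper():
                names.append(word)
    return names


def name_by_pogost(sections):
    """Map each pogost label to a Counter of raw name frequencies."""
    return {pogost: Counter(harvest_names(text))
            for pogost, text in sections.items()}
```

Here `sections` is a hypothetical dictionary from pogost label to the OCR text of that pogost; the nested counters play the role of the name-to-pogost frequency matrix.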
The name data were cleaned mostly with the data wrangling program OpenRefine, in the following manner: First, all name forms shorter than four characters were removed, as there were no personal names consisting of three or fewer letters. Furthermore, nouns that were not names were removed. This meant discarding expressions that described a person’s special feature or profession, such as being a widow (“вдова”) or working as a deacon (“діакъ”). For some reason, the editors followed inconsistent conventions in capitalizing these non-name nouns.
In addition, some orthographical and morphological harmonization was done on the data. The letter ы was cut from the end of bynames, where it denotes plurality. The similarity of the so-called soft and hard signs, ь and ъ, caused some problems. As the latter is not used in contemporary Russian and was not used in the original documents either (Неволин 1853: 4 (in Appendix 1)), it was removed. The soft sign ь was also removed because it was absent in the original documents and had been used inconsistently by the editors. The letter ѣ (yat) is rarely used in personal names, but it was nevertheless changed to е (as in contemporary Russian), since it was often confused with the soft and hard signs (ь and ъ). Furthermore, the letter ѳ (fita) was often erroneously recognized as о or е. As it is only found in NPK III and only at the beginning of certain names, all of which are also written with “Ф” (e.g. “Ѳедко” vs. “Федко”), it was replaced with Ф.
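The cleaning rules above were applied in OpenRefine; the Python sketch below only restates them as code to make the order of operations concrete. The function name `harmonize` is hypothetical, and two simplifications are assumptions: the trailing ы is stripped from every form (the source limits this to bynames), and a lowercase ѳ is mapped to ф.

```python
# Character-level harmonization: ѣ -> е, ѳ -> ф, and removal of ъ and ь.
TABLE = str.maketrans("ѣѢѳѲ", "еЕфФ", "ъЪьЬ")


def harmonize(form):
    """Return the harmonized name form, or None if the form is discarded."""
    # Forms shorter than four characters were removed first.
    if len(form) < 4:
        return None
    form = form.translate(TABLE)
    # Assumption: treat every trailing ы as the plural marker of a byname.
    if form.endswith("ы"):
        form = form[:-1]
    return form
```

For instance, `harmonize("Ѳедко")` and `harmonize("Федко")` yield the same form, which is how variant spellings collapse into a single name type.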
In the second phase most of the erroneous orthographies were corrected. We do not describe here all the OCR errors that were found, but a short description of the most significant corrections follows. There were, for example, many letters whose similarity caused problems for the OCR program (e.g. и / й and б / в). In these cases, the correct orthography was sought in the census book editions, and OpenRefine was used accordingly to change erroneous forms to correct ones.
After the corrections were made, the number of name types (= name variants) was reduced from 4942 to 2748. The overall number of name tokens dropped as well, from 36,405 to 35,726. Of the name types, more than half (1484) have only one occurrence.
The refined and harmonized data is published as pogost-by-name frequency tabulations (a pogost is the equivalent of an English parish). The file is in tab-delimited (.tsv) format.
References:
Неволин, К. А. 1853, О пятинах и погостах новгородских в XVI веке, с приложением карты, Санкт-Петербург (Из Записок Императорского русского географического общества, Кн. VIII).
NPK III = Новгородские писцовые книги, Т. 3 : Переписная оброчная книга Вотской пятины, 1500 года, 1868, Санкт-Петербург.
POKV I, II = Переписная окладная книга по Новугороду Вотьской пятины, 1851, 1852, Имп. Моск. о-во истории и древностей рос., Москва.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author-ity 2018 dataset
Prepared by Vetle Torvik Apr. 22, 2021
The dataset is based on a snapshot of PubMed taken in December 2018 (NLMs baseline 2018 plus updates throughout 2018). A total of 29.1 million Article records and 114.2 million author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. The resulting clusters are provided in two different formats, the first in a file with only IDs and PMIDs, and the second in a file with cluster summaries:
####################
File 1: au2id2018.tsv
####################
Each line corresponds to an author name instance (PMID and Author name position) with an Author ID. It has the following tab-delimited fields:
1. Author ID
2. PMID
3. Author name position
########################
File 2: authority2018.tsv
#########################
Each line corresponds to a predicted author-individual represented by cluster of author name instances and a summary of all the corresponding papers and author name variants. Each cluster has a unique Author ID (the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields:
1. Author ID (or cluster ID) e.g., 3797874_1 represents a cluster where 3797874_1 is the earliest author name instance.
2. cluster size (number of author name instances on papers)
3. name variants separated by '|' with counts in parenthesis. Each variant of the format lastname_firstname middleinitial, suffix
4. last name variants separated by '|'
5. first name variants separated by '|'
6. middle initial variants separated by '|' ('-' if none)
7. suffix variants separated by '|' ('-' if none)
8. email addresses separated by '|' ('-' if none)
9. ORCIDs separated by '|' ('-' if none). From 2019 ORCID Public Data File https://orcid.org/ and from PubMed XML
10. range of years (e.g., 1997-2009)
11. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parenthesis; separated by '|'; ('-' if none)
12. Top 20 most frequent MeSH (after stoplisting) with counts in parenthesis; separated by '|'; ('-' if none)
13. Journal names with counts in parenthesis (separated by '|')
14. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parenthesis; separated by '|'; ('-' if none)
15. Co-author names (lowercased lastname and first/middle initials) with counts in parenthesis; separated by '|'; ('-' if none)
16. Author name instances (PMID_auno separated by '|')
17. Grant IDs (after normalization; '-' if none given; separated by '|')
18. Total number of times cited. (Citations are based on references harvested from open sources such as PMC).
19. h-index
20. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parenthesis); separated by '|'
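A minimal Python sketch of how the two file layouts described above might be read is given below. The function names are hypothetical, and the exact rendering of a counted item is an assumption (`label(count)` with no other parentheses after the label); check both against the actual files before relying on this.

```python
def parse_instance_id(instance_id):
    """Split an author name instance such as '10786286_3' into (PMID, position)."""
    pmid, position = instance_id.split("_")
    return int(pmid), int(position)


def parse_au2id_line(line):
    """Parse one au2id2018.tsv line: Author ID, PMID, author name position."""
    author_id, pmid, position = line.rstrip("\n").split("\t")
    return author_id, int(pmid), int(position)


def parse_counted(field):
    """Parse a '|'-separated field of 'label(count)' items into a dict.

    A lone '-' marks an empty field in the summary file.
    """
    if field == "-":
        return {}
    result = {}
    for item in field.split("|"):
        # Split on the last '(' so labels may contain spaces or commas.
        label, _, count = item.rpartition("(")
        result[label.strip()] = int(count.rstrip(")"))
    return result
```

The counted-field helper applies to the fields described as having counts in parenthesis, such as name variants or title words; plain '|'-separated fields can simply be split on '|'.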
https://data.gov.tw/license
Taiwan's place names data includes landmarks and public facilities information.