7 datasets found
  1. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Cite
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Available download formats: json
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News, and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

    Methods

    Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities whose names are similar to the chunk's Wikipedia page name are labeled. These entities were presented to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who had passed a preliminary trial task. Only tags assigned in agreement by at least 5 annotators were accepted, and the accepted tags were then reconciled by an experienced reconciliator.

    The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai) who had passed a preliminary trial task, and the results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

    Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to keep only mentions in good-quality text; each text was cut 1000 characters after the last mention.

    Usage Notes

    Entities:

    File: Namesakes_entities.jsonl. The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (the mention refers to this Wikipedia page's entity) or "Other" (the mention refers to some other entity that merely has the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • ‘pagename’: page name of the Wikipedia page.
    • ‘pageid’: page id of the Wikipedia page.
    • ‘title’: title of the Wikipedia page.
    • ‘url’: URL of the Wikipedia page.
    • ‘text’: the text chunk from the Wikipedia page.
    • ‘entities’: list of the mentions in the page text; each mention is a dictionary with the keys:
      • ‘text’: the mention as a string from the page text.
      • ‘start’: start character position of the mention in the text.
      • ‘end’: end (one-past-last) character position of the mention in the text.
      • ‘tag’: annotation tag given as a string, either ‘Same’ or ‘Other’.
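
    For reference, a minimal Python sketch that loads the Entities file and iterates over the tagged mentions using the keys above (assuming the file is in the working directory):

    import json

    # Load Namesakes_entities.jsonl and print every tagged mention with its
    # character span and its "Same"/"Other" tag.
    with open("Namesakes_entities.jsonl", encoding="utf-8") as f:
        for line in f:
            chunk = json.loads(line)
            for mention in chunk["entities"]:
                span = chunk["text"][mention["start"]:mention["end"]]
                print(chunk["pagename"], repr(span), mention["tag"])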

    News:

    File: Namesakes_news.jsonl. The News dataset consists of 1000 news text chunks, each with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • ‘id_text’: id of the sample.
    • ‘text’: the text chunk.
    • ‘urls’: list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
    • ‘entity’: a dictionary describing the annotated entity mention in the text:
      • ‘text’: the mention as a string found by an NER model in the text.
      • ‘start’: start character position of the mention in the text.
      • ‘end’: end (one-past-last) character position of the mention in the text.
      • ‘tag’: present only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
        • ‘pageid’: Wikipedia page id.
        • ‘pagetitle’: page title.
        • ‘url’: page URL.
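
    A minimal sketch of resolving News annotations against the Entities file via the optional 'tag' key (same working-directory assumption as above):

    import json

    # Index Entities chunks by page id, then report where each annotated News
    # mention points: an Entities page, or outside the Entities dataset.
    entities_by_pageid = {}
    with open("Namesakes_entities.jsonl", encoding="utf-8") as f:
        for line in f:
            chunk = json.loads(line)
            entities_by_pageid[chunk["pageid"]] = chunk

    with open("Namesakes_news.jsonl", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            mention = item["entity"]
            tag = mention.get("tag")
            if tag is None:
                print(mention["text"], "-> not in the Entities dataset")
            else:
                print(mention["text"], "->", tag["pagetitle"], tag["url"])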

    Backlinks dataset: The Backlinks dataset consists of two parts: the Entity-to-Backlinks dictionary and the Backlinks documents. The dictionary lists the backlinks for each entity of the Entities dataset (if any backlinks exist for the entity). The Backlinks documents are the backlink Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is marked by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
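
    A minimal sketch of extracting these mentions with a regular expression (the whitespace handling around '|' is an assumption based on the example above):

    import re

    # Match [[Entity Name]] and [[Entity Name | surface mention]]; group 1 is the
    # exact entity name, group 2 (if present) is the surface string in the text.
    MENTION_RE = re.compile(r"\[\[([^\[\]|]+?)(?:\s*\|\s*([^\[\]]+?))?\]\]")

    text = ("Muir also spent time with photographer "
            "[[Carleton E. Watkins | Carleton Watkins]] and studied his photographs.")
    for m in MENTION_RE.finditer(text):
        entity_name = m.group(1).strip()
        surface = (m.group(2) or m.group(1)).strip()
        print(entity_name, "<-", surface)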

    The Entity-to-Backlinks mapping is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl. Each item is a tuple:

    • Entity name.
    • Entity Wikipedia page id.
    • Backlinks ids: a list of pageids of backlink documents.

    The Backlinks documents file is a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl. Each item is a dictionary:

    • ‘pageid’: id of the Wikipedia page.
    • ‘title’: title of the Wikipedia page.
    • ‘content’: text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
    • ‘mentions’: list of the mentions from the text, for convenience. Each mention is a tuple:
      • Entity name.
      • Entity Wikipedia page id.
      • Sorted list of all character indexes at which the mention occurrences start in the text.
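
    A minimal sketch that joins the two Backlinks files, assuming the jsonl tuples are serialized as JSON arrays:

    import json

    # For each entity in the Entity-to-Backlinks mapping, look up its backlink
    # documents and the character offsets at which that entity is mentioned.
    docs = {}
    with open("Namesakes_backlinks_texts.jsonl", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            docs[doc["pageid"]] = doc

    with open("Namesakes_backlinks_entities.jsonl", encoding="utf-8") as f:
        for line in f:
            entity_name, entity_pageid, backlink_ids = json.loads(line)
            for backlink_id in backlink_ids:
                doc = docs.get(backlink_id)
                if doc is None:
                    continue
                for name, pageid, offsets in doc["mentions"]:
                    if pageid == entity_pageid:
                        print(entity_name, "->", doc["title"], offsets)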

  2. U.S. Powerlifting Competition Data 2015-Now

    • kaggle.com
    zip
    Updated Oct 24, 2020
    Cite
    Brian McAbee (2020). U.S. Powerlifting Competition Data 2015-Now [Dataset]. https://www.kaggle.com/datasets/brianmcabee/us-powerlifting-competition-data-2015now
    Available download formats: zip (8248868 bytes)
    Dataset updated
    Oct 24, 2020
    Authors
    Brian McAbee
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset covers powerlifters who've competed in the United States at raw, full-power events from 2015 to the present. All data was aggregated from OpenPowerlifting (see acknowledgments below) to create a smaller dataset focused on American competitions. Use it to conduct EDA on American powerlifters and to answer questions about differences within the powerlifting community in the USA.

    Content

    The file is composed of 38 columns covering basic lifter information, weights lifted, and competition location. The following sub-sections describe each column, as given by OpenPowerlifting's README.txt document that accompanies their dataset download.

    Name

    Mandatory. The name of the lifter in UTF-8 encoding.

    Lifters who share the same name are distinguished by use of a # symbol followed by a unique number. For example, two lifters both named John Doe would have Name values John Doe #1 and John Doe #2 respectively.
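
    A minimal sketch of stripping that disambiguation suffix (the helper name is illustrative, not part of the dataset):

    # Split "John Doe #1" into ("John Doe", 1); names without a suffix return None.
    def split_name(name: str):
        base, sep, num = name.partition(" #")
        return (base, int(num)) if sep else (name, None)

    print(split_name("John Doe #1"))   # ('John Doe', 1)
    print(split_name("Jane Smith"))    # ('Jane Smith', None)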

    Sex

    Mandatory. The sex category in which the lifter competed, M, F, or Mx.

    Mx (pronounced Muks) is a gender-neutral title — like Mr and Ms — originating from the UK. It is a catch-all sex category that is particularly appropriate for non-binary lifters.

    Event

    Mandatory. The type of competition that the lifter entered. For the purposes of this dataset, all Event values are SBD, since it contains only lifters who competed in events testing the squat, bench press, and deadlift.

    Equipment

    Mandatory. The equipment category under which the lifts were performed. For the purposes of this dataset, all values are Raw, as it contains only lifters who used minimal equipment.

    Age

    Optional. The age of the lifter on the start date of the meet, if known.

    Ages can be one of two types: exact or approximate. Exact ages are given as integer numbers, for example 23. Approximate ages are given as an integer plus 0.5, for example 23.5.

    Approximate ages mean that the lifter could be either of two possible ages. For an approximate age of n + 0.5, the possible ages are n or n+1. For example, a lifter with the given age 23.5 could be either 23 or 24 -- we don't have enough information to know.

    Approximate ages occur because some federations only provide us with birth year information. So another way to think about approximate ages is that 23.5 implies that the lifter turns 24 that year.
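
    A minimal sketch of interpreting the Age column under the rules above (the helper is illustrative):

    # Exact ages are whole numbers; approximate ages are n + 0.5, meaning the
    # lifter is either n or n + 1 on the meet start date.
    def possible_ages(age: float):
        if age == int(age):
            return [int(age)]
        n = int(age)
        return [n, n + 1]

    print(possible_ages(23.0))   # [23]
    print(possible_ages(23.5))   # [23, 24]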

    AgeClass

    Optional. The age class in which the lifter falls, for example 40-45. These classes are based on the exact age of the lifter on the day of competition.

    AgeClass is mostly useful because sometimes a federation will report that a lifter competed in the 50-54 division without providing any further age information. This way, we can still tag them as 50-54, even if the Age column is empty.

    BirthYearClass

    Optional. The birth year class in which the lifter falls, for example 40-49. The ages in the range are the oldest possible ages for the lifter that year. For example, 40-49 means "the year the lifter turns 40 through the full year in which the lifter turns 49."

    BirthYearClass is used primarily by the IPF and by IPF affiliates. Non-IPF federations tend to use AgeClass instead.

    Division

    Optional. Free-form UTF-8 text describing the division of competition, like Open or Juniors 20-23 or Professional.

    Some federations are configured in our database, which means that we have agreed on a limited set of division options for that federation and have rewritten their results to use only that set; tests enforce this. Even so, divisions are not standardized between configured federations: Division really is free-form text, provided only for context.

    Information about age should not be extracted from the Division, but from the AgeClass column.

    BodyweightKg

    Optional. The recorded bodyweight of the lifter at the time of competition, to two decimal places.

    WeightClassKg

    Optional. The weight class in which the lifter competed, to two decimal places.

    Weight classes can be specified as a maximum or as a minimum. Maximums are specified by just the number: for example, 90 means "up to (and including) 90 kg." Minimums are specified by a + to the right of the number: for example, 90+ means "above (and excluding) 90 kg."
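
    A minimal sketch of parsing WeightClassKg values under that convention (the helper name is illustrative):

    # "90"  -> (90.0, "max"): up to and including 90 kg.
    # "90+" -> (90.0, "min"): above 90 kg. Empty strings are treated as missing.
    def parse_weight_class(value: str):
        value = value.strip()
        if not value:
            return None
        if value.endswith("+"):
            return (float(value[:-1]), "min")
        return (float(value), "max")

    print(parse_weight_class("90"))    # (90.0, 'max')
    print(parse_weight_class("90+"))   # (90.0, 'min')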

    Squat1Kg, Bench1Kg, Deadlift1Kg

    Optional. First attempts for each of squat, bench, and deadlift, respectively. Maximum of two decimal places.

    Negative values indicate failed attempts.

    Not all federations report attempt information. Some federations only report Best attempts.
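
    A minimal sketch of picking the best successful attempt while honoring the negative-means-failed convention (illustrative helper, not a dataset column):

    # Failed attempts are negative; missing attempts are None. Returns the best
    # successful lift, or None if every attempt failed or is missing.
    def best_successful(*attempts):
        made = [a for a in attempts if a is not None and a > 0]
        return max(made) if made else None

    print(best_successful(140.0, -147.5, 147.5))   # 147.5
    print(best_successful(-140.0, None, None))     # None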

    Squat2Kg, Bench2Kg, Deadlift2Kg

    Optional. Second attempts for each of squat, bench, and deadlift, respectively. Maximum of two decimal places.

    Negative values indicate failed attempts.

    Not all federations report attempt ...

  3. Popular White Last Names in the US

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). Popular White Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-white-last-names-in-the-us/
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    United States
    Description

    This dataset lists popular last names in the United States among the White population.

  4. 🦈 Shark Tank US dataset 🇺🇸

    • kaggle.com
    zip
    Updated Jul 27, 2025
    Cite
    Satya Thirumani (2025). 🦈 Shark Tank US dataset 🇺🇸 [Dataset]. https://www.kaggle.com/datasets/thirumani/Shark-tank-us-dataset/code
    Available download formats: zip (91710 bytes)
    Dataset updated
    Jul 27, 2025
    Authors
    Satya Thirumani
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Shark Tank US is an American business reality television series. The dataset currently covers Shark Tank US season 1 through season 16 and has 53 fields/columns and 1440+ records.

    Below are the features/fields in the dataset (a short usage sketch in Python follows the list):

    • Season Number - Season number
    • Startup Name - Company name or product name
    • Episode Number - Episode number within the season
    • Pitch Number - Overall pitch number
    • Season Start - Season first aired date
    • Season End - Season last aired date
    • Original Air Date - Episode original/first aired date, on OTT/TV
    • Industry - Industry name or type
    • Business Description - Business Description
    • Company Website - Website of startup/company
    • Pitchers Gender - Gender of pitchers
    • Pitchers City - US city of pitchers
    • Pitchers State - US state or country of pitchers, two letter shortcut
    • Pitchers Average Age - Average age of all pitchers, <30 young, 30-50 middle, >50 old
    • Entrepreneur Names - Pitcher names
    • Multiple Entrepreneurs - Multiple entrepreneurs are present ? 1-yes, 0-no
    • US Viewership - Viewership in US, TRP rating, in millions
    • Original Ask Amount - Original Ask Amount, in USD
    • Original Offered Equity - Original Offered Equity, in percentages
    • Valuation Requested - Valuation Requested, in USD
    • Got Deal - Got the deal or not, 1-yes, 0-no
    • Total Deal Amount - Total Deal Amount, in USD, including debt/loan amount
    • Total Deal Equity - Total Deal Equity, in percentages
    • Deal Valuation - Deal Valuation, in USD
    • Number of sharks in deal - Number of sharks in deal
    • Investment Amount Per Shark - Investment Amount Per Shark
    • Equity Per Shark - Equity received by each Shark
    • Royalty Deal - Is it royalty deal or not (1-yes)
    • Advisory Shares Equity - Deal with Advisory shares or equity, in percentages
    • Loan - Loan/debt (line of credit) amount given by sharks, in USD
    • Deal has conditions - Deal has conditions or not? (yes or no)
    • Barbara Corcoran Investment Amount - Amount Invested by Barbara Corcoran
    • Barbara Corcoran Investment Equity - Equity received by Barbara Corcoran
    • Mark Cuban Investment Amount - Amount Invested by Mark Cuban
    • Mark Cuban Investment Equity - Equity received by Mark Cuban
    • Lori Greiner Investment Amount - Amount Invested by Lori Greiner
    • Lori Greiner Investment Equity - Equity received by Lori Greiner
    • Robert Herjavec Investment Amount - Amount Invested by Robert Herjavec
    • Robert Herjavec Investment Equity - Equity received by Robert Herjavec
    • Daymond John Investment Amount - Amount Invested by Daymond John
    • Daymond John Investment Equity - Equity received by Daymond John
    • Kevin O Leary Investment Amount - Amount Invested by Kevin O'Leary
    • Kevin O Leary Investment Equity - Equity received by Kevin O'Leary
    • Guest Investment Amount - Amount Invested by Guests
    • Guest Investment Equity - Equity received by Guests
    • Guest Name - Name of Guest shark, if invested in deal
    • Barbara Corcoran Present - Whether Barbara Corcoran present in episode or not
    • Mark Cuban Present - Whether Mark Cuban present in episode or not
    • Lori Greiner Present - Whether Lori Greiner present in episode or not
    • Robert Herjavec Present - Whether Robert Herjavec present in episode or not
    • Daymond John Present - Whether Daymond John present in episode or not
    • Kevin O Leary Present - Whether Kevin O Leary present in episode or not
    • Guest Present - Whether Guest present in episode or not
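
    A minimal pandas sketch using the column names above (the CSV filename is an assumption; adjust it to the extracted file):

    import pandas as pd

    # Share of pitches that got a deal, median requested valuation, and deals by
    # industry. "shark_tank_us.csv" is a placeholder filename.
    df = pd.read_csv("shark_tank_us.csv")

    deal_rate = df["Got Deal"].mean()                # Got Deal: 1 = yes, 0 = no
    median_ask = df["Valuation Requested"].median()  # in USD
    print(f"Deal rate: {deal_rate:.1%}, median requested valuation: ${median_ask:,.0f}")

    print(df.loc[df["Got Deal"] == 1, "Industry"].value_counts())
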
  5. Popular Names for Each Indian Caste

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). Popular Names for Each Indian Caste [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-names-for-each-indian-caste/
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    India, World
    Description

    This dataset represents the popular names for each Indian caste (Brahmin, Kshatriya, Vaisya and Sudra).

  6. 54k Resume dataset (structured)

    • kaggle.com
    zip
    Updated Nov 14, 2024
    + more versions
    Cite
    Suriya Ganesh (2024). 54k Resume dataset (structured) [Dataset]. https://www.kaggle.com/datasets/suriyaganesh/resume-dataset-structured
    Available download formats: zip (39830263 bytes)
    Dataset updated
    Nov 14, 2024
    Authors
    Suriya Ganesh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is aggregated from sources that are entirely available in the public domain.

    Resumes are usually in PDF format. OCR was used to convert the PDFs into text, and LLMs were used to convert the text into a structured format.

    Dataset Overview

    This dataset contains structured information extracted from professional resumes, normalized into multiple related tables. The data includes personal information, educational background, work experience, professional skills, and abilities.

    Table Schemas

    1. people.csv

    Primary table containing core information about each individual.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Unique identifier for each person | Primary Key, Not Null | 1
    name | VARCHAR(255) | Full name of the person | May be Null | "Database Administrator"
    email | VARCHAR(255) | Email address | May be Null | "john.doe@email.com"
    phone | VARCHAR(50) | Contact number | May be Null | "+1-555-0123"
    linkedin | VARCHAR(255) | LinkedIn profile URL | May be Null | "linkedin.com/in/johndoe"

    2. abilities.csv

    Detailed abilities and competencies listed by individuals.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    ability | TEXT | Description of ability | Not Null | "Installation and Building Server"

    3. education.csv

    Contains educational history for each person.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    institution | VARCHAR(255) | Name of educational institution | May be Null | "Lead City University"
    program | VARCHAR(255) | Degree or program name | May be Null | "Bachelor of Science"
    start_date | VARCHAR(7) | Start date of education | May be Null | "07/2013"
    location | VARCHAR(255) | Location of institution | May be Null | "Atlanta, GA"

    4. experience.csv

    Details of work experience entries.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    title | VARCHAR(255) | Job title | May be Null | "Database Administrator"
    firm | VARCHAR(255) | Company name | May be Null | "Family Private Care LLC"
    start_date | VARCHAR(7) | Employment start date | May be Null | "04/2017"
    end_date | VARCHAR(7) | Employment end date | May be Null | "Present"
    location | VARCHAR(255) | Job location | May be Null | "Roswell, GA"

    5. person_skills.csv

    Mapping table connecting people to their skills.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    skill | VARCHAR(255) | Reference to skills table | Foreign Key, Not Null | "SQL Server"

    6. skills.csv

    Master list of unique skills mentioned across all resumes.

    Column Name | Data Type | Description | Constraints | Example
    skill | VARCHAR(255) | Unique skill name | Primary Key, Not Null | "SQL Server"

    Relationships

    • Each person (people.csv) can have:
      • Multiple education entries (education.csv)
      • Multiple experience entries (experience.csv)
      • Multiple skills (person_skills.csv)
      • Multiple abilities (abilities.csv)
    • Skills (skills.csv) can be associated with multiple people
    • All relationships are maintained through the person_id field (see the join sketch below)
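
    A minimal pandas sketch of joining the tables through person_id (file names as listed above):

    import pandas as pd

    # Join core person records with their work experience, and collect each
    # person's skills into a list keyed by person_id.
    people = pd.read_csv("people.csv")
    experience = pd.read_csv("experience.csv")
    person_skills = pd.read_csv("person_skills.csv")

    history = people.merge(experience, on="person_id", how="left")
    skills_per_person = person_skills.groupby("person_id")["skill"].apply(list)
    print(history.head())
    print(skills_per_person.head())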

    Data Characteristics

    Date Formats

    • All dates are stored in MM/YYYY format
    • Current positions use "Present" for end_date (see the parsing sketch below)
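
    A minimal sketch of parsing those date values (illustrative helper; "Present" is mapped to None here, though mapping it to today's date is equally reasonable):

    from datetime import datetime

    # Parse "MM/YYYY" strings; "Present" and blanks become None.
    def parse_resume_date(value):
        if not value or value == "Present":
            return None
        return datetime.strptime(value, "%m/%Y").date()

    print(parse_resume_date("04/2017"))   # 2017-04-01
    print(parse_resume_date("Present"))   # None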

    Text Fields

    • All text fields preserve original case
    • NULL values indicate missing information
    • No maximum length enforced for TEXT fields
    • VARCHAR fields have practical limits noted in schema

    Identifiers

    • person_id starts at 1 and increments sequentially
    • No natural or composite keys used
    • All relationships maintained through person_id

    Common Usage Patterns

    Basic Queries

    -- Get all skills for a person
    SELECT s.skill 
    FROM person_skills ps
    JOIN skills s ON ps.skill = s.skill
    WHERE ps.person_id = 1;
    
    -- Get complete work history
    SELECT * 
    FROM experience
    WHERE person_id = 1
    ORDER BY start_date DESC;
    

    Analytics Queries

    -- Most common skills
    SELECT s.skill, COUNT(*) as frequency
    FROM person_skills ps
    ...
    
  7. 🎖️ Medal of Honor Dataset

    • kaggle.com
    zip
    Updated Aug 14, 2023
    Cite
    mexwell (2023). 🎖️ Medal of Honor Dataset [Dataset]. https://www.kaggle.com/datasets/mexwell/medal-of-honor-dataset
    Available download formats: zip (889935 bytes)
    Dataset updated
    Aug 14, 2023
    Authors
    mexwell
    License

    GNU GPL 2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    This dataset contains records of awards of the United States Medal of Honor. The Medal of Honor is the United States of America’s highest military honor, awarded for personal acts of valor above and beyond the call of duty. The medal is awarded by the President of the United States in the name of the U.S. Congress to U.S. military personnel only. There are three versions of the medal: one for the Army, one for the Navy, and one for the Air Force; personnel of the Marine Corps and Coast Guard receive the Navy version. The dataset was collected from the official military site and includes records about how the medal was awarded and characteristics of the recipient. Unfortunately, because of the nature of century-old record keeping, many of the records are incomplete. While a very interesting dataset, it does have some missing data.

    Data Dictionary

    Key | Type | Example Value
    death | Boolean | True
    name | String | "Sagelhurst, John C."
    awarded.General Order number | Integer | -1
    awarded.accredited to | String | ""
    awarded.citation | String | "Under a heavy fire from the enemy carried off the field a commissioned officer who was severely wounded and also led a charge on the enemy's rifle pits."
    awarded.issued | String | "01/03/1906"
    birth.location name | String | "Buffalo, N.Y."
    metadata.link | String | "http://www.cmohs.org/recipient-detail/1176/sagelhurst-john-c.php"
    military record.company | String | "Company B"
    military record.division | String | "1st New Jersey Cavalry"
    military record.entered service at | String | "Buffalo, N.Y."
    military record.organization | String | "U.S. Army"
    military record.rank | String | "Sergeant"
    awarded.date.day | Integer | 6
    awarded.date.full | String | "1865-2-6"
    awarded.date.month | Integer | 2
    awarded.date.year | Integer | 1865
    awarded.location.latitude | Integer | 38
    awarded.location.longitude | Integer | -77
    awarded.location.name | String | "Hatchers Run Court, Stafford, VA 22554, USA"
    birth.date.day | Integer | -1
    birth.date.month | Integer | -1
    birth.date.year | Integer | -1
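
    A minimal pandas sketch under two assumptions: the extracted file is a CSV (the filename below is a placeholder) and it uses the dotted column names from the dictionary above, with -1 and empty strings standing in for missing values:

    import pandas as pd

    # Load the records and count awards per year, ignoring missing award dates.
    df = pd.read_csv("medal_of_honor.csv")            # placeholder filename
    df = df.replace({-1: pd.NA, "": pd.NA})

    awards_per_year = (
        df["awarded.date.year"].dropna().astype(int).value_counts().sort_index()
    )
    print(awards_per_year)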

    Acknowledgement

    Original Data

    CORGIS Dataset Project

    Photo by Samuel Branch on Unsplash
