7 datasets found
  1. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Cite
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Available download formats: json
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News, and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

    Methods

    Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities whose names are similar to the chunk's Wikipedia page name are labeled. These entities were presented to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who had passed a preliminary trial task. Only tags assigned in agreement by at least 5 annotators were accepted, and the accepted tags were then reconciled by an experienced reconciliator.

    The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai) who had passed a preliminary trial task, and the results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

    Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to keep only mentions in good-quality text; each text was cut 1000 characters after the last mention.

    Usage Notes

    Entities:

    File: Namesakes_entities.jsonl. The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (the mention refers to this Wikipedia page's entity) or "Other" (the mention refers to some other entity that merely has the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • ‘pagename’: page name of the Wikipedia page.
    • ‘pageid’: page id of the Wikipedia page.
    • ‘title’: title of the Wikipedia page.
    • ‘url’: URL of the Wikipedia page.
    • ‘text’: the text chunk from the Wikipedia page.
    • ‘entities’: list of the mentions in the page text; each mention is a dictionary with the keys:
      • ‘text’: the mention as a string from the page text.
      • ‘start’: start character position of the mention in the text.
      • ‘end’: end (one-past-last) character position of the mention in the text.
      • ‘tag’: annotation tag given as a string, either ‘Same’ or ‘Other’.
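
    For reference, a minimal Python sketch that loads the Entities file and iterates over the tagged mentions using the keys above (assuming the file is in the working directory):

    import json

    # Load Namesakes_entities.jsonl and print every tagged mention with its
    # character span and its "Same"/"Other" tag.
    with open("Namesakes_entities.jsonl", encoding="utf-8") as f:
        for line in f:
            chunk = json.loads(line)
            for mention in chunk["entities"]:
                span = chunk["text"][mention["start"]:mention["end"]]
                print(chunk["pagename"], repr(span), mention["tag"])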

    News:

    File: Namesakes_news.jsonl. The News dataset consists of 1000 news text chunks, each with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • ‘id_text’: id of the sample.
    • ‘text’: the text chunk.
    • ‘urls’: list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
    • ‘entity’: a dictionary describing the annotated entity mention in the text:
      • ‘text’: the mention as a string found by an NER model in the text.
      • ‘start’: start character position of the mention in the text.
      • ‘end’: end (one-past-last) character position of the mention in the text.
      • ‘tag’: present only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
        • ‘pageid’: Wikipedia page id.
        • ‘pagetitle’: page title.
        • ‘url’: page URL.
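
    A minimal sketch of resolving News annotations against the Entities file via the optional 'tag' key (same working-directory assumption as above):

    import json

    # Index Entities chunks by page id, then report where each annotated News
    # mention points: an Entities page, or outside the Entities dataset.
    entities_by_pageid = {}
    with open("Namesakes_entities.jsonl", encoding="utf-8") as f:
        for line in f:
            chunk = json.loads(line)
            entities_by_pageid[chunk["pageid"]] = chunk

    with open("Namesakes_news.jsonl", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            mention = item["entity"]
            tag = mention.get("tag")
            if tag is None:
                print(mention["text"], "-> not in the Entities dataset")
            else:
                print(mention["text"], "->", tag["pagetitle"], tag["url"])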

    Backlinks dataset: The Backlinks dataset consists of two parts: the Entity-to-Backlinks dictionary and the Backlinks documents. The dictionary lists the backlinks for each entity of the Entities dataset (if any backlinks exist for the entity). The Backlinks documents are the backlink Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is marked by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
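
    A minimal sketch of extracting these mentions with a regular expression (the whitespace handling around '|' is an assumption based on the example above):

    import re

    # Match [[Entity Name]] and [[Entity Name | surface mention]]; group 1 is the
    # exact entity name, group 2 (if present) is the surface string in the text.
    MENTION_RE = re.compile(r"\[\[([^\[\]|]+?)(?:\s*\|\s*([^\[\]]+?))?\]\]")

    text = ("Muir also spent time with photographer "
            "[[Carleton E. Watkins | Carleton Watkins]] and studied his photographs.")
    for m in MENTION_RE.finditer(text):
        entity_name = m.group(1).strip()
        surface = (m.group(2) or m.group(1)).strip()
        print(entity_name, "<-", surface)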

    The Entity-to-Backlinks mapping is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl. Each item is a tuple:

    • Entity name.
    • Entity Wikipedia page id.
    • Backlinks ids: a list of pageids of backlink documents.

    The Backlinks documents file is a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl. Each item is a dictionary:

    • ‘pageid’: id of the Wikipedia page.
    • ‘title’: title of the Wikipedia page.
    • ‘content’: text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
    • ‘mentions’: list of the mentions from the text, for convenience. Each mention is a tuple:
      • Entity name.
      • Entity Wikipedia page id.
      • Sorted list of all character indexes at which the mention occurrences start in the text.
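
    A minimal sketch that joins the two Backlinks files, assuming the jsonl tuples are serialized as JSON arrays:

    import json

    # For each entity in the Entity-to-Backlinks mapping, look up its backlink
    # documents and the character offsets at which that entity is mentioned.
    docs = {}
    with open("Namesakes_backlinks_texts.jsonl", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            docs[doc["pageid"]] = doc

    with open("Namesakes_backlinks_entities.jsonl", encoding="utf-8") as f:
        for line in f:
            entity_name, entity_pageid, backlink_ids = json.loads(line)
            for backlink_id in backlink_ids:
                doc = docs.get(backlink_id)
                if doc is None:
                    continue
                for name, pageid, offsets in doc["mentions"]:
                    if pageid == entity_pageid:
                        print(entity_name, "->", doc["title"], offsets)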

  2. U.S. Powerlifting Competition Data 2015-Now

    • kaggle.com
    zip
    Updated Oct 24, 2020
    Cite
    Brian McAbee (2020). U.S. Powerlifting Competition Data 2015-Now [Dataset]. https://www.kaggle.com/datasets/brianmcabee/us-powerlifting-competition-data-2015now
    Available download formats: zip (8248868 bytes)
    Dataset updated
    Oct 24, 2020
    Authors
    Brian McAbee
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset covers powerlifters who've competed in the United States at raw, full-power events from 2015 to the present. All data was aggregated from OpenPowerlifting (see acknowledgments below) to create a smaller dataset focused on American competitions. Use it to conduct EDA on American powerlifters and to answer questions about differences within the powerlifting community in the USA.

    Content

    The file is composed of 38 columns covering basic lifter information, weights lifted, and competition location. The following sub-sections describe each column, as given by OpenPowerlifting's README.txt document that accompanies their dataset download.

    Name

    Mandatory. The name of the lifter in UTF-8 encoding.

    Lifters who share the same name are distinguished by use of a # symbol followed by a unique number. For example, two lifters both named John Doe would have Name values John Doe #1 and John Doe #2 respectively.
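
    A minimal sketch of stripping that disambiguation suffix (the helper name is illustrative, not part of the dataset):

    # Split "John Doe #1" into ("John Doe", 1); names without a suffix return None.
    def split_name(name: str):
        base, sep, num = name.partition(" #")
        return (base, int(num)) if sep else (name, None)

    print(split_name("John Doe #1"))   # ('John Doe', 1)
    print(split_name("Jane Smith"))    # ('Jane Smith', None)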

    Sex

    Mandatory. The sex category in which the lifter competed, M, F, or Mx.

    Mx (pronounced Muks) is a gender-neutral title — like Mr and Ms — originating from the UK. It is a catch-all sex category that is particularly appropriate for non-binary lifters.

    Event

    Mandatory. The type of competition that the lifter entered. For the purposes of this dataset, all Event values are SBD, since it contains only lifters who competed in events testing the squat, bench press, and deadlift.

    Equipment

    Mandatory. The equipment category under which the lifts were performed. For the purposes of this dataset, all values are Raw, as it contains only lifters who used minimal equipment.

    Age

    Optional. The age of the lifter on the start date of the meet, if known.

    Ages can be one of two types: exact or approximate. Exact ages are given as integer numbers, for example 23. Approximate ages are given as an integer plus 0.5, for example 23.5.

    Approximate ages mean that the lifter could be either of two possible ages. For an approximate age of n + 0.5, the possible ages are n or n+1. For example, a lifter with the given age 23.5 could be either 23 or 24 -- we don't have enough information to know.

    Approximate ages occur because some federations only provide us with birth year information. So another way to think about approximate ages is that 23.5 implies that the lifter turns 24 that year.
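
    A minimal sketch of interpreting the Age column under the rules above (the helper is illustrative):

    # Exact ages are whole numbers; approximate ages are n + 0.5, meaning the
    # lifter is either n or n + 1 on the meet start date.
    def possible_ages(age: float):
        if age == int(age):
            return [int(age)]
        n = int(age)
        return [n, n + 1]

    print(possible_ages(23.0))   # [23]
    print(possible_ages(23.5))   # [23, 24]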

    AgeClass

    Optional. The age class in which the lifter falls, for example 40-45. These classes are based on the exact age of the lifter on the day of competition.

    AgeClass is mostly useful because sometimes a federation will report that a lifter competed in the 50-54 division without providing any further age information. This way, we can still tag them as 50-54, even if the Age column is empty.

    BirthYearClass

    Optional. The birth year class in which the lifter falls, for example 40-49. The ages in the range are the oldest possible ages for the lifter that year. For example, 40-49 means "the year the lifter turns 40 through the full year in which the lifter turns 49."

    BirthYearClass is used primarily by the IPF and by IPF affiliates. Non-IPF federations tend to use AgeClass instead.

    Division

    Optional. Free-form UTF-8 text describing the division of competition, like Open or Juniors 20-23 or Professional.

    Some federations are configured in our database, which means that we have agreed on a limited set of division options for that federation and have rewritten their results to use only that set; tests enforce this. Even so, divisions are not standardized between configured federations: Division really is free-form text, provided only for context.

    Information about age should not be extracted from the Division, but from the AgeClass column.

    BodyweightKg

    Optional. The recorded bodyweight of the lifter at the time of competition, to two decimal places.

    WeightClassKg

    Optional. The weight class in which the lifter competed, to two decimal places.

    Weight classes can be specified as a maximum or as a minimum. Maximums are specified by just the number: for example, 90 means "up to (and including) 90 kg." Minimums are specified by a + to the right of the number: for example, 90+ means "above (and excluding) 90 kg."
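
    A minimal sketch of parsing WeightClassKg values under that convention (the helper name is illustrative):

    # "90"  -> (90.0, "max"): up to and including 90 kg.
    # "90+" -> (90.0, "min"): above 90 kg. Empty strings are treated as missing.
    def parse_weight_class(value: str):
        value = value.strip()
        if not value:
            return None
        if value.endswith("+"):
            return (float(value[:-1]), "min")
        return (float(value), "max")

    print(parse_weight_class("90"))    # (90.0, 'max')
    print(parse_weight_class("90+"))   # (90.0, 'min')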

    Squat1Kg, Bench1Kg, Deadlift1Kg

    Optional. First attempts for each of squat, bench, and deadlift, respectively. Maximum of two decimal places.

    Negative values indicate failed attempts.

    Not all federations report attempt information. Some federations only report Best attempts.
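
    A minimal sketch of picking the best successful attempt while honoring the negative-means-failed convention (illustrative helper, not a dataset column):

    # Failed attempts are negative; missing attempts are None. Returns the best
    # successful lift, or None if every attempt failed or is missing.
    def best_successful(*attempts):
        made = [a for a in attempts if a is not None and a > 0]
        return max(made) if made else None

    print(best_successful(140.0, -147.5, 147.5))   # 147.5
    print(best_successful(-140.0, None, None))     # None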

    Squat2Kg, Bench2Kg, Deadlift2Kg

    Optional. Second attempts for each of squat, bench, and deadlift, respectively. Maximum of two decimal places.

    Negative values indicate failed attempts.

    Not all federations report attempt ...

  3. Popular White Last Names in the US

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). Popular White Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-white-last-names-in-the-us/
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    United States
    Description

    This dataset lists popular last names in the United States among the White population.

  4. 🦈 Shark Tank US dataset 🇺🇸

    • kaggle.com
    zip
    Updated Jul 27, 2025
    Cite
    Satya Thirumani (2025). 🦈 Shark Tank US dataset 🇺🇸 [Dataset]. https://www.kaggle.com/datasets/thirumani/Shark-tank-us-dataset/code
    Available download formats: zip (91710 bytes)
    Dataset updated
    Jul 27, 2025
    Authors
    Satya Thirumani
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Shark Tank US is an American business reality television series. The dataset currently covers Shark Tank US season 1 through season 16 and has 53 fields/columns and 1440+ records.

    Below are the features/fields in the dataset (a short usage sketch in Python follows the list):

    • Season Number - Season number
    • Startup Name - Company name or product name
    • Episode Number - Episode number within the season
    • Pitch Number - Overall pitch number
    • Season Start - Season first aired date
    • Season End - Season last aired date
    • Original Air Date - Episode original/first aired date, on OTT/TV
    • Industry - Industry name or type
    • Business Description - Business Description
    • Company Website - Website of startup/company
    • Pitchers Gender - Gender of pitchers
    • Pitchers City - US city of pitchers
    • Pitchers State - US state or country of pitchers, two letter shortcut
    • Pitchers Average Age - Average age of all pitchers, <30 young, 30-50 middle, >50 old
    • Entrepreneur Names - Pitcher names
    • Multiple Entrepreneurs - Multiple entrepreneurs are present ? 1-yes, 0-no
    • US Viewership - Viewership in US, TRP rating, in millions
    • Original Ask Amount - Original Ask Amount, in USD
    • Original Offered Equity - Original Offered Equity, in percentages
    • Valuation Requested - Valuation Requested, in USD
    • Got Deal - Got the deal or not, 1-yes, 0-no
    • Total Deal Amount - Total Deal Amount, in USD, including debt/loan amount
    • Total Deal Equity - Total Deal Equity, in percentages
    • Deal Valuation - Deal Valuation, in USD
    • Number of sharks in deal - Number of sharks in deal
    • Investment Amount Per Shark - Investment Amount Per Shark
    • Equity Per Shark - Equity received by each Shark
    • Royalty Deal - Is it royalty deal or not (1-yes)
    • Advisory Shares Equity - Deal with Advisory shares or equity, in percentages
    • Loan - Loan/debt (line of credit) amount given by sharks, in USD
    • Deal has conditions - Deal has conditions or not? (yes or no)
    • Barbara Corcoran Investment Amount - Amount Invested by Barbara Corcoran
    • Barbara Corcoran Investment Equity - Equity received by Barbara Corcoran
    • Mark Cuban Investment Amount - Amount Invested by Mark Cuban
    • Mark Cuban Investment Equity - Equity received by Mark Cuban
    • Lori Greiner Investment Amount - Amount Invested by Lori Greiner
    • Lori Greiner Investment Equity - Equity received by Lori Greiner
    • Robert Herjavec Investment Amount - Amount Invested by Robert Herjavec
    • Robert Herjavec Investment Equity - Equity received by Robert Herjavec
    • Daymond John Investment Amount - Amount Invested by Daymond John
    • Daymond John Investment Equity - Equity received by Daymond John
    • Kevin O Leary Investment Amount - Amount Invested by Kevin O'Leary
    • Kevin O Leary Investment Equity - Equity received by Kevin O'Leary
    • Guest Investment Amount - Amount Invested by Guests
    • Guest Investment Equity - Equity received by Guests
    • Guest Name - Name of Guest shark, if invested in deal
    • Barbara Corcoran Present - Whether Barbara Corcoran present in episode or not
    • Mark Cuban Present - Whether Mark Cuban present in episode or not
    • Lori Greiner Present - Whether Lori Greiner present in episode or not
    • Robert Herjavec Present - Whether Robert Herjavec present in episode or not
    • Daymond John Present - Whether Daymond John present in episode or not
    • Kevin O Leary Present - Whether Kevin O Leary present in episode or not
    • Guest Present - Whether Guest present in episode or not
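
    A minimal pandas sketch using the column names above (the CSV filename is an assumption; adjust it to the extracted file):

    import pandas as pd

    # Share of pitches that got a deal, median requested valuation, and deals by
    # industry. "shark_tank_us.csv" is a placeholder filename.
    df = pd.read_csv("shark_tank_us.csv")

    deal_rate = df["Got Deal"].mean()                # Got Deal: 1 = yes, 0 = no
    median_ask = df["Valuation Requested"].median()  # in USD
    print(f"Deal rate: {deal_rate:.1%}, median requested valuation: ${median_ask:,.0f}")

    print(df.loc[df["Got Deal"] == 1, "Industry"].value_counts())
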
  5. Popular Names for Each Indian Caste

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). Popular Names for Each Indian Caste [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-names-for-each-indian-caste/
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    India, World
    Description

    This dataset represents the popular names for each Indian caste (Brahmin, Kshatriya, Vaisya and Sudra).

  6. 54k Resume dataset (structured)

    • kaggle.com
    zip
    Updated Nov 14, 2024
    + more versions
    Cite
    Suriya Ganesh (2024). 54k Resume dataset (structured) [Dataset]. https://www.kaggle.com/datasets/suriyaganesh/resume-dataset-structured
    Available download formats: zip (39830263 bytes)
    Dataset updated
    Nov 14, 2024
    Authors
    Suriya Ganesh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is aggregated from sources that are entirely available in the public domain.

    Resumes are usually in PDF format. OCR was used to convert the PDFs into text, and LLMs were used to convert the text into a structured format.

    Dataset Overview

    This dataset contains structured information extracted from professional resumes, normalized into multiple related tables. The data includes personal information, educational background, work experience, professional skills, and abilities.

    Table Schemas

    1. people.csv

    Primary table containing core information about each individual.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Unique identifier for each person | Primary Key, Not Null | 1
    name | VARCHAR(255) | Full name of the person | May be Null | "Database Administrator"
    email | VARCHAR(255) | Email address | May be Null | "john.doe@email.com"
    phone | VARCHAR(50) | Contact number | May be Null | "+1-555-0123"
    linkedin | VARCHAR(255) | LinkedIn profile URL | May be Null | "linkedin.com/in/johndoe"

    2. abilities.csv

    Detailed abilities and competencies listed by individuals.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    ability | TEXT | Description of ability | Not Null | "Installation and Building Server"

    3. education.csv

    Contains educational history for each person.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    institution | VARCHAR(255) | Name of educational institution | May be Null | "Lead City University"
    program | VARCHAR(255) | Degree or program name | May be Null | "Bachelor of Science"
    start_date | VARCHAR(7) | Start date of education | May be Null | "07/2013"
    location | VARCHAR(255) | Location of institution | May be Null | "Atlanta, GA"

    4. experience.csv

    Details of work experience entries.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    title | VARCHAR(255) | Job title | May be Null | "Database Administrator"
    firm | VARCHAR(255) | Company name | May be Null | "Family Private Care LLC"
    start_date | VARCHAR(7) | Employment start date | May be Null | "04/2017"
    end_date | VARCHAR(7) | Employment end date | May be Null | "Present"
    location | VARCHAR(255) | Job location | May be Null | "Roswell, GA"

    5. person_skills.csv

    Mapping table connecting people to their skills.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    skill | VARCHAR(255) | Reference to skills table | Foreign Key, Not Null | "SQL Server"

    6. skills.csv

    Master list of unique skills mentioned across all resumes.

    Column Name | Data Type | Description | Constraints | Example
    skill | VARCHAR(255) | Unique skill name | Primary Key, Not Null | "SQL Server"

    Relationships

    • Each person (people.csv) can have:
      • Multiple education entries (education.csv)
      • Multiple experience entries (experience.csv)
      • Multiple skills (person_skills.csv)
      • Multiple abilities (abilities.csv)
    • Skills (skills.csv) can be associated with multiple people
    • All relationships are maintained through the person_id field (see the join sketch below)
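
    A minimal pandas sketch of joining the tables through person_id (file names as listed above):

    import pandas as pd

    # Join core person records with their work experience, and collect each
    # person's skills into a list keyed by person_id.
    people = pd.read_csv("people.csv")
    experience = pd.read_csv("experience.csv")
    person_skills = pd.read_csv("person_skills.csv")

    history = people.merge(experience, on="person_id", how="left")
    skills_per_person = person_skills.groupby("person_id")["skill"].apply(list)
    print(history.head())
    print(skills_per_person.head())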

    Data Characteristics

    Date Formats

    • All dates are stored in MM/YYYY format
    • Current positions use "Present" for end_date (see the parsing sketch below)
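
    A minimal sketch of parsing those date values (illustrative helper; "Present" is mapped to None here, though mapping it to today's date is equally reasonable):

    from datetime import datetime

    # Parse "MM/YYYY" strings; "Present" and blanks become None.
    def parse_resume_date(value):
        if not value or value == "Present":
            return None
        return datetime.strptime(value, "%m/%Y").date()

    print(parse_resume_date("04/2017"))   # 2017-04-01
    print(parse_resume_date("Present"))   # None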

    Text Fields

    • All text fields preserve original case
    • NULL values indicate missing information
    • No maximum length enforced for TEXT fields
    • VARCHAR fields have practical limits noted in schema

    Identifiers

    • person_id starts at 1 and increments sequentially
    • No natural or composite keys used
    • All relationships maintained through person_id

    Common Usage Patterns

    Basic Queries

    -- Get all skills for a person
    SELECT s.skill 
    FROM person_skills ps
    JOIN skills s ON ps.skill = s.skill
    WHERE ps.person_id = 1;
    
    -- Get complete work history
    SELECT * 
    FROM experience
    WHERE person_id = 1
    ORDER BY start_date DESC;
    

    Analytics Queries

    -- Most common skills
    SELECT s.skill, COUNT(*) as frequency
    FROM person_skills ps
    ...
    
  7. 🎖️ Medal of Honor Dataset

    • kaggle.com
    zip
    Updated Aug 14, 2023
    Cite
    mexwell (2023). 🎖️ Medal of Honor Dataset [Dataset]. https://www.kaggle.com/datasets/mexwell/medal-of-honor-dataset
    Available download formats: zip (889935 bytes)
    Dataset updated
    Aug 14, 2023
    Authors
    mexwell
    License

    GNU GPL 2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    This dataset contains records of awards of the United States Medal of Honor. The Medal of Honor is the United States of America’s highest military honor, awarded for personal acts of valor above and beyond the call of duty. The medal is awarded by the President of the United States in the name of the U.S. Congress to U.S. military personnel only. There are three versions of the medal: one for the Army, one for the Navy, and one for the Air Force; personnel of the Marine Corps and Coast Guard receive the Navy version. The dataset was collected from the official military site and includes records about how the medal was awarded and characteristics of the recipient. Unfortunately, because of the nature of century-old record keeping, many of the records are incomplete. While a very interesting dataset, it does have some missing data.

    Data Dictionary

    Key | Type | Example Value
    death | Boolean | True
    name | String | "Sagelhurst, John C."
    awarded.General Order number | Integer | -1
    awarded.accredited to | String | ""
    awarded.citation | String | "Under a heavy fire from the enemy carried off the field a commissioned officer who was severely wounded and also led a charge on the enemy's rifle pits."
    awarded.issued | String | "01/03/1906"
    birth.location name | String | "Buffalo, N.Y."
    metadata.link | String | "http://www.cmohs.org/recipient-detail/1176/sagelhurst-john-c.php"
    military record.company | String | "Company B"
    military record.division | String | "1st New Jersey Cavalry"
    military record.entered service at | String | "Buffalo, N.Y."
    military record.organization | String | "U.S. Army"
    military record.rank | String | "Sergeant"
    awarded.date.day | Integer | 6
    awarded.date.full | String | "1865-2-6"
    awarded.date.month | Integer | 2
    awarded.date.year | Integer | 1865
    awarded.location.latitude | Integer | 38
    awarded.location.longitude | Integer | -77
    awarded.location.name | String | "Hatchers Run Court, Stafford, VA 22554, USA"
    birth.date.day | Integer | -1
    birth.date.month | Integer | -1
    birth.date.year | Integer | -1
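
    A minimal pandas sketch under two assumptions: the extracted file is a CSV (the filename below is a placeholder) and it uses the dotted column names from the dictionary above, with -1 and empty strings standing in for missing values:

    import pandas as pd

    # Load the records and count awards per year, ignoring missing award dates.
    df = pd.read_csv("medal_of_honor.csv")            # placeholder filename
    df = df.replace({-1: pd.NA, "": pd.NA})

    awards_per_year = (
        df["awarded.date.year"].dropna().astype(int).value_counts().sort_index()
    )
    print(awards_per_year)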

    Acknowledgement

    Original Data

    CORGIS Dataset Project

    Photo by Samuel Branch on Unsplash
