Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Motivation: to create a challenging dataset for testing Named-Entity
Linking. The Namesakes dataset consists of three closely related datasets: Entities, News, and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can easily be confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News chunks are human-labeled, resolving the mentions of the entities.
Methods
Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with ambiguous names. In each Entities text chunk, the named entities whose names are similar to the chunk's Wikipedia page name are labeled. These entities were presented to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who had passed a preliminary trial task. Only tags assigned in agreement by at least 5 annotators were accepted, and these then passed through reconciliation with an experienced reconciliator.
The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can easily be confused with them. In each News text chunk, one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai) after they had passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All labeling was done using LightTag (lighttag.io).
Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to keep only mentions in good-quality text; each text was cut 1000 characters after the last mention.
Usage Notes
Entities:
File: Namesakes_entities.jsonl
The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page's entity) or "Other" (meaning that the mention is of some other entity that merely has the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key 'pagename': page name of the Wikipedia page.
Key 'pageid': page id of the Wikipedia page.
Key 'title': title of the Wikipedia page.
Key 'url': URL of the Wikipedia page.
Key 'text': the text chunk from the Wikipedia page.
Key 'entities': list of the mentions in the page text; each mention is represented by a dictionary with the keys:
Key 'text': the mention as a string from the page text.
Key 'start': start character position of the mention in the text.
Key 'end': end (one-past-last) character position of the mention in the text.
Key 'tag': annotation tag given as a string, either 'Same' or 'Other'.
News: File: Namesakes_news.jsonl
The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity) or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key 'id_text': id of the sample.
Key 'text': the text chunk.
Key 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
Key 'entity': a dictionary describing the annotated entity mention in the text:
Key 'text': the mention as a string found by an NER model in the text.
Key 'start': start character position of the mention in the text.
Key 'end': end (one-past-last) character position of the mention in the text.
Key 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
Key 'pageid': Wikipedia page id.
Key 'pagetitle': page title.
Key 'url': page URL.
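As a sketch of how these files can be consumed (the record below uses made-up illustrative values; only the keys described above are assumed), each line of the jsonl is one JSON object, which Python's json module parses directly:

```python
import json

# One line of Namesakes_entities.jsonl, with illustrative (made-up) values.
line = json.dumps({
    "pagename": "John Muir", "pageid": 1, "title": "John Muir",
    "url": "https://en.wikipedia.org/wiki/John_Muir",
    "text": "John Muir met another John Muir.",
    "entities": [
        {"text": "John Muir", "start": 0, "end": 9, "tag": "Same"},
        {"text": "John Muir", "start": 22, "end": 31, "tag": "Other"},
    ],
})

record = json.loads(line)
# "Other" mentions are namesakes that merely share the page entity's name.
others = [e["text"] for e in record["entities"] if e["tag"] == "Other"]
```

The start/end offsets are plain character positions, so `record["text"][start:end]` recovers each mention string.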
Backlinks dataset: The Backlinks dataset consists of two parts: an Entity-to-Backlinks dictionary and Backlinks documents. The dictionary points to the backlinks for each entity of the Entities dataset (if any backlinks exist for the entity). The Backlinks documents are the backlink Wikipedia text chunks with identified mentions of the entities from the Entities dataset.
Each mention is identified by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". If the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to its right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
Entity-to-Backlinks is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl
Each item is a tuple: entity name; entity Wikipedia page id; backlink ids (a list of pageids of backlink documents).
The Backlinks documents file is a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl
Each item is a dictionary:
Key 'pageid': id of the Wikipedia page.
Key 'title': title of the Wikipedia page.
Key 'content': text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
Key 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple: entity name; entity Wikipedia page id; sorted list of all character indexes at which the mention's occurrences start in the text.
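The bracketed-mention convention described above can be parsed with a short regular expression. A minimal sketch (extract_mentions is a hypothetical helper, not part of the dataset):

```python
import re

# Matches [[Entity Name]] or [[Entity Name | mention string]].
MENTION_RE = re.compile(r"\[\[([^\]|]+?)(?:\s*\|\s*([^\]]+?))?\s*\]\]")

def extract_mentions(content):
    """Return (entity_name, mention_text) pairs from a Backlinks text chunk."""
    pairs = []
    for m in MENTION_RE.finditer(content):
        name = m.group(1).strip()
        # When no '|' alternative is given, the mention equals the entity name.
        mention = (m.group(2) or name).strip()
        pairs.append((name, mention))
    return pairs

text = ("Muir also spent time with photographer "
        "[[Carleton E. Watkins | Carleton Watkins]] and built a cabin along "
        "[[Yosemite Creek]].")
```

Calling `extract_mentions(text)` on the example yields one pair where the mention differs from the entity name and one where it does not.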
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is composed of powerlifters who've competed in the United States at raw, full-power events from 2015 to now. All data was aggregated from OpenPowerlifting (see acknowledgments below) to create a smaller dataset with relevant American information. Use this dataset as a means to conduct EDA on American powerlifters and answer questions you may have about discrepancies within the powerlifting community in the USA.
The file is composed of 38 columns covering basic lifter information, weights lifted, competition location, and more. The following sub-sections provide a breakdown of each category in the dataset, as provided by OpenPowerlifting's README.txt document that accompanies downloading their dataset.
Mandatory. The name of the lifter in UTF-8 encoding.
Lifters who share the same name are distinguished by use of a # symbol followed by a unique number. For example, two lifters both named John Doe would have Name values John Doe #1 and John Doe #2 respectively.
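A minimal sketch of splitting such a name back into its base name and disambiguating number (split_lifter_name is a hypothetical helper, not part of the dataset):

```python
def split_lifter_name(name):
    """Split 'John Doe #1' into ('John Doe', 1); plain names get (name, None)."""
    base, sep, num = name.rpartition(" #")
    if sep and num.isdigit():
        return base, int(num)
    return name, None
```

Using rpartition means a name that legitimately contains '#' earlier in the string is still split on its final " #<number>" suffix only.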
Mandatory. The sex category in which the lifter competed, M, F, or Mx.
Mx (pronounced Muks) is a gender-neutral title — like Mr and Ms — originating from the UK. It is a catch-all sex category that is particularly appropriate for non-binary lifters.
Mandatory. The type of competition that the lifter entered. For the purposes of this dataset, all event values will be SBD for lifters who've competed in events testing their squat, bench, and deadlift.
Mandatory. The equipment category under which the lifts were performed. For the purposes of this dataset, all values will be Raw, as it contains information about lifters who utilize minimal equipment.
Optional. The age of the lifter on the start date of the meet, if known.
Ages can be one of two types: exact or approximate. Exact ages are given as integer numbers, for example 23. Approximate ages are given as an integer plus 0.5, for example 23.5.
Approximate ages mean that the lifter could be either of two possible ages. For an approximate age of n + 0.5, the possible ages are n or n+1. For example, a lifter with the given age 23.5 could be either 23 or 24 -- we don't have enough information to know.
Approximate ages occur because some federations only provide us with birth year information. So another way to think about approximate ages is that 23.5 implies that the lifter turns 24 that year.
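The exact-vs-approximate convention above can be sketched in a few lines (possible_ages is a hypothetical helper, not part of the dataset):

```python
def possible_ages(age):
    """Return the set of exact ages consistent with an Age value.

    Exact ages (e.g. 23) map to themselves; approximate ages of the
    form n + 0.5 (e.g. 23.5) mean the lifter is either n or n + 1.
    """
    if age == int(age):
        return {int(age)}
    n = int(age)  # int() truncates, so 23.5 -> 23
    return {n, n + 1}
```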
Optional. The age class in which the lifter falls, for example 40-45. These classes are based on the exact age of the lifter on the day of competition.
AgeClass is mostly useful because sometimes a federation will report that a lifter competed in the 50-54 division without providing any further age information. This way, we can still tag them as 50-54, even if the Age column is empty.
Optional. The birth year class in which the lifter falls, for example 40-49. The ages in the range are the oldest possible ages for the lifter that year. For example, 40-49 means "the year the lifter turns 40 through the full year in which the lifter turns 49."
BirthYearClass is used primarily by the IPF and by IPF affiliates. Non-IPF federations tend to use AgeClass instead.
Optional. Free-form UTF-8 text describing the division of competition, like Open or Juniors 20-23 or Professional.
Some federations are configured in our database, meaning we have agreed on a limited set of division options for that federation, have rewritten their results to use only that set, and enforce this with tests. Even so, divisions are not standardized between configured federations: it really is free-form text, just to provide context.
Information about age should not be extracted from the Division, but from the AgeClass column.
Optional. The recorded bodyweight of the lifter at the time of competition, to two decimal places.
Optional. The weight class in which the lifter competed, to two decimal places.
Weight classes can be specified as a maximum or as a minimum. Maximums are specified by just the number, for example 90 means "up to (and including) 90kg." Minimums are specified by a + to the right of the number, for example 90+ means "above (and excluding) 90kg."
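The maximum/minimum convention can be sketched as a small parser (parse_weight_class is a hypothetical helper, not part of the dataset):

```python
def parse_weight_class(wc):
    """Parse a weight-class string into (limit_kg, kind).

    '90'  -> (90.0, 'max')  # up to and including 90 kg
    '90+' -> (90.0, 'min')  # above 90 kg
    """
    if wc.endswith("+"):
        return float(wc[:-1]), "min"
    return float(wc), "max"
```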
Optional. First attempts for each of squat, bench, and deadlift, respectively. Maximum of two decimal places.
Negative values indicate failed attempts.
Not all federations report attempt information. Some federations only report Best attempts.
Optional. Second attempts for each of squat, bench, and deadlift, respectively. Maximum of two decimal places.
Negative values indicate failed attempts.
Not all federations report attempt ...
This dataset represents popular last names in the United States for the White population.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Shark Tank dataset for the US business reality television series. Currently, the dataset covers Shark Tank US seasons 1 through 16. It has 53 fields/columns and 1440+ records.
Below are the features/fields in the dataset:
This dataset represents popular names for each Indian caste (Brahmin, Kshatriya, Vaisya, and Sudra).
MIT License: https://opensource.org/licenses/MIT
This dataset is aggregated from sources such as
Entirely available in the public domain.
Resumes are usually in PDF format. OCR was used to convert the PDFs into text, and LLMs were used to convert the text into a structured format.
This dataset contains structured information extracted from professional resumes, normalized into multiple related tables. The data includes personal information, educational background, work experience, professional skills, and abilities.
Primary table containing core information about each individual.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Unique identifier for each person | Primary Key, Not Null | 1 |
| name | VARCHAR(255) | Full name of the person | May be Null | "Database Administrator" |
| email | VARCHAR(255) | Email address | May be Null | "john.doe@email.com" |
| phone | VARCHAR(50) | Contact number | May be Null | "+1-555-0123" |
| linkedin | VARCHAR(255) | LinkedIn profile URL | May be Null | "linkedin.com/in/johndoe" |
Detailed abilities and competencies listed by individuals.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| ability | TEXT | Description of ability | Not Null | "Installation and Building Server" |
Contains educational history for each person.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| institution | VARCHAR(255) | Name of educational institution | May be Null | "Lead City University" |
| program | VARCHAR(255) | Degree or program name | May be Null | "Bachelor of Science" |
| start_date | VARCHAR(7) | Start date of education | May be Null | "07/2013" |
| location | VARCHAR(255) | Location of institution | May be Null | "Atlanta, GA" |
Details of work experience entries.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| title | VARCHAR(255) | Job title | May be Null | "Database Administrator" |
| firm | VARCHAR(255) | Company name | May be Null | "Family Private Care LLC" |
| start_date | VARCHAR(7) | Employment start date | May be Null | "04/2017" |
| end_date | VARCHAR(7) | Employment end date | May be Null | "Present" |
| location | VARCHAR(255) | Job location | May be Null | "Roswell, GA" |
Mapping table connecting people to their skills.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| skill | VARCHAR(255) | Reference to skills table | Foreign Key, Not Null | "SQL Server" |
Master list of unique skills mentioned across all resumes.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| skill | VARCHAR(255) | Unique skill name | Primary Key, Not Null | "SQL Server" |
-- Get all skills for a person
SELECT s.skill
FROM person_skills ps
JOIN skills s ON ps.skill = s.skill
WHERE ps.person_id = 1;
-- Get complete work history
SELECT *
FROM experience
WHERE person_id = 1
ORDER BY start_date DESC;
-- Most common skills
SELECT s.skill, COUNT(*) as frequency
FROM person_skills ps
JOIN skills s ON ps.skill = s.skill
GROUP BY s.skill
ORDER BY frequency DESC;
GPL-2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset has records for the awarding of the United States Medal of Honor. The Medal of Honor is the United States of America's highest military honor, awarded for personal acts of valor above and beyond the call of duty. The medal is awarded by the President of the United States in the name of the U.S. Congress to U.S. military personnel only. There are three versions of the medal, one for the Army, one for the Navy, and one for the Air Force. Personnel of the Marine Corps and Coast Guard receive the Navy version. The dataset was collected from the official military site, and includes records about how the medal was awarded and characteristics of the recipient. Unfortunately, because of the nature of century-old record keeping, many of the records are incomplete. While a very interesting dataset, it does have some missing data.
| Key | Type | Example Value |
|---|---|---|
| death | Boolean | True |
| name | String | "Sagelhurst, John C." |
| awarded.General Order number | Integer | -1 |
| awarded.accredited to | String | "" |
| awarded.citation | String | "Under a heavy fire from the enemy carried off the field a commissioned officer who was severely wounded and also led a charge on the enemy's rifle pits." |
| awarded.issued | String | "01/03/1906" |
| birth.location name | String | "Buffalo, N.Y." |
| metadata.link | String | "http://www.cmohs.org/recipient-detail/1176/sagelhurst-john-c.php" |
| military record.company | String | "Company B" |
| military record.division | String | "1st New Jersey Cavalry" |
| military record.entered service at | String | "Buffalo, N.Y." |
| military record.organization | String | "U.S. Army" |
| military record.rank | String | "Sergeant" |
| awarded.date.day | Integer | 6 |
| awarded.date.full | String | "1865-2-6" |
| awarded.date.month | Integer | 2 |
| awarded.date.year | Integer | 1865 |
| awarded.location.latitude | Integer | 38 |
| awarded.location.longitude | Integer | -77 |
| awarded.location.name | String | "Hatchers Run Court, Stafford, VA 22554, USA" |
| birth.date.day | Integer | -1 |
| birth.date.month | Integer | -1 |
| birth.date.year | Integer | -1 |
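Judging by the example values, missing data appears to be encoded with sentinels (-1 for integers, "" for strings). A minimal sketch of normalizing such sentinels to None (clean_value is a hypothetical helper; the record below uses values from the examples above):

```python
def clean_value(value):
    """Map the dataset's apparent missing-value sentinels to None."""
    if value == -1 or value == "":
        return None
    return value

record = {
    "name": "Sagelhurst, John C.",
    "birth.date.year": -1,
    "awarded.accredited to": "",
}
cleaned = {k: clean_value(v) for k, v in record.items()}
```

Note that -1 is also a plausible real value for awarded.location.longitude, so a careful cleaning pass would apply the sentinel rule per field rather than globally.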