100+ datasets found
  1. Comprehensive Biomedical Entity Dataset

    • kaggle.com
    zip
    Updated Aug 22, 2024
    Cite
    NikitPatel (2024). Comprehensive Biomedical Entity Dataset [Dataset]. https://www.kaggle.com/datasets/nikitpatel/comprehensive-biomedical-entity-dataset
    Explore at:
    zip (3741133 bytes)
    Dataset updated
    Aug 22, 2024
    Authors
    NikitPatel
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Problem Statement Title: Automated Named Entity Recognition in Biomedical Texts Using Large Language Models

    Objective: Develop an automated system for Named Entity Recognition (NER) in biomedical texts that can accurately identify and categorize various entities within biomedical literature, such as PubMed abstracts, FDA drug descriptions, and patent abstracts. The system should classify these entities into 24 distinct categories, ranging from chemicals and clinical drugs to geographical areas and intellectual property.

    Challenges:

    • Entity Diversity: The dataset includes a wide range of entity types, some specific to biology and medicine (e.g., anatomical structures, genes) and others more general (e.g., geographical areas, organizations). The system needs to be capable of distinguishing between these varied categories.
    • Complex Biomedical Terminology: The text often includes highly specialized terminology, which can be challenging to recognize and categorize accurately.
    • Overlapping Entities: Some entities might overlap in their classifications (e.g., a gene product might also be considered a chemical), making it essential for the model to correctly disambiguate between them.
    • Imbalanced Data: Certain entity types may be more prevalent in the dataset than others, potentially leading to biased predictions if not handled correctly.

  2. Relationship and Entity Extraction Evaluation Dataset (Entities)

    • data.europa.eu
    • data.wu.ac.at
    json
    Updated Oct 30, 2021
    + more versions
    Cite
    Defence Science and Technology Laboratory (2021). Relationship and Entity Extraction Evaluation Dataset (Entities) [Dataset]. https://data.europa.eu/data/datasets/relationship-and-entity-extraction-evaluation-dataset-entities
    Explore at:
    json
    Dataset updated
    Oct 30, 2021
    Dataset authored and provided by
    Defence Science and Technology Laboratory
    Description

    This entities dataset was the output of a project aimed at creating a 'gold standard' dataset that could be used to train and validate machine learning approaches to natural language processing (NLP). The project was carried out by Aleph Insights and Committed Software on behalf of the Defence Science and Technology Laboratory (Dstl). The dataset focuses specifically on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst. It was therefore constructed using documents and structured schemas relevant to the defence and security analysis domain. A number of data subsets were produced (this is the BBC Online data subset). Further information about this data subset (BBC Online) and the others produced (together with licence conditions, attribution and schemas) may be found at the main project GitHub repository webpage (https://github.com/dstl/re3d). Note that the 'entities.json' file is to be used together with the 'documents.json' and 'relations.json' files (also found on this data.gov.uk webpage), with their structures and relationships described on the given GitHub webpage.
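    A minimal Python sketch of joining the three files; the field names used below ("_id", "documentId", "begin", "end", "type") are assumptions to check against the schemas on the re3d GitHub page:

```python
import json

def load_jsonl(path):
    # Each re3d file is assumed to hold one JSON object per line.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Index documents by id, then resolve each entity span against its document.
documents = {d.get("_id"): d for d in load_jsonl("documents.json")}

for ent in load_jsonl("entities.json"):
    doc = documents.get(ent.get("documentId"))
    if doc is None:
        continue
    # "begin"/"end" are assumed to be character offsets into the document text.
    mention = doc.get("text", "")[ent["begin"]:ent["end"]]
    print(ent.get("type"), repr(mention))
```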

  3. English/Turkish Wikipedia Named-Entity Recognition and Text Categorization...

    • data.mendeley.com
    Updated Feb 9, 2017
    + more versions
    Cite
    H. Bahadir Sahin (2017). English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset [Dataset]. http://doi.org/10.17632/cdcztymf4k.1
    Explore at:
    Dataset updated
    Feb 9, 2017
    Authors
    H. Bahadir Sahin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.

    Firstly, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1000 fine-grained entity types for both languages. The Turkish gazetteers contain approximately 300K named entities and the English gazetteers approximately 23M named entities.

    By leveraging large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent and (b) domain-independent. We produce the different versions by post-processing the raw collections. As a result, we introduce three versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. Turkish collections have approximately 700K sentences for each version (the exact count varies between versions), while English collections contain more than 7M sentences.

    We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce the fine-grained types to "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained one. Note that this process also eliminated many domains and fine-grained annotations due to a lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is smaller than in the "Fine-Grained NER" versions.

    All processes are explained in our published white paper for Turkish; the major methods (gazetteer creation, automatic categorization/annotation, noise reduction) are unchanged for English.

  4. The Human Know-How Dataset

    • find.data.gov.scot
    • dtechtive.com
    pdf, zip
    Updated Apr 29, 2016
    Cite
    (2016). The Human Know-How Dataset [Dataset]. http://doi.org/10.7488/ds/1394
    Explore at:
    zip(19.78 MB), zip(0.2837 MB), zip(19.67 MB), zip(69.8 MB), zip(9.433 MB), zip(62.92 MB), zip(20.43 MB), zip(43.28 MB), zip(92.88 MB), zip(13.06 MB), zip(14.86 MB), zip(5.372 MB), zip(0.0298 MB), pdf(0.0582 MB), zip(5.769 MB), zip(90.08 MB)
    Dataset updated
    Apr 29, 2016
    Description

    The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links). For more information see:

    • The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
    • The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions

    • Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally the files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
    • Data representation is based on the PROHOW vocabulary: http://w3id.org/prohow# Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
    • Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements.

    Statistics:

    • 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
    • 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
    • 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs)
    • 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs)
    • 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links)
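    Since the instructions are RDF, a graph library can inspect a downloaded file. A minimal rdflib sketch, where the extracted file name, the serialization format, and the prohow:has_step property IRI are assumptions to verify against the actual dump and the PROHOW vocabulary:

```python
import rdflib

# Minimal sketch, assuming one of the archives has been extracted to an
# N-Triples file; file name, format, and property IRI are assumptions.
g = rdflib.Graph()
g.parse("9of11_knowhow_wikihow.nt", format="nt")

PROHOW = rdflib.Namespace("http://w3id.org/prohow#")
steps = list(g.subject_objects(PROHOW["has_step"]))  # property name assumed
print(len(g), "triples;", len(steps), "prohow:has_step links")
```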

  5. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Cite
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Explore at:
    json
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can easily be confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

    Methods

    Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities whose name is similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who had passed a preliminary trial task. The only accepted tags are those assigned in agreement by no fewer than 5 annotators and then passed through reconciliation with an experienced reconciliator.

    The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

    Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to leave only mentions in good-quality text; each text was cut 1000 characters after the last mention.

    Usage Notes

    Entities: File: Namesakes_entities.jsonl. The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity) or "Other" (meaning that the mention is of some other entity that merely has the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • Key 'pagename': page name of the Wikipedia page.
    • Key 'pageid': page id of the Wikipedia page.
    • Key 'title': title of the Wikipedia page.
    • Key 'url': URL of the Wikipedia page.
    • Key 'text': the text chunk from the Wikipedia page.
    • Key 'entities': list of the mentions in the page text, each represented by a dictionary with the keys:
      • Key 'text': the mention as a string from the page text.
      • Key 'start': start character position of the mention in the text.
      • Key 'end': end (one-past-last) character position of the mention in the text.
      • Key 'tag': annotation tag given as a string, either 'Same' or 'Other'.
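    A minimal Python sketch for reading the file, using only the keys documented above (the one-past-last 'end' convention means plain Python slicing recovers each mention):

```python
import json

# Read the jsonl file named in the description and tally the human tags.
with open("Namesakes_entities.jsonl", encoding="utf-8") as f:
    chunks = [json.loads(line) for line in f if line.strip()]

counts = {"Same": 0, "Other": 0}
for chunk in chunks:
    for ent in chunk["entities"]:
        # 'end' is one-past-last, so slicing recovers the mention string.
        assert chunk["text"][ent["start"]:ent["end"]] == ent["text"]
        counts[ent["tag"]] += 1
print(len(chunks), "chunks:", counts)
```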

    News: File: Namesakes_news.jsonl. The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • Key 'id_text': id of the sample.
    • Key 'text': the text chunk.
    • Key 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
    • Key 'entity': a dictionary describing the annotated entity mention in the text:
      • Key 'text': the mention as a string found by an NER model in the text.
      • Key 'start': start character position of the mention in the text.
      • Key 'end': end (one-past-last) character position of the mention in the text.
      • Key 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
        • Key 'pageid': Wikipedia page id.
        • Key 'pagetitle': page title.
        • Key 'url': page URL.

    Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is identified by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
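    A small Python sketch for parsing this markup; the regular expression is an illustration of the two bracket forms described above, not code shipped with the dataset:

```python
import re

# "[[Entity]]" marks a mention equal to the entity name;
# "[[Entity | surface]]" separates the exact name from the surface form.
MENTION = re.compile(r"\[\[([^\[\]|]+?)(?:\s*\|\s*([^\[\]]+?))?\]\]")

def parse_mentions(text):
    """Return (entity_name, surface_form) pairs and the de-marked text."""
    pairs = []
    def repl(m):
        entity = m.group(1).strip()
        surface = (m.group(2) or entity).strip()
        pairs.append((entity, surface))
        return surface
    return pairs, MENTION.sub(repl, text)

pairs, plain = parse_mentions(
    "Muir also spent time with photographer "
    "[[Carleton E. Watkins | Carleton Watkins]] and studied his photographs."
)
print(pairs)   # [('Carleton E. Watkins', 'Carleton Watkins')]
print(plain)   # ...photographer Carleton Watkins and studied his photographs.
```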

    The Entity-to-Backlinks is a jsonl with 1527 items (file: Namesakes_backlinks_entities.jsonl). Each item is a tuple: entity name, entity Wikipedia page id, and backlink ids (a list of pageids of backlink documents).

    The Backlinks documents is a jsonl with 26903 items (file: Namesakes_backlinks_texts.jsonl). Each item is a dictionary:

    • Key 'pageid': id of the Wikipedia page.
    • Key 'title': title of the Wikipedia page.
    • Key 'content': text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, the cut denoted as '...[CUT]'.
    • Key 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple: entity name, entity Wikipedia page id, and a sorted list of all character indexes at which the mention occurrences start in the text.

  6. Labelled FHYA Dataset

    • zivahub.uct.ac.za
    txt
    Updated Feb 2, 2022
    Cite
    Jarryd Dunn (2022). Labelled FHYA Dataset [Dataset]. http://doi.org/10.25375/uct.19029692.v1
    Explore at:
    txt
    Dataset updated
    Feb 2, 2022
    Dataset provided by
    University of Cape Town
    Authors
    Jarryd Dunn
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection contains the datasets created as part of a master's thesis. The collection consists of two datasets, each in two forms, as well as the corresponding entity descriptions for each of the datasets.

    The experiment_doc_labels_clean documents contain the data used for the experiments. The JSON file consists of a list of JSON objects with the following fields:

    • id: document id.
    • ner_tags: list of IOB tags indicating mention boundaries, based on the majority label assigned using crowdsourcing.
    • el_tags: list of entity ids, based on the majority label assigned using crowdsourcing.
    • all_ner_tags: list of lists of IOB tags assigned by each of the users.
    • all_el_tags: list of lists of entity ids assigned by each of the users annotating the data.
    • tokens: list of tokens from the text.

    The experiment_doc_labels_clean-U.tsv contains the dataset used for the experiments, but in a format similar to the CoNLL-U format. The first line for each document contains the document id, and documents are separated by a blank line. Each word in a document is on its own line consisting of the word, the IOB tag and the entity id, separated by tabs.

    While the experiments were being completed, the annotation system was left open until all the documents had been annotated by three users. This resulted in the all_docs_complete_labels_clean.json and all_docs_complete_labels_clean-U.tsv datasets, which take the same form as the experiment_doc_labels_clean.json and experiment_doc_labels_clean-U.tsv files.

    Each of the documents described above contains entity ids. The ids match the entities stored in the entity_descriptions CSV files. Each row in these files corresponds to a mention of an entity and takes the form: {ID}${Mention}${Context}[N]

    Three sets of entity descriptions are available:

    1. entity_descriptions_experiments.csv: contains all the mentions from the subset of the data used for the experiments as described above. However, the data has not been cleaned, so there are multiple entity ids that actually refer to the same entity.
    2. entity_descriptions_experiments_clean.csv: these entities also cover the data used for the experiments, but duplicate entities have been merged. These entities correspond to the labels for the documents in the experiment_doc_labels_clean files.
    3. entity_descriptions_all.csv: the entities in this file correspond to the data in all_docs_complete_labels_clean. Please note that these entities have not been cleaned, so there may be duplicate or incorrect entities.
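    A minimal Python sketch for reading an entity_descriptions file, assuming '$' never appears in the ID or mention fields (splitting at only the first two separators keeps any '$' inside the context intact):

```python
# Parse rows of the form {ID}${Mention}${Context} described above.
def read_entity_descriptions(path):
    entities = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first two '$' only; assumes ID and mention
            # themselves contain no '$'.
            entity_id, mention, context = line.split("$", 2)
            entities.setdefault(entity_id, []).append((mention, context))
    return entities

entities = read_entity_descriptions("entity_descriptions_experiments_clean.csv")
print(len(entities), "entities")
```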

  7. Identifying Diseases Treatments in Healthcare Data

    • kaggle.com
    zip
    Updated Mar 5, 2025
    Cite
    Sagar Maru (2025). Identifying Diseases Treatments in Healthcare Data [Dataset]. https://www.kaggle.com/datasets/marusagar/identifying-diseases-treatments-in-healthcare-data
    Explore at:
    zip (166655 bytes)
    Dataset updated
    Mar 5, 2025
    Authors
    Sagar Maru
    Description

    Identifying Entities (Diseases, Treatments) in Healthcare Data

    Finding diseases and treatments in medical text—because even AI needs a medical degree to understand doctor’s notes! 🩺🤖

    📊 Understanding the Dataset

    In the contemporary healthcare ecosystem, substantial amounts of unstructured textual data are generated every day through electronic health records (EHRs), doctors' notes, prescriptions, and medical literature. The ability to extract meaningful insights from these records is critical for improving patient care, advancing clinical research, and optimizing healthcare services. The dataset in focus comprises text-based clinical records in which diseases and their corresponding treatments are embedded within unstructured sentences.

    The dataset consists of labeled text samples, organized into:

    • **Train Sentences**: sentences containing clinical information, including patient diagnoses and the treatments administered.
    • **Train Labels**: the corresponding annotations for the train sentences, marking diseases and treatments as named entities.
    • **Test Sentences**: similar to train sentences, but used to evaluate model performance.
    • **Test Labels**: the ground truth labels for the test sentences.

    A sample from the dataset may look as follows:

    🔍 Example from Dataset:

    Train Sentences:

    _ "The patient was a 62 -year -old man with squamous epithelium, who was previously treated with success with a combination of radiation therapy and chemotherapy."

    Train Labels:

    • Disease: 🦠 lung cancer
    • Treatment: 💉 Radiation therapy, chemotherapy

    This dataset requires the use of **Named Entity Recognition (NER)** to extract diseases and map them to their related treatments 💊, turning unstructured medical text into structured data for analytical purposes.

    ⚙️ Dataset Properties

    1. Unstructured medical text: the dataset contains free-form medical notes in which diseases and treatments are mentioned without explicit mapping; extracting this information reliably is a challenge.
    2. Multiple entity types: the dataset contains different named entities such as diseases, treatments, symptoms and possibly medications.
    3. Contextual dependency: many treatments apply to multiple diseases, and correct mapping depends on context. For example, "radiation therapy" is used for different cancers, which makes contextual understanding essential.
    4. Imbalanced data distribution: some diseases and treatments appear more often than others; balancing model performance requires techniques such as oversampling, undersampling or transfer learning.
    5. Domain-specific language: the text is rich in medical terminology, which requires specialized preprocessing using domain-specific NLP techniques and medical ontologies such as UMLS or SNOMED CT.

    🚧 Challenges Working with Dataset

    • Complex medical vocabulary: medical texts use highly specialized terms, requiring NLP models trained on clinical corpora.

    • Implicit Relationships: unlike structured datasets, disease-treatment relationships must be inferred from context rather than being explicitly stated.

    • Synonyms and Abbreviations: diseases and treatments can be referred to by different names (e.g., "myocardial infarction" vs. "heart attack"); handling such variations is vital.

    • Noise in Data: unstructured records may contain irrelevant information, typographical errors, and inconsistencies that affect extraction accuracy.

    🛠️ Approach to Extracting Insights from the Dataset

    To extract diseases and their respective treatments from this dataset, we follow a structured NLP pipeline (a code sketch follows the example output below):

    1. Data Preprocessing 🧹

    • Text Cleaning: remove unnecessary characters, numbers, and stopwords while preserving clinical terms.
    • Tokenization: split sentences into tokens for further processing.
    • Medical Term Standardization: use domain-specific libraries like SciSpacy to standardize synonyms and abbreviations.

    2. Named Entity Recognition (NER) Model Development 🤖

    • Annotation: ensure accurate labeling of diseases and treatments in the dataset.
    • Model Selection: train a deep-learning-based model like BioBERT, or a rule-based model using spaCy.
    • Training: use the annotated data to train a custom NER model that classifies tokens as disease or treatment entities.
    • Evaluation: measure precision, recall, and F1-score to assess model performance.

    3. Mapping Diseases to Treatments 🔄

    • Contextual Relationship Extraction: identify which treatment corresponds to which disease using dependency parsing and relation extraction.
    • Dictionary or Tabular Output: store the extracted mappings in a structured format.

    Example Output:

    | 🦠 Disease | 💉 Treatments | |----------|--------------------...
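    A hedged sketch of the pipeline using spaCy's rule-based EntityRuler as a stand-in for a trained BioBERT or custom NER model; the patterns and the same-sentence mapping heuristic are illustrative assumptions, not part of the dataset:

```python
import spacy

# Rule-based stand-in for the NER step (step 2), plus a naive
# same-sentence disease->treatment mapping (step 3).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "DISEASE", "pattern": "lung cancer"},
    {"label": "TREATMENT", "pattern": "radiation therapy"},
    {"label": "TREATMENT", "pattern": "chemotherapy"},
])

doc = nlp("The patient was a 62-year-old man with lung cancer, previously "
          "treated with radiation therapy and chemotherapy.")

# Attach every TREATMENT in a sentence to the DISEASE(s) in that sentence.
mapping = {}
for sent in doc.sents:
    diseases = [e.text for e in sent.ents if e.label_ == "DISEASE"]
    treatments = [e.text for e in sent.ents if e.label_ == "TREATMENT"]
    for d in diseases:
        mapping.setdefault(d, []).extend(treatments)

print(mapping)  # {'lung cancer': ['radiation therapy', 'chemotherapy']}
```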

  8. National Incorporated Places and Counties

    • catalog.data.gov
    • s.cnmilf.com
    Updated Sep 8, 2023
    + more versions
    Cite
    data.iowa.gov (2023). National Incorporated Places and Counties [Dataset]. https://catalog.data.gov/dataset/national-incorporated-places-and-counties
    Explore at:
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    data.iowa.gov
    Description

    This dataset contains a listing of incorporated places (cities and towns) and counties within the United States, including the GNIS code, FIPS code, name, entity type and primary point (location) for each entity. The types of entities listed in this dataset are based on codes provided by the U.S. Census Bureau, and include the following:

    • C1 - An active incorporated place that does not serve as a county subdivision equivalent.
    • C2 - An active incorporated place legally coextensive with a county subdivision but treated as independent of any county subdivision.
    • C3 - A consolidated city.
    • C4 - An active incorporated place with an alternate official common name.
    • C5 - An active incorporated place that is independent of any county subdivision and serves as a county subdivision equivalent.
    • C6 - An active incorporated place that is partially independent of any county subdivision and serves as a county subdivision equivalent, or partially coextensive with a county subdivision but treated as independent of any county subdivision.
    • C7 - An incorporated place that is independent of any county.
    • C8 - The balance of a consolidated city excluding the separately incorporated place(s) within that consolidated government.
    • C9 - An inactive or nonfunctioning incorporated place.
    • H1 - An active county or statistically equivalent entity.
    • H4 - A legally defined inactive or nonfunctioning county or statistically equivalent entity.
    • H5 - A census area in Alaska, a statistical county equivalent entity.
    • H6 - A county or statistically equivalent entity that is areally coextensive or governmentally consolidated with an incorporated place, part of an incorporated place, or a consolidated city.

  9. Data from: Tough Tables: Carefully Evaluating Entity Linking for Tabular...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jan 14, 2023
    Cite
    Cutrona, Vincenzo; Bianchi, Federico; Jiménez-Ruiz, Ernesto; Palmonari, Matteo (2023). Tough Tables: Carefully Evaluating Entity Linking for Tabular Data [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3840646
    Explore at:
    Dataset updated
    Jan 14, 2023
    Dataset provided by
    Bocconi University
    City, University of London
    University of Milano - Bicocca
    Authors
    Cutrona, Vincenzo; Bianchi, Federico; Jiménez-Ruiz, Ernesto; Palmonari, Matteo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Tough Tables (2T) is a dataset designed to evaluate table annotation approaches in solving the CEA and CTA tasks. The dataset is compliant with the data format used in SemTab 2019, and it can be used as an additional dataset without any modification. The target knowledge graph is DBpedia 2016-10. Check out the 2T GitHub repository for more details about the dataset generation.

    New in v2.0: We release the updated version of 2T_WD! The target knowledge graph is Wikidata (online instance) and the dataset complies with the SemTab 2021 data format.

    This work is based on the following paper:

    Cutrona, V., Bianchi, F., Jimenez-Ruiz, E. and Palmonari, M. (2020). Tough Tables: Carefully Evaluating Entity Linking for Tabular Data. ISWC 2020, LNCS 12507, pp. 1–16.

    Note on License: This dataset includes data from the following sources. Refer to each source for license details:

    • Wikipedia: https://www.wikipedia.org/
    • DBpedia: https://dbpedia.org/
    • Wikidata: https://www.wikidata.org/
    • SemTab 2019: https://doi.org/10.5281/zenodo.3518539
    • GeoDatos: https://www.geodatos.net
    • The Pudding: https://pudding.cool/
    • Offices.net: https://offices.net/
    • DATA.GOV: https://www.data.gov/

    THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

    Changelog:

    v2.0

    New GT for 2T_WD

    A few entities have been removed from the CEA GT, because they are no longer represented in WD (e.g., dbr:Devonté points to wd:Q21155080, which does not exist)

    Table codes and values differ from the previous version because of the random noise.

    Updated ancestor/descendant hierarchies to evaluate CTA.

    v1.0

    New Wikidata version (2T_WD)

    Fix header for tables CTRL_DBP_MUS_rock_bands_labels.csv and CTRL_DBP_MUS_rock_bands_labels_NOISE2.csv (column 2 was reported with id 1 in target - NOTE: the affected column has been removed from the SemTab2020 evaluation)

    Remove duplicated entries in tables

    Remove rows with wrong values (e.g., the Kazakhstan entity has an empty name "''")

    Many rows and noised columns are shuffled/changed due to the random noise generator algorithm

    Remove row "Florida","Floorida","New York, NY" from TOUGH_WEB_MISSP_1000_us_cities.csv (and all its NOISE1 variants)

    Fix header of tables:

    CTRL_WIKI_POL_List_of_current_monarchs_of_sovereign_states.csv

    CTRL_WIKI_POL_List_of_current_monarchs_of_sovereign_states_NOISE2.csv

    TOUGH_T2D_BUS_29414811_2_4773219892816395776_videogames_developers.csv

    TOUGH_T2D_BUS_29414811_2_4773219892816395776_videogames_developers_NOISE2.csv

    v0.1-pre

    First submission. It contains only tables, without GT and Targets.

  10. Counties

    • catalog.data.gov
    • datasets.ai
    • +5more
    Updated Jul 17, 2025
    + more versions
    Cite
    United States Census Bureau (USCB) (Point of Contact) (2025). Counties [Dataset]. https://catalog.data.gov/dataset/counties2
    Explore at:
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    United States Census Bureau (http://census.gov/)
    Description

    The Counties dataset was updated on October 31, 2023 from the United States Census Bureau (USCB) and is part of the U.S. Department of Transportation (USDOT)/Bureau of Transportation Statistics (BTS) National Transportation Atlas Database (NTAD). This resource is a member of a series. The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation.

    The primary legal divisions of most states are termed counties. In Louisiana, these divisions are known as parishes. In Alaska, which has no counties, the equivalent entities are the organized boroughs, city and boroughs, municipalities, and for the unorganized area, census areas. The latter are delineated cooperatively for statistical purposes by the State of Alaska and the Census Bureau. In four states (Maryland, Missouri, Nevada, and Virginia), there are one or more incorporated places that are independent of any county organization and thus constitute primary divisions of their states. These incorporated places are known as independent cities and are treated as equivalent entities for purposes of data presentation. The District of Columbia and Guam have no primary divisions, and each area is considered an equivalent entity for purposes of data presentation. The Census Bureau treats the following entities as equivalents of counties for purposes of data presentation: Municipios in Puerto Rico, Districts and Islands in American Samoa, Municipalities in the Commonwealth of the Northern Mariana Islands, and Islands in the U.S. Virgin Islands. The entire area of the United States, Puerto Rico, and the Island Areas is covered by counties or equivalent entities.

    The boundaries for counties and equivalent entities are mostly as of January 1, 2023, as reported through the Census Bureau's Boundary and Annexation Survey (BAS). A data dictionary, or other source of attribute information, is accessible at https://doi.org/10.21949/1529015
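    For users working directly with the underlying TIGER/Line shapefiles, a minimal GeoPandas sketch; the file name follows the Census Bureau's usual tl_<year>_us_county naming and is an assumption, as are the column names:

```python
import geopandas as gpd

# Load a downloaded TIGER/Line counties archive (GeoPandas reads zipped
# shapefiles directly). File and column names are assumptions to verify.
counties = gpd.read_file("tl_2023_us_county.zip")
print(counties[["STATEFP", "COUNTYFP", "GEOID", "NAME"]].head())
```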

  11. Simple download service (Atom) of the dataset: Linear entities at the origin...

    • data.europa.eu
    unknown
    Updated Jan 26, 2022
    Cite
    (2022). Simple download service (Atom) of the dataset: Linear entities at the origin of the risk of the RPP Inondation de La Loue [Dataset]. https://data.europa.eu/data/datasets/fr-120066022-srv-c9f48f89-301b-4774-997d-47521b5d4d93?locale=en
    Explore at:
    unknown
    Dataset updated
    Jan 26, 2022
    Description

    The origin of the risk characterises the real-world entity which, through its presence, represents a potential risk. This origin may be characterised by a name and, in some cases, a geographical object locating the actual entity causing the risk. The location of the entity and the knowledge of the dangerous phenomenon are used to define the risk pools, the risk-exposed areas that underpin the RPP. For NRPPs, this entity may, for example, correspond to a watercourse or a geologically unstable area.

  12. New York Times Relation Extraction Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2021
    Cite
    Shantanu Tripathi (2021). New York Times Relation Extraction Dataset [Dataset]. https://www.kaggle.com/daishinkan002/new-york-times-relation-extraction-dataset
    Explore at:
    zip (9453175 bytes)
    Dataset updated
    Jul 31, 2021
    Authors
    Shantanu Tripathi
    Description

    Context

    Relationship extraction is the task of extracting semantic relationships from a text. Extracted relationships usually occur between two or more entities in a particular text (e.g. person, organization, location) and can fall into many categories (e.g. married, employed, residential).

    Content

    This dataset contains 24 types of relations that may occur in a sentence; each sentence can have more than one relation.

    Types of relations include:

    1. /location/location/contains
    2. /people/person/nationality
    3. /people/person/place_lived
    4. /business/person/company
    5. /location/country/capital
    6. /location/neighborhood/neighborhood_of
    7. /people/person/place_of_birth
    8. /location/country/administrative_divisions
    9. /location/administrative_division/country
    10. /people/deceased_person/place_of_death
    11. /people/person/children
    12. /business/company/founders
    13. /business/company/place_founded
    14. /business/company_shareholder/major_shareholder_of
    15. /sports/sports_team_location/teams
    16. /sports/sports_team/location
    17. /business/company/major_shareholders
    18. /people/person/religion
    19. /business/company/advisors
    20. /people/ethnicity/geographic_distribution
    21. /people/person/ethnicity
    22. /people/ethnicity/people
    23. /people/person/profession
    24. /business/company/industry

    Acknowledgements

    This dataset wouldn't be possible without The New York Times.

    Inspiration

    This dataset can play a significant role in modelling many Natural Language Processing applications, such as sentence-level relation extraction.

    PLEASE UPVOTE IF YOU LIKE IT

  13. bcorp_web

    • huggingface.co
    Updated Aug 8, 2023
    Cite
    Bhuvanesh Verma (2023). bcorp_web [Dataset]. https://huggingface.co/datasets/bhuvi/bcorp_web
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 8, 2023
    Authors
    Bhuvanesh Verma
    Description

    Dataset Summary

    This dataset contains web text crawled from the B Corp website using Hyphe. Hyphe found more than 1000 outlinks from the B Corp website, many of which were B Corp certified organisations. The dataset contains web text for those organisations. The list of B Corp certified organisations is dynamic, so only around 600 organisations were selected for this dataset. There were no specific criteria for this selection.

    Languages

    Primarily English, but contains web data… See the full description on the dataset page: https://huggingface.co/datasets/bhuvi/bcorp_web.
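    Since the dataset is hosted on the Hugging Face Hub, it should be loadable with the datasets library; a minimal sketch (split names are not documented here, so we just inspect what comes back):

```python
from datasets import load_dataset

# Load the dataset by the Hub id given above and inspect its splits/features.
ds = load_dataset("bhuvi/bcorp_web")
print(ds)
```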

  14. Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links...

    • academictorrents.com
    bittorrent
    Updated Mar 4, 2017
    + more versions
    Cite
    Sameer Singh and Amarnag Subramanya and Fernando Pereira and Andrew McCallum (2017). Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia (Extended Dataset) [Dataset]. https://academictorrents.com/details/689af6f153e097538ad7b8fd4ea3e87ce8f6bc42
    Explore at:
    bittorrent (194817430579)
    Dataset updated
    Mar 4, 2017
    Dataset authored and provided by
    Sameer Singh and Amarnag Subramanya and Fernando Pereira and Andrew McCallum
    License

    No license specified (https://academictorrents.com/nolicensespecified)

    Description

    Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally-occurring cross-document reference data to create the Wikilinks dataset, comprising 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people. ### Introduction The Wikipedia links (WikiLinks) data consists of web pages that satisfy the following two constraints: a. conta

  15. Learning Privacy from Visual Entities - Curated data sets and pre-computed...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    Alessio Xompero; Alessio Xompero; Andrea Cavallaro; Andrea Cavallaro (2025). Learning Privacy from Visual Entities - Curated data sets and pre-computed visual entities [Dataset]. http://doi.org/10.5281/zenodo.15348506
    Explore at:
    zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alessio Xompero; Alessio Xompero; Andrea Cavallaro; Andrea Cavallaro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This repository contains the curated image privacy datasets and pre-computed visual entities used in the publication Learning Privacy from Visual Entities by A. Xompero and A. Cavallaro.
    [arxiv][code]

    Curated image privacy data sets

    In the article, we trained and evaluated models on the Image Privacy Dataset (IPD) and the PrivacyAlert dataset. The datasets are originally provided by other sources and have been re-organised and curated for this work.

    Our curation organises the datasets in a common structure. We updated the annotations and labelled the splits of the data in the annotation file. This avoids having separate folders of images for each data split (training, validation, testing) and allows flexible handling of new splits, e.g. splits created with a stratified K-fold cross-validation procedure. As for the original datasets (PicAlert and PrivacyAlert), we provide links to the images in bash scripts that download them. Another bash script re-organises the images into sub-folders with a maximum of 1000 images each.

    Both datasets refer to images publicly available on Flickr. These images have a large variety of content, including sensitive content, seminude people, vehicle plates, documents and private events. Images were annotated with a binary label denoting whether the content was deemed public or private. As the images are publicly available, their label is mostly public, so these datasets have a high imbalance towards the public class. Note that IPD combines two other existing datasets, PicAlert and part of VISPR, to increase the number of private images, which is limited in PicAlert. Further details are in our corresponding publication: https://doi.org/10.48550/arXiv.2503.12464

    List of datasets and their original source:

    Notes:

    • For PicAlert and PrivacyAlert, only URLs to the original locations on Flickr are available in the Zenodo record.
    • The collector and authors of the PrivacyAlert dataset selected the images from Flickr under the Public Domain license.
    • Owners of the photos on Flickr could have removed the photos from the social media platform.
    • Running the bash scripts to download the images can incur "429 Too Many Requests" status codes.

    Pre-computed visual entities

    Some of the models run their pipeline end-to-end with the images as input, whereas other models require different or additional inputs. These inputs include the pre-computed visual entities (scene types and object types) represented in a graph format, e.g. for a Graph Neural Network. Re-using these pre-computed visual entities allows other researchers to build new models on top of these features while avoiding re-computing them from scratch, or re-computing them for each epoch during the training of a model (faster training).

    For each image of each dataset, namely PrivacyAlert, PicAlert, and VISPR, we provide the predicted scene probabilities as a .csv file, the detected objects as a .json file in the COCO data format, and the node features (visual entities already organised in graph format with their features) as a .json file. For consistency, all the files are organised in batches following the structure of the images in the dataset folders. For each dataset, we also provide the pre-computed adjacency matrix for the graph data.

    Note: IPD is based on PicAlert and VISPR, and therefore IPD refers to the scene probabilities and object detections of those two datasets. Both PicAlert and VISPR must be downloaded and prepared in order to use IPD for training and testing.
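    A minimal sketch of loading the pre-computed inputs for one batch; the file names and layout below are illustrative assumptions, with the exact structure documented in the GitHub repository:

```python
import json
import pandas as pd

# Scene probabilities (.csv), COCO-format detections (.json), and graph
# node features (.json); file names are placeholders for illustration.
scene_probs = pd.read_csv("scene_probabilities.csv")
with open("objects_coco.json", encoding="utf-8") as f:
    detections = json.load(f)
with open("node_features.json", encoding="utf-8") as f:
    graph_nodes = json.load(f)

print(scene_probs.shape, len(detections.get("annotations", [])))
```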

    Further details on downloading and organising data can be found in our GitHub repository: https://github.com/graphnex/privacy-from-visual-entities (see ARTIFACT-EVALUATION.md#pre-computed-visual-entitities-)

    Enquiries, questions and comments

    If you have any enquiries, question, or comments, or you would like to file a bug report or a feature request, use the issue tracker of our GitHub repository.

  16. Relationship and Entity Extraction Evaluation Dataset (Documents)

    • data.europa.eu
    • data.wu.ac.at
    json
    Updated Jun 30, 2022
    Cite
    Defence Science and Technology Laboratory (2022). Relationship and Entity Extraction Evaluation Dataset (Documents) [Dataset]. https://data.europa.eu/data/datasets/relationship-and-entity-extraction-evaluation-dataset?locale=ga
    Explore at:
    json
    Dataset updated
    Jun 30, 2022
    Dataset authored and provided by
    Defence Science and Technology Laboratory
    Description

    This document dataset was the output of a project aimed at creating a 'gold standard' dataset that could be used to train and validate machine learning approaches to natural language processing (NLP). The project was carried out by Aleph Insights and Committed Software on behalf of the Defence Science and Technology Laboratory (Dstl). The dataset focuses specifically on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst. It was therefore constructed using documents and structured schemas relevant to the defence and security analysis domain. A number of data subsets were produced (this is the BBC Online data subset). Further information about this data subset (BBC Online) and the others produced (together with licence conditions, attribution and schemas) may be found at the main project GitHub repository webpage (https://github.com/dstl/re3d). Note that the 'documents.json' file is to be used together with the 'entities.json' and 'relations.json' files (also found on this data.gov.uk webpage), with their structures and relationships described on the given GitHub webpage.

  17. Data from: Public Health Departments

    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    • nconemap.gov
    • +3more
    Updated Jan 17, 2018
    Cite
    CA Governor's Office of Emergency Services (2018). Public Health Departments [Dataset]. https://arc-gis-hub-home-arcgishub.hub.arcgis.com/maps/29c3979a34ba4d509582a0e2adf82fd3
    Explore at:
    Dataset updated
    Jan 17, 2018
    Dataset authored and provided by
    CA Governor's Office of Emergency Services
    Description

    State and Local Public Health Departments in the United States. Governmental public health departments are responsible for creating and maintaining conditions that keep people healthy. A local health department may be locally governed, part of a region or district, be an office or an administrative unit of the state health department, or a hybrid of these. Furthermore, each community has a unique "public health system" comprising individuals and public and private entities that are engaged in activities that affect the public's health. (Excerpted from the Operational Definition of a functional local health department, National Association of County and City Health Officials, November 2005.) Please reference http://www.naccho.org/topics/infrastructure/accreditation/upload/OperationalDefinitionBrochure-2.pdf for more information.

    Facilities involved in direct patient care are intended to be excluded from this dataset; however, some of the entities represented in this dataset serve as both administrative and clinical locations. This dataset only includes the headquarters of Public Health Departments, not their satellite offices. Some health departments encompass multiple counties; therefore, not every county will be represented by an individual record. Also, some areas will appear to have over-representation depending on the structure of the health departments in that particular region. Town health officers are included in Vermont and boards of health are included in Massachusetts. Both of these types of entities are elected or appointed to a term of office during which they make and enforce policies and regulations related to the protection of public health. Visiting nurses are represented in this dataset if they are contracted through the local government to fulfill the duties and responsibilities of the local health organization. Since many town health officers in Vermont work out of their personal homes, TechniGraphics represented these entities at the town hall. This is denoted in the [DIRECTIONS] field. Effort was made by TechniGraphics to verify whether or not each health department tracks statistics on communicable diseases.

    Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard HSIP fields populated by TechniGraphics. Double spaces were replaced by single spaces in these same fields. At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on this field, the oldest record dates from 11/18/2009 and the newest record dates from 01/08/2010.

  18. IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...

    • data.niaid.nih.gov
    Updated Jan 27, 2024
    Cite
    Gusmita, Ria Hari; Firmansyah, Asep Fajar (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7454891
    Explore at:
    Dataset updated
    Jan 27, 2024
    Dataset provided by
    State Islamic University Syarif Hidayatullah Jakarta; Paderborn University
    Authors
    Gusmita, Ria Hari; Firmansyah, Asep Fajar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IndQNER

    IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

    3117 sentences

    62027 tokens

    2475 named entities

    18 named entity categories
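    A small sketch of the BIO scheme mentioned above, with an invented Indonesian example (not taken from the dataset) and the usual span-recovery loop:

```python
# B- opens an entity, I- continues it, O is outside any entity.
# Tokens and tags below are an invented illustration only.
tokens = ["Nabi", "Musa", "menerima", "Taurat"]
tags   = ["B-Prophet", "I-Prophet", "O", "B-Holy_book"]

# Recover (entity, class) spans from the tag sequence.
spans, current = [], None
for tok, tag in zip(tokens, tags):
    if tag.startswith("B-"):
        current = [tok, tag[2:]]
        spans.append(current)
    elif tag.startswith("I-") and current is not None:
        current[0] += " " + tok
    else:
        current = None
print([tuple(s) for s in spans])
# [('Nabi Musa', 'Prophet'), ('Taurat', 'Holy_book')]
```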

    Named Entity Classes

    The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

    Allah

    Allah's Throne

    Artifact

    Astronomical body

    Event

    False deity

    Holy book

    Language

    Angel

    Person

    Messenger

    Prophet

    Sentient

    Afterlife location

    Geographical location

    Color

    Religion

    Food

    Fruit

    The book of Allah

    Annotation Stage

    There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

    Anggita Maharani Gumay Putri

    Muhammad Destamal Junas

    Naufaldi Hafidhigbal

    Nur Kholis Azzam Ubaidillah

    Puspitasari

    Septiany Nur Anggita

    Wilda Nurjannah

    William Santoso

    Verification Stage

    We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

    Dr. Eva Nugraha, M.Ag.

    Dr. Jauhar Azizy, MA

    Dr. Lilik Ummi Kultsum, MA

    Evaluation

    We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

    Supervised Learning Setting

    The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

    | Maximum sequence length | Epochs | Precision | Recall | F1 score |
    |---|---|---|---|---|
    | 256 | 10 | 0.94 | 0.92 | 0.93 |
    | 256 | 20 | 0.99 | 0.97 | 0.98 |
    | 256 | 40 | 0.96 | 0.96 | 0.96 |
    | 256 | 100 | 0.97 | 0.96 | 0.96 |
    | 512 | 10 | 0.92 | 0.92 | 0.92 |
    | 512 | 20 | 0.96 | 0.95 | 0.96 |
    | 512 | 40 | 0.97 | 0.95 | 0.96 |
    | 512 | 100 | 0.97 | 0.95 | 0.96 |

    Transfer Learning Setting

    We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

    | Maximum sequence length | Epochs | Precision | Recall | F1 score |
    |---|---|---|---|---|
    | 256 | 10 | 0.67 | 0.65 | 0.65 |
    | 256 | 20 | 0.60 | 0.59 | 0.59 |
    | 256 | 40 | 0.75 | 0.72 | 0.71 |
    | 256 | 100 | 0.73 | 0.68 | 0.68 |
    | 512 | 10 | 0.72 | 0.62 | 0.64 |
    | 512 | 20 | 0.62 | 0.57 | 0.58 |
    | 512 | 40 | 0.72 | 0.66 | 0.67 |
    | 512 | 100 | 0.68 | 0.68 | 0.67 |

    This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

    How to Cite

    @InProceedings{10.1007/978-3-031-35320-8_12,
      author="Gusmita, Ria Hari and Firmansyah, Asep Fajar and Moussallem, Diego and Ngonga Ngomo, Axel-Cyrille",
      editor="M{\'e}tais, Elisabeth and Meziane, Farid and Sugumaran, Vijayan and Manning, Warren and Reiff-Marganiec, Stephan",
      title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
      booktitle="Natural Language Processing and Information Systems",
      year="2023",
      publisher="Springer Nature Switzerland",
      address="Cham",
      pages="170--185",
      abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
      isbn="978-3-031-35320-8"
    }

    Contact

    If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id

  19. d

    Insurance Producer Business Entities Licensed in Iowa

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Nov 22, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    data.iowa.gov (2025). Insurance Producer Business Entities Licensed in Iowa [Dataset]. https://catalog.data.gov/dataset/insurance-producer-business-entities-licensed-in-iowa
    Explore at:
    Dataset updated
    Nov 22, 2025
    Dataset provided by
    data.iowa.gov
    Area covered
    Iowa
    Description

    All individual insurance producers conducting business in Iowa must be licensed. Business entities, such as corporations, associations, partnerships, limited liability companies, limited liability partnerships, or other legal entities, may choose to become licensed by completing an application and paying the license fee through the National Insurance Producer Registry (NIPR), following the requirements in Iowa Administrative Code rule 191.10.18. This dataset contains a listing of producer business entities licensed in Iowa.

  20. W

    Emergency Medical Service Stations

    • wifire-data.sdsc.edu
    • gis-calema.opendata.arcgis.com
    csv, esri rest +4
    Updated May 22, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    CA Governor's Office of Emergency Services (2019). Emergency Medical Service Stations [Dataset]. https://wifire-data.sdsc.edu/dataset/emergency-medical-service-stations
    Explore at:
    geojson, zip, csv, kml, html, esri restAvailable download formats
    Dataset updated
    May 22, 2019
    Dataset provided by
    CA Governor's Office of Emergency Services
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description
    The dataset represents Emergency Medical Services (EMS) locations in the United States and its territories. EMS stations are part of the Fire Stations / EMS Stations HSIP Freedom sub-layer, which in turn is part of the Emergency Services and Continuity of Government Sector, which is itself part of the Critical Infrastructure Category. The EMS stations dataset consists of any location where emergency medical service (EMS) personnel are stationed or based out of, or where equipment that such personnel use in carrying out their jobs is stored for ready use. Ambulance services are included even if they only provide transportation services, but not if they are located at, and operated by, a hospital; an independent ambulance service or EMS provider that happens to be collocated with a hospital is included.

    The dataset includes both private and governmental entities, and a concerted effort was made to include all emergency medical service locations in the United States and its territories. It consists entirely of license-free data. Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results, and all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics.

    The currentness of this dataset is indicated by the [CONTDATE] field. Based upon this field, the oldest record dates from 12/29/2004 and the newest record dates from 01/11/2010.

    Homeland Security use cases describe how the data may be used and help to define and clarify requirements:

    1. An assessment of whether or not the total emergency medical services capability in a given area is adequate.
    2. A list of resources for surrounding areas to draw upon when local resources have temporarily been overwhelmed by a disaster; route analysis can determine which entities are able to respond the quickest.
    3. A resource for Emergency Management planning purposes.
    4. A resource for catastrophe response, to aid outside responders in retrieving equipment needed to deal with the disaster.
    5. A resource for situational awareness planning and response for Federal Government events.

