71 datasets found
  1. Annotated Web Tables

    • live.european-language-grid.eu
    csv
    Updated Sep 25, 2021
    Cite
    (2021). Annotated Web Tables [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7387
    Explore at:
    csv (available download formats)
    Dataset updated
    Sep 25, 2021
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data sets used for experimental evaluation in the related publication: Vasilis Efthymiou, Oktie Hassanzadeh, Mariano Rodríguez-Muro, Vassilis Christophides. Matching Web Tables with Knowledge Base Entities: From Entity Lookups to Entity Embeddings. International Semantic Web Conference (1) 2017: 260-277.

    The gold standard data sets are collections of web tables:

    - T2D (v1) consists of a schema-level gold standard of 1,748 Web tables, manually annotated with class- and property-mappings, as well as an entity-level gold standard of 233 Web tables.
    - Limaye consists of 400 manually annotated Web tables with entity-, class-, and property-level correspondences, where single cells (not rows) are mapped to entities. The corrected version of this gold standard is adapted to annotate rows with entities, based on the annotations of the label-column cells.
    - WikipediaGS is an instance-level gold standard developed from 485K Wikipedia tables, in which links in the label column are used to infer the annotation of a row to a DBpedia entity.

    Note on license: please refer to the README.txt. Data is derived from Wikipedia, and other sources may have different licenses. Wikipedia contents can be shared under the terms of the Creative Commons Attribution-ShareAlike License, as outlined on Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content. The correspondences of the T2D gold standard are provided under the terms of the Apache license. The Web tables are provided according to the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus. The DBpedia subset is licensed under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License that apply to DBpedia.

    The Limaye gold standard was downloaded from http://websail-fe.cs.northwestern.edu/TabEL/ (download date: August 25, 2016). Please refer to the original website and the following paper for more details and citation information: G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and Searching Web Tables Using Entities, Types and Relationships. PVLDB, 3(1):1338–1347, 2010.

    THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  2. AOP-Wiki Event Component Annotation

    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). AOP-Wiki Event Component Annotation [Dataset]. https://catalog.data.gov/dataset/aop-wiki-event-component-annotation
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains ontology terms associated with key events from the AOP-Wiki. This information was used to seed the AOP-Wiki with a carefully selected set of ontology terms prior to opening up the option for authors to tag their own AOPs. This is intended to provide existing examples for authors and improve consistency when assigning terms to the key events. This dataset is associated with the following publication: Ives, C., I. Campia, R. Wang, C. Wittwehr, and S. Edwards. Creating a Structured Adverse Outcome Pathway Knowledgebase via Ontology-Based Annotations. Applied In Vitro Toxicology. Mary Ann Liebert, Inc., Larchmont, NY, USA, 3(4): 298-311, (2017).

  3. Wikipedia video games similarity dataset with expert annotations

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 16, 2022
    Cite
    Dvir Ginzburg (2022). Wikipedia video games similarity dataset with expert annotations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6088354
    Explore at:
    Dataset updated
    Feb 16, 2022
    Dataset authored and provided by
    Dvir Ginzburg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A video games NLP dataset extracted from Wikipedia.

    For all articles, the figures and tables have been filtered out, as well as the categories and "see also" sections.

    The article structure, and particularly the sub-titles and paragraphs, is kept in these pieces.

    Provided as well are 90 seeds with recommended articles, annotated by human experts.

  4. English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset

    • data.mendeley.com
    Updated Feb 9, 2017
    + more versions
    Cite
    H. Bahadir Sahin (2017). English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset [Dataset]. http://doi.org/10.17632/cdcztymf4k.1
    Explore at:
    Dataset updated
    Feb 9, 2017
    Authors
    H. Bahadir Sahin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.

    Firstly, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1,000 fine-grained entity types for both languages. The Turkish gazetteers contain approximately 300K named entities and the English gazetteers contain approximately 23M named entities.

    By leveraging the large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent and (b) domain-independent. We produce two different versions by post-processing the raw collections. As a result of this process, we introduced 3 versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. The Turkish collections have approximately 700K sentences for each version (the exact count varies between versions), while the English collections contain more than 7M sentences.

    We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce fine-grained types to "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained version. Note that this process also eliminated many domains and fine-grained annotations due to a lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is lower compared to the "Fine-Grained NER" versions.

    All processes are explained in our published white paper for Turkish; however, the major methods (gazetteer creation, automatic categorization/annotation, noise reduction) do not change for English.

  5. Data from: Learning multilingual named entity recognition from Wikipedia

    • figshare.com
    bz2
    Updated May 30, 2023
    Cite
    Joel Nothman; Nicky Ringland; Will Radford; Tara Murphy; James R Curran (2023). Learning multilingual named entity recognition from Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.5462500.v1
    Explore at:
    bz2 (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Joel Nothman; Nicky Ringland; Will Radford; Tara Murphy; James R Curran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data associated with Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy and James R. Curran (2013), "Learning multilingual named entity recognition from Wikipedia", Artificial Intelligence 194 (DOI: 10.1016/j.artint.2012.03.006). A preprint is included here as wikiner-preprint.pdf. This data was originally available at http://schwa.org/resources (which linked to http://schwa.org/projects/resources/wiki/Wikiner).

    The .bz2 files are NER training corpora produced as reported in the Artificial Intelligence paper. wp2 and wp3 are differentiated by wp3 using a higher level of link inference. They use a pipe-delimited format that can be converted to CoNLL 2003 format with system2conll.pl; a rough conversion sketch follows below.

    nothman08types.tsv is a manual classification of articles first used in Joel Nothman, James R. Curran and Tara Murphy (2008), "Transforming Wikipedia into Named Entity Training Data", in Proceedings of the Australasian Language Technology Association Workshop 2008 (http://aclanthology.coli.uni-saarland.de/pdf/U/U08/U08-1016.pdf).

    popular.tsv and random.tsv are manual article classifications developed for the Artificial Intelligence paper, based on different strategies for sampling articles from Wikipedia in order to account for Wikipedia's biased distribution (see that paper). scheme.tsv maps these fine-grained labels to coarser annotations, including CoNLL 2003-style.

    wikigold.conll.txt is a manual NER annotation of some Wikipedia text, as presented in Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy and James R. Curran (2009), in Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources (http://www.aclweb.org/anthology/W/W09/W09-3302).

    See also corpora produced similarly in an enhanced version of this work (Pan et al., "Cross-lingual Name Tagging and Linking for 282 Languages", ACL 2017) at http://nlp.cs.rpi.edu/wikiann/.
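
    The sketch below illustrates that conversion in Python rather than Perl. It is a minimal sketch only, assuming (not verified against system2conll.pl) that each non-empty line holds one sentence of space-separated token|POS|NER triples; the example file name is hypothetical.

    import bz2

    def wikiner_to_conll(in_path: str, out_path: str) -> None:
        """Convert a WikiNER .bz2 corpus into CoNLL 2003-style columns.

        Assumed input format (check system2conll.pl for the authoritative
        conversion): one sentence per line, tokens separated by spaces,
        each token written as "token|POS|NER".
        """
        with bz2.open(in_path, "rt", encoding="utf-8") as src, \
                open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                line = line.strip()
                if not line:
                    continue
                for triple in line.split(" "):
                    parts = triple.split("|")
                    if len(parts) != 3:      # skip malformed tokens rather than guess
                        continue
                    token, pos, ner = parts
                    dst.write(f"{token} {pos} {ner}\n")
                dst.write("\n")              # blank line marks a sentence boundary

    # Example call (file name is hypothetical):
    # wikiner_to_conll("aij-wikiner-en-wp3.bz2", "wikiner-en-wp3.conll")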

  6. Wikipedia Gene Ontology Annotations (WikiGOA)

    • data.niaid.nih.gov
    Updated Sep 15, 2022
    Cite
    Tiago Lubiana (2022). Wikipedia Gene Ontology Annotations (WikiGOA) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6624353
    Explore at:
    Dataset updated
    Sep 15, 2022
    Dataset provided by
    Thomaz Luscher Dias
    Tiago Lubiana
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Gene Ontology Annotation gene sets obtained via Wikidata.

    Gene Ontology terms with English Wikipedia pages were obtained from Wikidata using the SPARQL query language via the Wikidata Query Service: https://w.wiki/5DsS

    Gene Ontology Annotations were obtained indirectly from the UniProt-GoA database (https://www.ebi.ac.uk/GOA/publications) via mappings on Wikidata done by the ProteinBoxBot (https://elifesciences.org/articles/52614)
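
    For orientation, here is a minimal Python sketch of the kind of query involved. The SPARQL is illustrative only (the exact query used is the one behind https://w.wiki/5DsS); it assumes Wikidata property P686 (Gene Ontology ID) and the standard schema:about pattern for English Wikipedia sitelinks.

    import requests

    # Illustrative only: the canonical WikiGOA query is at https://w.wiki/5DsS.
    # Assumptions: P686 is Wikidata's Gene Ontology ID property, and English
    # Wikipedia sitelinks are matched via schema:about / schema:isPartOf.
    QUERY = """
    SELECT ?item ?goId ?article WHERE {
      ?item wdt:P686 ?goId .
      ?article schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/> .
    }
    LIMIT 100
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wikigoa-example/0.1 (illustrative)"},
        timeout=60,
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["goId"]["value"], row["article"]["value"])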

    T.L. is funded by grant #2019/26284-1 from the São Paulo Research Foundation (FAPESP). T.L.D. is funded by a grant from the VSV-EBOPLUS Consortium.

  7. Dataset of Pairs of an Image and Tags for Cataloging Image-based Records

    • narcis.nl
    • data.mendeley.com
    Updated Apr 19, 2022
    Cite
    Suzuki, T (via Mendeley Data) (2022). Dataset of Pairs of an Image and Tags for Cataloging Image-based Records [Dataset]. http://doi.org/10.17632/msyc6mzvhg.2
    Explore at:
    Dataset updated
    Apr 19, 2022
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Suzuki, T (via Mendeley Data)
    Description

    Brief Explanation

    This dataset was created to develop and evaluate a cataloging system that assigns appropriate metadata to an image record for database management in digital libraries. It is intended for evaluating a task in which, given an image and its assigned tags, an appropriate Wikipedia page is selected for each of the given tags.

    A main characteristic of the dataset is that it includes ambiguous tags, so the visual contents of images are not unique to their tags. For example, it includes the tag 'mouse', which here means not the mammal but a computer controller device. The annotations are the corresponding Wikipedia articles for the tags, chosen as correct entities by human judgement.

    The dataset offers both data and programs that reproduce experiments for the above-mentioned task. Its data consist of sources of images and annotations. The image sources are URLs of 420 images uploaded to Flickr. The annotations are a total of 2,464 relevant Wikipedia pages manually judged for the tags of the images. The dataset also provides programs in a Jupyter notebook (scripts.ipynb) to conduct a series of experiments running some baseline methods for the designated task and evaluating the results; a small download sketch follows below.

    Structure of the Dataset

    1. data directory
       1.1. image_URL.txt: lists URLs of the image files.
       1.2. rels.txt: lists the correct Wikipedia pages for each topic in topics.txt.
       1.3. topics.txt: lists each target pair (called a topic in this dataset) of an image and a tag to be disambiguated.
       1.4. enwiki_20171001.xml: texts extracted from the title and body parts of English Wikipedia articles as of 1st October 2017. This is modified data from the Wikipedia dump (https://archive.org/download/enwiki-20171001).
    2. img directory: a placeholder directory into which the image files are downloaded.
    3. results directory: a placeholder directory to store result files for evaluation. It contains three results of baseline methods in sub-directories. They contain JSON files, each of which is the result for one topic, and are ready to be evaluated using the evaluation scripts in scripts.ipynb, for reference of both usage and performance.
    4. scripts.ipynb: the scripts for running the baseline methods and the evaluation are provided in this Jupyter notebook file.
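
    As a companion to scripts.ipynb, the sketch below shows one way to populate the img placeholder directory. It is a minimal sketch assuming data/image_URL.txt holds one URL per line; check the actual file layout and the notebook for the authoritative download and evaluation code.

    import os
    import urllib.request

    # Minimal sketch for fetching the Flickr images into the img/ placeholder.
    # Assumption: data/image_URL.txt lists one image URL per line.
    os.makedirs("img", exist_ok=True)
    with open("data/image_URL.txt", encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            target = os.path.join("img", url.rsplit("/", 1)[-1])
            if not os.path.exists(target):        # skip files already fetched
                urllib.request.urlretrieve(url, target)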

  8. DAWT Dataset

    • paperswithcode.com
    Cite
    Nemanja Spasojevic; Preeti Bhargava; Guoning Hu, DAWT Dataset [Dataset]. https://paperswithcode.com/dataset/dawt
    Explore at:
    Authors
    Nemanja Spasojevic; Preeti Bhargava; Guoning Hu
    Description

    The DAWT dataset consists of Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of the entity. The data set contains a total of 13.6M articles, 5.0B tokens, and 13.8M mention-entity co-occurrences. DAWT contains 4.8 times more anchor text to entity links than originally present in the Wikipedia markup. Moreover, it spans several languages, including English, Spanish, Italian, German, French and Arabic.

  9. Data from: Tough Tables: Carefully Evaluating Entity Linking for Tabular Data

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 14, 2023
    Cite
    Cutrona, Vincenzo (2023). Tough Tables: Carefully Evaluating Entity Linking for Tabular Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3840646
    Explore at:
    Dataset updated
    Jan 14, 2023
    Dataset provided by
    Palmonari, Matteo
    Jiménez-Ruiz, Ernesto
    Bianchi, Federico
    Cutrona, Vincenzo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Tough Tables (2T) is a dataset designed to evaluate table annotation approaches in solving the CEA and CTA tasks. The dataset is compliant with the data format used in SemTab 2019, and it can be used as an additional dataset without any modification. The target knowledge graph is DBpedia 2016-10. Check out the 2T GitHub repository for more details about the dataset generation.

    New in v2.0: We release the updated version of 2T_WD! The target knowledge graph is Wikidata (online instance) and the dataset complies with the SemTab 2021 data format.

    This work is based on the following paper:

    Cutrona, V., Bianchi, F., Jimenez-Ruiz, E. and Palmonari, M. (2020). Tough Tables: Carefully Evaluating Entity Linking for Tabular Data. ISWC 2020, LNCS 12507, pp. 1–16.

    Note on License: This dataset includes data from the following sources. Refer to each source for license details:

    - Wikipedia: https://www.wikipedia.org/
    - DBpedia: https://dbpedia.org/
    - Wikidata: https://www.wikidata.org/
    - SemTab 2019: https://doi.org/10.5281/zenodo.3518539
    - GeoDatos: https://www.geodatos.net
    - The Pudding: https://pudding.cool/
    - Offices.net: https://offices.net/
    - DATA.GOV: https://www.data.gov/

    THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

    Changelog:

    v2.0

    New GT for 2T_WD

    A few entities have been removed from the CEA GT, because they are no longer represented in WD (e.g., dbr:Devonté points to wd:Q21155080, which does not exist)

    Table codes and values differ from the previous version because of the random noise.

    Updated ancestor/descendant hierarchies to evaluate CTA.

    v1.0

    New Wikidata version (2T_WD)

    Fix header for tables CTRL_DBP_MUS_rock_bands_labels.csv and CTRL_DBP_MUS_rock_bands_labels_NOISE2.csv (column 2 was reported with id 1 in target - NOTE: the affected column has been removed from the SemTab2020 evaluation)

    Remove duplicated entries in tables

    Remove rows with wrong values (e.g., the Kazakhstan entity has an empty name "''")

    Many rows and noised columns are shuffled/changed due to the random noise generator algorithm

    Remove row "Florida","Floorida","New York, NY" from TOUGH_WEB_MISSP_1000_us_cities.csv (and all its NOISE1 variants)

    Fix header of tables:

    CTRL_WIKI_POL_List_of_current_monarchs_of_sovereign_states.csv

    CTRL_WIKI_POL_List_of_current_monarchs_of_sovereign_states_NOISE2.csv

    TOUGH_T2D_BUS_29414811_2_4773219892816395776_videogames_developers.csv

    TOUGH_T2D_BUS_29414811_2_4773219892816395776_videogames_developers_NOISE2.csv

    v0.1-pre

    First submission. It contains only tables, without GT and Targets.

  10. Nerwip Corpus v4 - Data

    • figshare.com
    zip
    Updated Jan 19, 2016
    Cite
    Vincent Labatut (2016). Nerwip Corpus v4 - Data [Dataset]. http://doi.org/10.6084/m9.figshare.1318733.v6
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Vincent Labatut
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Set of 408 biographic articles extracted from Wikipedia. Most of them are represented by 5 different files: text only, text and hyperlinks, annotations, meta-data, and HTML.

  11. WikiEvents Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Sha Li; Heng Ji; Jiawei Han, WikiEvents Dataset [Dataset]. https://paperswithcode.com/dataset/wikievents
    Explore at:
    Authors
    Sha Li; Heng Ji; Jiawei Han
    Description

    WikiEvents is a document-level event extraction benchmark dataset which includes complete event and coreference annotation.

  12. Nerwip Corpus v4 - Data

    • figshare.com
    zip
    Updated Jan 19, 2016
    + more versions
    Cite
    Vincent Labatut (2016). Nerwip Corpus v4 - Data [Dataset]. http://doi.org/10.6084/m9.figshare.1318733.v2
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Vincent Labatut
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Set of 409 biographic articles extracted from Wikipedia. Most of them are represented by 5 different files: text only, text and hyperlinks, annotations, meta-data, and HTML.

  13. Data from: Image Label Wikification Collection (ILWC)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Oard, Douglas (2020). Image Label Wikification Collection (ILWC) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3353812
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Oard, Douglas
    Suzuki, Tokinori
    Ikeda, Daisuke
    Galuščáková, Petra
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Image Label Wikification Collection (ILWC) is a test collection for evaluating systems that link descriptive labels of images to Wikipedia pages as the labels' entities. ILWC consists of 450 images and a total of 2,280 user-assigned labels with manually judged correct annotations.

    This collection contains the following files:

    image_URL.txt

    topics.txt

    rels.txt

    ImageCLEF_concepts.txt

    image_URL.txt is a list of image files and Flickr URLs from which the files were downloaded.

    topics.txt is a topic file. The first two letters of each topic ID in the topic file correspond to the collection category of animal names. The correspondence is as follows:

    Albatross

    Bee

    Bison

    Boar

    Coyote

    Cricket

    Jaguar

    Kite

    Llama

    Mouse

    Panther

    Quail

    Stingray

    Tiger shark

    Whippet

    rels.txt is a list of correct Wikipedia pages for user-assigned image labels of the topics.

    ImageCLEF_concepts.txt is a list of the concepts from the ImageCLEF 2014 Scalable Concept Image Annotation task test collection that are used for the Image Label Wikification task.

  14. Data from: Tokenized and POS-Tagged Khmer Data of the Asian Language Treebank Project

    • live.european-language-grid.eu
    • explore.openaire.eu
    nova
    Updated Mar 26, 2024
    Cite
    (2024). Tokenized and POS-Tagged Khmer Data of the Asian Language Treebank Project [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7816
    Explore at:
    nova (available download formats)
    Dataset updated
    Mar 26, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction:

    This is the Khmer ALT of the Asian Language Treebank (ALT) Corpus. English texts sampled from English Wikinews were available under a Creative Commons Attribution 2.5 License.

    Please refer to http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/index.html for an introduction of the ALT project.

    Khmer ALT has been developed by NICT and NIPTICT. The license of Khmer ALT is the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License: https://creativecommons.org/licenses/by-nc-sa/4.0/

    Contents

    - data_km.km-[tok|tag].nova: tokenized/POS-tagged Khmer sentences in the nova annotation system, based on the following two guidelines:

      http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/Khmer-annotation-guideline.pdf

      http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/Khmer-annotation-guideline-supplementary.pdf

    Disclaimer

    [1] The content of the selected English Wikinews articles has been translated for this corpus. English texts sampled from English Wikinews were available under a Creative Commons Attribution 2.5 License. Users of the corpus are requested to take careful consideration when encountering any instances of defamation, discriminatory terms, or personal information that might be found within the corpus. Users of the corpus are advised to read the Terms of Use at https://en.wikinews.org/wiki/Main_Page carefully to ensure proper usage.

    [2] NICT bears no responsibility for the contents of the corpus and the lexicon and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the corpus or the lexicon.

    [3] If any copyright infringement or other problems are found in the corpus or the lexicon, please contact us at alt-info[at]khn[dot]nict[dot]go[dot]jp. We will review the issue and undertake appropriate measures when needed.

  15. Data from: Athalia rosae Official Gene Set OGSv1.0

    • agdatacommons.nal.usda.gov
    application/x-gzip
    Updated Feb 13, 2024
    Cite
    Jan Philip Oeyen; Masatsugu Hatakeyama; Daniel S.T. Hughes; Stephen Richards; Bernhard Misof; Oliver Niehuis (2024). Athalia rosae Official Gene Set OGSv1.0 [Dataset]. http://doi.org/10.15482/USDA.ADC/1459566
    Explore at:
    application/x-gzip (available download formats)
    Dataset updated
    Feb 13, 2024
    Dataset provided by
    Ag Data Commons
    Authors
    Jan Philip Oeyen; Masatsugu Hatakeyama; Daniel S.T. Hughes; Stephen Richards; Bernhard Misof; Oliver Niehuis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Athalia rosae genome was recently sequenced and annotated as part of the i5k pilot project by the Baylor College of Medicine. The Athalia rosae research community has manually reviewed and curated the computational gene predictions and generated an official gene set, OGSv1.0. The general procedure for generating this OGS is outlined here: https://github.com/NAL-i5K/I5KNAL_OGS/wiki. OGSv1.0 was generated by merging gene set AROS-V0.5.3-Models, generated by the Baylor College of Medicine, and community-curated models in the Apollo software, after QC of the Apollo output.

    Resources in this dataset:

    Resource Title: Athalia rosae Official Gene Set OGSv1.0.
    File Name: athros_OGS_v1-0.tar.gz
    Resource Description: The attached tar.gz archive (athros_OGS_v1-0.tar.gz) contains the following files:

    - ATHROS_OGSv1-0_cds.fa: CDS sequences of Athalia rosae genome annotations OGSv1.0.
    - ATHROS_OGSv1-0_pep.fa: Amino acid sequences of Athalia rosae genome annotations OGSv1.0.
    - ATHROS_OGSv1-0_transcript.fa: Transcript sequences of Athalia rosae genome annotations OGSv1.0.
    - ATHROS_OGSv1-0.gff3: GFF3 of all gene predictions of Athalia rosae genome annotations OGSv1.0.
    - ATHROS_Manual2OGSv1.0_id_mapping.log: A mapping file describing ID and name updates from the dataset Athalia rosae genome annotations v0.5.3.
    - readme_athros_OGS_v1-0_release: Briefly describes how the Athalia rosae Official Gene Set OGSv1.0 was generated.
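
    As a quick sanity check after unpacking the archive, the minimal sketch below tallies feature types in the GFF3 file. It assumes the standard 9-column GFF3 layout and that ATHROS_OGSv1-0.gff3 has been extracted to the working directory.

    from collections import Counter

    # Minimal sketch: count feature types (gene, mRNA, CDS, exon, ...) in the
    # extracted ATHROS_OGSv1-0.gff3, assuming the standard 9-column GFF3 layout.
    counts = Counter()
    with open("ATHROS_OGSv1-0.gff3", encoding="utf-8") as gff:
        for line in gff:
            if line.startswith("#") or not line.strip():
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 9:
                counts[cols[2]] += 1              # column 3 holds the feature type
    print(counts.most_common())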

  16. Data from: Parallel sense-annotated corpus ELEXIS-WSD 1.2

    • live.european-language-grid.eu
    binary format
    Updated Apr 3, 2025
    + more versions
    Cite
    (2025). Parallel sense-annotated corpus ELEXIS-WSD 1.2 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23849
    Explore at:
    binary format (available download formats)
    Dataset updated
    Apr 3, 2025
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.2 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.

    The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words or fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.

    The sentences were tokenized, lemmatized, and tagged with UPOS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. Dependency relations were added with UDPipe 2.15 in version 1.2.

    List of sense inventories:

    - BG: Dictionary of Bulgarian
    - DA: DanNet – The Danish WordNet
    - EN: Open English WordNet
    - ES: Spanish Wiktionary
    - ET: The EKI Combined Dictionary of Estonian
    - HU: The Explanatory Dictionary of the Hungarian Language
    - IT: PSC + Italian WordNet
    - NL: Open Dutch WordNet
    - PT: Portuguese Academy Dictionary (DACL)
    - SL: Digital Dictionary Database of Slovene

    The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its XPOS-tag (if available), its morphological features (FEATS), the head of the dependency relation (HEAD), the type of dependency relation (DEPREL); the ninth column (DEPS) is empty; the final MISC column contains the following: the token's whitespace information (whether the token is followed by a whitespace or not; e.g. SpaceAfter=No), the ID of the sense assigned to the token, the index of the multiword expression (if the token is part of an annotated multiword expression), and the index and type of the named entity annotation (currently only available in elexis-wsd-sl).
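
    A minimal reading sketch, assuming only the column layout described above: it splits each token line into the ten CoNLL-U columns and keeps the MISC field as a key/value dict (SpaceAfter, sense ID, and so on). The exact MISC key names, in particular the one carrying the sense ID, should be taken from 00README.txt; the file name in the example is hypothetical.

    def read_conllu(path):
        """Yield sentences as lists of token dicts from an ELEXIS-WSD CoNLL-U file.

        Only the generic CoNLL-U layout is assumed; the MISC column is kept as a
        dict so the sense ID can be looked up once its key name is known.
        """
        sentence = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:                      # blank line ends a sentence
                    if sentence:
                        yield sentence
                        sentence = []
                    continue
                if line.startswith("#"):          # sentence-level metadata
                    continue
                cols = line.split("\t")
                misc = {}
                if cols[9] != "_":
                    for item in cols[9].split("|"):
                        key, _, value = item.partition("=")
                        misc[key] = value         # e.g. SpaceAfter -> "No"
                sentence.append({"id": cols[0], "form": cols[1], "lemma": cols[2],
                                 "upos": cols[3], "misc": misc})
        if sentence:
            yield sentence

    # Example call (file name is hypothetical):
    # for sent in read_conllu("elexis-wsd-en.conllu"):
    #     print([tok["form"] for tok in sent])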

    Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.

    For more information, please refer to 00README.txt.

    Updates in version 1.2:

    - Several tokenization errors with multiword tokens were fixed in all subcorpora (e.g. the order of subtokens was incorrect in many cases; the issue has now been resolved).
    - XPOS, FEATS, HEAD, and DEPREL columns were added automatically with UDPipe (except for elexis-wsd-sl and elexis-wsd-et; for Slovene, all columns were manually validated; for Estonian, HEAD and DEPREL were manually validated; all other languages contain automatic tags in these columns – for more information on the models used and their performance, see 00README.txt).
    - The entry now includes lists of potential errors in automatically assigned XPOS and FEATS values. In previous versions, only UPOS tags were manually annotated, while the XPOS and FEATS columns were left empty. XPOS and FEATS have now been added automatically through UDPipe. The list of potential errors contains the lines in the corpus in which the XPOS and FEATS columns are potentially incorrect, because the manually validated UPOS tag differs from the automatically assigned UPOS tag, which indicates that the automatically assigned XPOS and FEATS values are probably incorrect. This is meant as a reference for future validation efforts.
    - For Slovene, named entity annotations were added based on the annotations from the SUK 1.1 Training Corpus of Slovene (http://hdl.handle.net/11356/1959).

  17. WikiMed and PubMedDS: Two large-scale datasets for medical concept extraction and normalization research

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 4, 2021
    Cite
    Rosé, Carolyn P (2021). WikiMed and PubMedDS: Two large-scale datasets for medical concept extraction and normalization research [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5753475
    Explore at:
    Dataset updated
    Dec 4, 2021
    Dataset provided by
    Joshi, Rishabh
    Vashishth, Shikhar
    Dutt, Ritam
    Newman-Griffis, Denis
    Rosé, Carolyn P
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two large-scale, automatically-created datasets of medical concept mentions, linked to the Unified Medical Language System (UMLS).

    WikiMed

    Derived from Wikipedia data. Mappings of Wikipedia page identifiers to UMLS Concept Unique Identifiers (CUIs) were extracted by crosswalking Wikipedia, Wikidata, Freebase, and the NCBI Taxonomy to reach existing mappings to UMLS CUIs. This created a 1:1 mapping of approximately 60,500 Wikipedia pages to UMLS CUIs. Links to these pages were then extracted as mentions of the corresponding UMLS CUIs.

    WikiMed contains:

    393,618 Wikipedia page texts

    1,067,083 mentions of medical concepts

    57,739 unique UMLS CUIs

    Manual evaluation of 100 random samples of WikiMed found 91% accuracy in the automatic annotations at the level of UMLS CUIs, and 95% accuracy in terms of semantic type.

    PubMedDS

    Derived from biomedical literature abstracts from PubMed. Mentions were automatically identified using distant supervision based on Medical Subject Heading (MeSH) headers assigned to the papers in PubMed, and recognition of medical concept mentions using the high-performance scispaCy model. MeSH header codes are included as well as their mappings to UMLS CUIs.

    PubMedDS contains:

    13,197,430 abstract texts

    57,943,354 medical concept mentions

    44,881 unique UMLS CUIs

    Comparison with existing manually-annotated datasets (NCBI Disease Corpus, BioCDR, and MedMentions) found 75-90% precision in automatic annotations. Please note this dataset is not a comprehensive annotation of medical concept mentions in these abstracts (only mentions located through distant supervision from MeSH headers were included), but is intended as data for concept normalization research.

    Due to its size, PubMedDS is distributed as 30 individual files of approximately 1.5 million mentions each.

    Data format

    Both datasets use JSON format with one document per line. Each document has the following structure:

    { "_id": "A unique identifier of each document", "text": "Contains text over which mentions are ", "title": "Title of Wikipedia/PubMed Article", "split": "[Not in PubMedDS] Dataset split: ", "mentions": [ { "mention": "Surface form of the mention", "start_offset": "Character offset indicating start of the mention", "end_offset": "Character offset indicating end of the mention", "link_id": "UMLS CUI. In case of multiple CUIs, they are concatenated using '|', i.e., CUI1|CUI2|..." }, {} ] }

  18. The MODA sleep spindle dataset: A large, open, high quality dataset of annotated sleep spindles

    • osf.io
    Updated Sep 15, 2020
    Cite
    Benjamin Yetton; Karine Lacourse; Jacques Delfrate; Sara Mednick; Simon Warby; Prathamesh Kulkarni; Icaro Miranda; Xuyun Sun; Chaehwa Yoo; Raphael Vallat; Han Yu; Shailaja Akella; Joachim Bellet (2020). The MODA sleep spindle dataset: A large, open, high quality dataset of annotated sleep spindles [Dataset]. http://doi.org/10.17605/OSF.IO/8BMA7
    Explore at:
    Dataset updated
    Sep 15, 2020
    Dataset provided by
    Center For Open Science
    Authors
    Benjamin Yetton; Karine Lacourse; Jacques Delfrate; Sara Mednick; Simon Warby; Prathamesh Kulkarni; Icaro Miranda; Xuyun Sun; Chaehwa Yoo; Raphael Vallat; Han Yu; Shailaja Akella; Joachim Bellet
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Online portal for the MODA sleep spindle database. Here you can download high-quality annotated spindle data. Please see the wiki for more information on how to download the data.

  19. TS Wiki Tagged Lemmatized nGrams

    • explore.openaire.eu
    Updated Jan 1, 2022
    Cite
    Taner Sezer (2022). TS Wiki Tagged Lemmatized nGrams [Dataset]. http://doi.org/10.57672/0z31-ga11
    Explore at:
    Dataset updated
    Jan 1, 2022
    Authors
    Taner Sezer
    Description

    PoS-tagged and lemmatized nGrams extracted from the TS Wikipedia Corpus. This dataset consists of 12,101,939 nGrams harvested from the TS Wikipedia Corpus. The data set consists of 9 tab-separated columns, which respectively contain: the first word of the nGram, its part-of-speech annotation, the lemma of the first word, the second word, its part-of-speech annotation, the lemma of the second word, the observed frequency of the nGram, the frequency of the first word, and the frequency of the second word in the whole data set. The given frequencies are calculated case-sensitively.
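
    A minimal parsing sketch that maps the nine tab-separated columns described above onto named fields; the file name and field names are illustrative, not part of the dataset documentation.

    import csv

    FIELDS = ["word1", "pos1", "lemma1",
              "word2", "pos2", "lemma2",
              "ngram_freq", "word1_freq", "word2_freq"]

    # Column order follows the description above; the file name is hypothetical.
    with open("ts_wiki_ngrams.tsv", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) != 9:                     # skip malformed lines
                continue
            record = dict(zip(FIELDS, row))
            record["ngram_freq"] = int(record["ngram_freq"])
            print(record["word1"], record["word2"], record["ngram_freq"])
            break                                 # show just the first record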

  20. GONUTS

    • dknet.org
    • scicrunch.org
    • +2 more
    Updated Aug 29, 2024
    + more versions
    Cite
    (2024). GONUTS [Dataset]. http://identifiers.org/RRID:SCR_000653
    Explore at:
    Dataset updated
    Aug 29, 2024
    Description

    A wiki where users of the Gene Ontology can contribute and view notes about how specific GO terms are used. GONUTS can also be used as a GO term browser, or to search for GO annotations of specific genes from included organisms. The rationale for this wiki is based on helping new users of the Gene Ontology understand and use it. The GONUTS wiki is not an official product of the Gene Ontology consortium. The GO consortium has a public wiki at their website, http://wiki.geneontology.org/.

    Maintaining the ontology involves many decisions to carefully choose terms and relationships. These decisions are currently made at GO meetings and via online discussion using the GO mailing lists and the SourceForge curator request tracker. However, it is difficult for someone starting to use GO to understand these decisions. Some insight can be obtained by mining the tracker, the listservs, and the minutes of GO meetings, but this is difficult, as these discussions are often dispersed and sometimes don't contain the GO accessions in the relevant messages.

    Wikis provide a way to create collaboratively written documentation for each GO term to explain how it should be used, how to satisfy the true path requirement, and whether an annotation should be placed at a different level. In addition, the wiki pages provide a discussion space, where users can post questions and discuss possible changes to the ontology. GONUTS is currently set up so anyone can view or search, but only registered users can edit or add pages. Currently registered users can create new users, and we are working to add at least one registered user for each participating database (So far we have registered users at EcoliHub, EcoCyc, GOA, BeeBase, SGD, dictyBase, FlyBase, WormBase, TAIR, Rat Genome Database, ZFIN, MGI, UCL and AgBase...
