Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Carolina is an Open Corpus for Linguistics and Artificial Intelligence with a robust volume of texts of varied typology in contemporary Brazilian Portuguese (1970-).
https://choosealicense.com/licenses/afl-3.0/
Carolina Corpus with data annotated in accordance with the LGPD (Brazilian General Data Protection Law)
This dataset is a derivative of the Carolina Corpus. For academic purposes, we analyzed and filtered the content in search of personal data, then balanced the dataset to train the model https://huggingface.co/celiudos/legal-bert-lgpd
Labels      En
NOME        NAME
DATA        DATE
ENDERECO    ADDRESS
CEP         ZIPCODE
CPF         CPF
TELEFONE    PHONE
EMAIL       EMAIL
DINHEIRO    MONEY
… See the full description on the dataset page: https://huggingface.co/datasets/celiudos/corpus-carolina-jud-lgpd.
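The label pairs listed above can be held in a small lookup table. The following is only a sketch: the names `LABELS_PT_TO_EN` and `to_english` are hypothetical, and the mapping covers just the labels visible in this truncated description, so the full dataset may define more.

```python
# Portuguese label -> English gloss, copied from the listing above.
# Assumption: the description is truncated, so this table may be incomplete.
LABELS_PT_TO_EN = {
    "NOME": "NAME",
    "DATA": "DATE",
    "ENDERECO": "ADDRESS",
    "CEP": "ZIPCODE",
    "CPF": "CPF",
    "TELEFONE": "PHONE",
    "EMAIL": "EMAIL",
    "DINHEIRO": "MONEY",
}


def to_english(label: str) -> str:
    """Return the English gloss, or the label unchanged if it is unknown."""
    return LABELS_PT_TO_EN.get(label, label)
```

Returning unknown labels unchanged keeps the helper safe to use on labels that the truncated listing does not show.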
A crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations.
mmcarpi/corpus-carolina-filtered dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
This is a diachronic corpus of mission statements, philosophy statements, and purpose statements for community colleges in North Carolina and Florida. Texts date from the mid-1960s to 2020. Texts are indexed to IPEDS unit id. Texts for some years are missing. "OTM" means other than mission (which is typically a statement of purpose but may include statement of goals). Data were retrieved from archived catalogs and archived websites (e.g., Wayback Machine). The highest level of heading was used. For example, if a college published a statement of mission and a statement of purpose, the statement with the most prominent (typically the first) heading was used.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a corpus of Spanish medical articles extracted from the SciELO website (https://scielo.cl/). The corpus was constructed using web scraping extraction techniques and consists of 5694 articles published between 2002 and 2020 across 34 journals specialized in health and biology (specified below).
Two variables of the corpus are presented here: the first contains the text without pre-processing, which makes it possible to adapt the corpus for different purposes. The second and third have been subjected to pre-processing, which is carried out with the NLTK library in Python and removes capital letters, accents, punctuation, and any non-alphanumeric symbols. In addition, the corpus was tokenized by sentences, obtaining a total of 500828 sentences and 13M tokens. In both cases, the articles are grouped by journal, in order of publication from most recent to oldest. These journals and their respective total of issues correspond to:
Acta bioethica (43 issues)
Anales del Instituto de la Patagonia (31 issues)
Andes pediatrica (3 issues)
Biological Research (63 issues)
Ciencia y enfermería - Revista iberoamericana de investigación (45 issues)
Gayana (Concepción) - International Journal of Biodiversity, Oceanology and Conservation (45 issues)
Gayana. Botánica (41 issues)
International Journal of Morphology (79 issues)
International journal of interdisciplinary dentistry (5 issues)
International journal of odontostomatology (39 issues)
Latin american journal of aquatic research (58 issues)
Revista chilena de cardiología (37 issues)
Revista chilena de enfermedades respiratorias (76 issues)
Revista chilena de entomología (5 issues)
Revista chilena de historia natural (64 issues)
Revista chilena de infectología (131 issues)
Revista chilena de neuro-psiquiatría (89 issues)
Revista chilena de nutrición (86 issues)
Revista chilena de obstetricia y ginecología (117 issues)
Revista chilena de radiología (79 issues)
Revista de biología marina y oceanografía (58 issues)
Revista de cirugía (16 issues)
Revista de otorrinolaringología y cirugía de cabeza y cuello (52 issues)
Revista médica de Chile (264 issues)
Non-current titles
Boletín chileno de parasitología (5 issues) - Jan 2002: Completed; Continued as Parasitología latinoamericana
Ciencia & trabajo (18 issues) - Feb 2020: Indexing interrupted
Electronic Journal of Biotechnology (84 issues) - April 2017: Indexing interrupted
Investigaciones marinas (22 issues) - Nov 2007: Completed; Continued as Latin american journal of aquatic research
Parasitología al día (9 issues) - Jul 2001: Completed; Continued as Parasitología latinoamericana
Parasitología latinoamericana (13 issues) - Dec 2008: Completed
Revista chilena de anatomía (14 issues) - 2002: Completed; Continued as International Journal of Morphology
Revista chilena de cirugía (78 issues) - May 2019: Completed; Continued as Revista de cirugía
Revista chilena de pediatría (476 issues) - March 2021: Completed; Continued as Andes pediatrica
Revista clínica de periodoncia, implantología y rehabilitación oral (30 issues) - April 2020: Completed; Continued as International journal of interdisciplinary dentistry
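The pre-processing described for the SciELO corpus (removing capital letters, accents, punctuation, and non-alphanumeric symbols, plus sentence tokenization) can be approximated as follows. This is a minimal sketch and not the authors' pipeline: the description says NLTK was used, whereas this version relies only on the Python standard library, and the function names and the naive sentence splitter are assumptions made for illustration.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and drop non-alphanumeric symbols."""
    lowered = text.lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Replace anything that is not a letter, digit, or whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", no_accents)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", cleaned).strip()


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```

Sentences should be split before normalizing, because normalization removes the punctuation that the splitter depends on. A trained tokenizer would handle abbreviations and decimal numbers better than this regular expression.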
mmcarpi/corpus-carolina-100M-superbpe dataset hosted on Hugging Face and contributed by the HF Datasets community
Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field.
Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations.
Results: This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (applying strict match) to 0.92 (considering a relaxed match) while accepting partial overlaps and hierarchically related semantic types. The extrinsic evaluation, when applying the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of annotations, with the systems achieving results that were consistent with the agreement scores.
Conclusion: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.
Keywords: Natural language processing, Semantic annotation, Clinical narratives, Corpora, Gold standard
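The difference between the strict and relaxed matches mentioned in the results can be illustrated with a small sketch. This is not the SemClinBr evaluation code: the `Entity` type, the `PARENT` hierarchy, and both matching functions are hypothetical, and they assume that a relaxed match means overlapping spans whose semantic types are identical or in a direct parent-child relation.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # semantic type


# Hypothetical child -> parent relation between semantic types.
PARENT = {"Sign or Symptom": "Finding"}


def strict_match(a: Entity, b: Entity) -> bool:
    """Same span boundaries and same semantic type."""
    return a == b


def related(x: str, y: str) -> bool:
    """Identical types, or one is the direct parent of the other."""
    return x == y or PARENT.get(x) == y or PARENT.get(y) == x


def relaxed_match(a: Entity, b: Entity) -> bool:
    """Any partial overlap, with identical or hierarchically related types."""
    overlaps = a.start < b.end and b.start < a.end
    return overlaps and related(a.label, b.label)
```

Under these assumptions, two annotators who mark overlapping spans as "Sign or Symptom" and "Finding" disagree on a strict match but agree on a relaxed one, which is one way the gap between the two agreement scores could arise.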
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Word embeddings for the Spanish clinical language
Corpora used to compute the embeddings:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Count of documents for each language in the MLIA corpus.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a sentence-level simplification corpus with content from the Public Administration (PA) domain. The corpus contains 1,100 original sentences with manual simplifications collected through a two-stage process. Firstly, annotators were asked to simplify only words and phrases (lexical simplification). Each sentence was simplified by three annotators. Secondly, one lexically simplified version of each original sentence was further simplified at the syntactic level. In its current version there are 3,300 lexically simplified sentences plus 1,100 syntactically simplified sentences. The corpus will be used for evaluation of text simplification approaches in the scope of the EU H2020 SIMPATICO project - which focuses on accessibility of e-services in the PA domain - and beyond. The main advantage of this corpus is that lexical and syntactic simplifications can be analysed and used in isolation. The lexically simplified corpus is also multi-reference (three different simplifications per original sentence). This is an ongoing effort and our final aim is to collect manual simplifications for the entire set of original sentences, with over 10K sentences.
https://choosealicense.com/licenses/afl-3.0/
Dataset: 105 samples for validation
This dataset is a sample of 105 documents from the Carolina Corpus, with data annotated in accordance with the LGPD (Brazilian General Data Protection Law). It is part of an academic study comparing legal language models. We used it to validate the model https://huggingface.co/celiudos/legal-bert-lgpd. The data has been modified to preserve privacy while maintaining the structure and content of the documents. The CPF (Brazilian ID Number) had its… See the full description on the dataset page: https://huggingface.co/datasets/celiudos/corpus-synthetic-lgpd.
The Coast Guard Sectors are delineated in the description in the 33 Code of Federal Regulations (CFR) for each Sector Boundary and Area of Responsibility, where latitude and longitude coordinates, as well as county/state/national boundaries, are included to describe the boundaries for each zone. In addition, whenever the Area of Responsibility boundary is over water, the EEZ shapefile is referenced for those occurrences. This layer displays the Coast Guard Sector Boundaries for the following sectors: Anchorage, Baltimore, Boston, Buffalo, Charleston, Columbia River, Corpus Christi, Delaware Bay, Detroit, Guam, Hampton Roads, Honolulu, Houston - Galveston, Humboldt Bay, Jacksonville, Juneau, Key West, Lake Michigan, Long Island Sound, Los Angeles - Long Beach, Lower Mississippi, Miami, Mobile, New Orleans, New York, North Bend, North Carolina, Northern New England, Ohio Valley, Puget Sound, San Diego, San Francisco, San Juan, Sault Ste Marie, Southeastern New England, St. Petersburg, and Upper Mississippi.
Attribution-NonCommercial-ShareAlike 2.5 (CC BY-NC-SA 2.5): https://creativecommons.org/licenses/by-nc-sa/2.5/
License information was derived automatically
This is a corpus of auditory emotional stimuli compiled by Bradley & Lang, 2007. The database contains a total of 167 emotional sounds grouped into different semantic categories (i.e., human content, animals, nature, transport) and characterized along the three affective dimensions: valence, arousal, and dominance. In technical terms, these are dynamic stimuli with a duration of 6 seconds; their peak intensity lies between 50.4 and 88 dB, they are played through two channels, and they are encoded in .wav format (Waveform Audio File Format). To carry out this validation, all of the stimuli were divided into three groups: set 1 with 53 stimuli, set 2 with 54 stimuli, and set 3 with 60 stimuli.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
NotCarolina
This dataset contains examples of negation in Portuguese across multiple domains. It is derived from the Carolina Corpus, which we segment into sentences and filter for common negation words in Portuguese.
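The construction described for NotCarolina (segment the text into sentences, then keep those containing common negation words) might look roughly like the sketch below. The word list, the function names, and the regex-based sentence splitter are all assumptions made for illustration; the description does not state which negation words or which segmenter the dataset actually uses.

```python
import re
import unicodedata

# Illustrative list only; the dataset's actual negation words are not given.
NEGATION_WORDS = {
    "não", "nunca", "jamais", "nem",
    "nenhum", "nenhuma", "ninguém", "nada",
}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def has_negation(sentence: str) -> bool:
    """True if any whole token of the sentence is a listed negation word."""
    # NFC keeps accented letters as single characters so that \w matches them.
    normalized = unicodedata.normalize("NFC", sentence.lower())
    tokens = re.findall(r"\w+", normalized)
    return any(token in NEGATION_WORDS for token in tokens)


def filter_negations(text: str) -> list[str]:
    """Return only the sentences that contain a negation word."""
    return [s for s in split_sentences(text) if has_negation(s)]
```

Matching whole tokens rather than substrings avoids false hits such as "nem" inside "ontem".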