15 datasets found

corpus-carolina
huggingface.co
Updated Aug 2, 2024
Cite
Carolina C4AI (2024). corpus-carolina [Dataset]. https://huggingface.co/datasets/carolina-c4ai/corpus-carolina
Dataset updated
Aug 2, 2024
Dataset authored and provided by
Carolina C4AI
License
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Carolina is an Open Corpus for Linguistics and Artificial Intelligence with a robust volume of texts of varied typology in contemporary Brazilian Portuguese (1970-).
corpus-carolina-jud-lgpd
huggingface.co
Cite
Marcelo Anselmo de Souza Filho, corpus-carolina-jud-lgpd [Dataset]. https://huggingface.co/datasets/celiudos/corpus-carolina-jud-lgpd
Authors
Marcelo Anselmo de Souza Filho
License
https://choosealicense.com/licenses/afl-3.0/
Description
Carolina Corpus with data annotated in accordance with the LGPD (Brazilian General Data Protection Law)

This dataset is a derivative of the Carolina Corpus. We analyzed and filtered the content in search of personal data for academic purposes. We balanced the dataset to train the model https://huggingface.co/celiudos/legal-bert-lgpd

Labels and their English equivalents:

NOME: NAME
DATA: DATE
ENDERECO: ADDRESS
CEP: ZIPCODE
CPF: CPF
TELEFONE: PHONE
EMAIL: EMAIL
DINHEIRO: MONEY…

See the full description on the dataset page: https://huggingface.co/datasets/celiudos/corpus-carolina-jud-lgpd.
ASSET Corpus Dataset
paperswithcode.com
Updated Feb 13, 2020
Cite
Fernando Alva-Manchego; Louis Martin; Antoine Bordes; Carolina Scarton; Benoît Sagot; Lucia Specia (2020). ASSET Corpus Dataset [Dataset]. https://paperswithcode.com/dataset/asset-corpus
Dataset updated
Feb 13, 2020
Authors
Fernando Alva-Manchego; Louis Martin; Antoine Bordes; Carolina Scarton; Benoît Sagot; Lucia Specia
Description
A crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations.
corpus-carolina-filtered
huggingface.co
Cite
Miguel M. Carpi, corpus-carolina-filtered [Dataset]. https://huggingface.co/datasets/mmcarpi/corpus-carolina-filtered
Authors
Miguel M. Carpi
Description
mmcarpi/corpus-carolina-filtered dataset hosted on Hugging Face and contributed by the HF Datasets community
Diachronic Corpus of Mission Statements for NC and FL Community Colleges
zenodo.org
csv
Updated Sep 17, 2021
Cite
D. F. Ayers; M. Hou; W. D. Brooks; K. Cossey (2021). Diachronic Corpus of Mission Statements for NC and FL Community Colleges [Dataset]. http://doi.org/10.5281/zenodo.5504340
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.5504340
Dataset updated
Sep 17, 2021
Dataset provided by
Zenodo - http://zenodo.org/
Authors
D. F. Ayers; M. Hou; W. D. Brooks; K. Cossey
License
Attribution 2.0 (CC BY 2.0) - https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Description
This is a diachronic corpus of mission statements, philosophy statements, and purpose statements for community colleges in North Carolina and Florida. Texts date from the mid-1960s to 2020. Texts are indexed to IPEDS unit id. Texts for some years are missing. "OTM" means other than mission (which is typically a statement of purpose but may include statement of goals). Data were retrieved from archived catalogs and archived websites (e.g., Wayback Machine). The highest level of heading was used. For example, if a college published a statement of mission and a statement of purpose, the statement with the most prominent (typically the first) heading was used.
Spanish SciELO Crawled Biomedical Corpus
zenodo.org
txt
Updated Mar 22, 2022
Cite
Fabián Villena; Carolina Chiu; Jocelyn Dunstan (2022). Spanish SciELO Crawled Biomedical Corpus [Dataset]. http://doi.org/10.5281/zenodo.5902835
Available download formats: txt
Unique identifier
https://doi.org/10.5281/zenodo.5902835
Dataset updated
Mar 22, 2022
Dataset provided by
Zenodo - http://zenodo.org/
Authors
Fabián Villena; Carolina Chiu; Jocelyn Dunstan
License
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present a corpus of Spanish medical articles extracted from the SciELO website (https://scielo.cl/). The corpus was constructed using web scraping extraction techniques and consists of 5694 articles published between 2002 and 2020 across 34 journals specialized in health and biology (specified below).

Two variables of the corpus are presented here: the first contains the text without pre-processing, which makes it possible to adapt the corpus for different purposes. The second and third have been subjected to pre-processing, which is carried out with the NLTK library in Python and removes capital letters, accents, punctuation, and any non-alphanumeric symbols. In addition, the corpus was tokenized by sentences, obtaining a total of 500828 sentences and 13M tokens. In both cases, the articles are grouped by journal, in order of publication from most recent to oldest. These journals and their respective total of issues correspond to:

Acta bioethica (43 issues)

Anales del Instituto de la Patagonia (31 issues)

Andes pediatrica (3 issues)

Biological Research (63 issues)

Ciencia y enfermería - Revista iberoamericana de investigación (45 issues)

Gayana (Concepción) - International Journal of Biodiversity, Oceanology and Conservation (45 issues)

Gayana. Botánica (41 issues)

International Journal of Morphology (79 issues)

International journal of interdisciplinary dentistry (5 issues)

International journal of odontostomatology (39 issues)

Latin american journal of aquatic research (58 issues)

Revista chilena de cardiología (37 issues)

Revista chilena de enfermedades respiratorias (76 issues)

Revista chilena de entomología (5 issues)

Revista chilena de historia natural (64 issues)

Revista chilena de infectología (131 issues)

Revista chilena de neuro-psiquiatría (89 issues)

Revista chilena de nutrición (86 issues)

Revista chilena de obstetricia y ginecología (117 issues)

Revista chilena de radiología (79 issues)

Revista de biología marina y oceanografía (58 issues)

Revista de cirugía (16 issues)

Revista de otorrinolaringología y cirugía de cabeza y cuello (52 issues)

Revista médica de Chile (264 issues)

Non-current titles

Boletín chileno de parasitología (5 issues) - Jan 2002: Completed; Continued as Parasitología latinoamericana

Ciencia & trabajo (18 issues) - Feb 2020: Indexing interrupted

Electronic Journal of Biotechnology (84 issues) - April 2017: Indexing interrupted

Investigaciones marinas (22 issues) - Nov 2007: Completed; Continued as Latin american journal of aquatic research

Parasitología al día (9 issues) - Jul 2001: Completed; Continued as Parasitología latinoamericana

Parasitología latinoamericana (13 issues) - Dec 2008: Completed

Revista chilena de anatomía (14 issues) - 2002: Completed; Continued as International Journal of Morphology

Revista chilena de cirugía (78 issues) - May 2019: Completed; Continued as Revista de cirugía

Revista chilena de pediatría (476 issues) - March 2021: Completed; Continued as Andes pediatrica

Revista clínica de periodoncia, implantología y rehabilitación oral (30 issues) - April 2020: Completed; Continued as International journal of interdisciplinary dentistry
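
The pre-processing described for this corpus (lowercasing, accent removal, stripping punctuation and non-alphanumeric symbols, and sentence tokenization) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the description says NLTK was used, whereas this sketch substitutes the Python standard library, and the function names `normalize` and `split_sentences`, the naive sentence splitter, and the sample text are all assumptions made for the example.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and drop non-alphanumeric symbols."""
    lowered = text.lower()
    # Decompose accented characters, then discard the combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", no_accents)
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(cleaned.split())


def split_sentences(text: str) -> list[str]:
    """Naive splitter (a stand-in for an NLTK sentence tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Hypothetical sample text, not taken from the corpus.
raw = "La Revista médica de Chile publicó 264 números. ¿Cuántos artículos hay?"
sentences = [normalize(s) for s in split_sentences(raw)]
print(sentences)
```

Splitting into sentences before normalizing matters here, because the normalization step removes the punctuation that the splitter relies on.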
corpus-carolina-100M-superbpe
huggingface.co
Updated Jun 15, 2016
+ more versions
Cite
Miguel M. Carpi (2016). corpus-carolina-100M-superbpe [Dataset]. https://huggingface.co/datasets/mmcarpi/corpus-carolina-100M-superbpe
Dataset updated
Jun 15, 2016
Authors
Miguel M. Carpi
Description
mmcarpi/corpus-carolina-100M-superbpe dataset hosted on Hugging Face and contributed by the HF Datasets community
SemClinBr Dataset
paperswithcode.com
Updated May 9, 2022
Cite
Lucas Emanuel Silva e Oliveira; Ana Carolina Peters; Adalniza Moura Pucca da Silva; Caroline P. Gebeluca; Yohan Bonescki Gumiel; Lilian Mie Mukai Cintho; Deborah Ribeiro Carvalho; Sadid A. Hasan; Claudia Maria Cabral Moro, SemClinBr Dataset [Dataset]. https://paperswithcode.com/dataset/semclinbr
Dataset updated
May 9, 2022
Authors
Lucas Emanuel Silva e Oliveira; Ana Carolina Peters; Adalniza Moura Pucca da Silva; Caroline P. Gebeluca; Yohan Bonescki Gumiel; Lilian Mie Mukai Cintho; Deborah Ribeiro Carvalho; Sadid A. Hasan; Claudia Maria Cabral Moro
Description
Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field.

Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present, (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations.

Results: This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (applying strict match) to 0.92 (considering a relaxed match) while accepting partial overlaps and hierarchically related semantic types. The extrinsic evaluation, when applying the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of annotations, with the systems achieving results that were consistent with the agreement scores.

Conclusion: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.

Keywords: Natural language processing, Semantic annotation, Clinical narratives, Corpora, Gold standard
Word embeddings for the Spanish clinical language
zenodo.org
bin
Updated Sep 12, 2022
Cite
Carolina Chiu; Fabián Villena; Kinan Martin; Fredy Núñez; Cecilia Besa; Jocelyn Dunstan (2022). Word embeddings for the Spanish clinical language [Dataset]. http://doi.org/10.5281/zenodo.6647060
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.6647060
Dataset updated
Sep 12, 2022
Dataset provided by
Zenodo - http://zenodo.org/
Authors
Carolina Chiu; Fabián Villena; Kinan Martin; Fredy Núñez; Cecilia Besa; Jocelyn Dunstan
License
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Word embeddings for the Spanish clinical language

Corpora used to compute the embeddings:

Chilean waiting list corpus - https://zenodo.org/record/7072314

Medical Journal in Spanish - https://zenodo.org/record/7072352

UMLS Heading Sequences in Spanish - https://zenodo.org/record/7072323
Count of documents for each language in the MLIA corpus.
plos.figshare.com
xls
Updated Jun 9, 2023
Cite
Iknoor Singh; Carolina Scarton; Kalina Bontcheva (2023). Count of documents for each language in the MLIA corpus. [Dataset]. http://doi.org/10.1371/journal.pone.0256874.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0256874.t001
Dataset updated
Jun 9, 2023
Dataset provided by
PLOS ONE
Authors
Iknoor Singh; Carolina Scarton; Kalina Bontcheva
License
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Count of documents for each language in the MLIA corpus.
Data from: SimPA: A Sentence-Level Simplification Corpus for the Public...
zenodo.org
live.european-language-grid.eu
zip
Updated Jan 24, 2020
Cite
Carolina Scarton; Gustavo Henrique Paetzold; Lucia Specia (2020). SimPA: A Sentence-Level Simplification Corpus for the Public Administration Domain [Dataset]. http://doi.org/10.5281/zenodo.2551297
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.2551297
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo - http://zenodo.org/
Authors
Carolina Scarton; Gustavo Henrique Paetzold; Lucia Specia
License
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present a sentence-level simplification corpus with content from the Public Administration (PA) domain. The corpus contains 1,100 original sentences with manual simplifications collected through a two-stage process. Firstly, annotators were asked to simplify only words and phrases (lexical simplification). Each sentence was simplified by three annotators. Secondly, one lexically simplified version of each original sentence was further simplified at the syntactic level. In its current version there are 3,300 lexically simplified sentences plus 1,100 syntactically simplified sentences. The corpus will be used for evaluation of text simplification approaches in the scope of the EU H2020 SIMPATICO project - which focuses on accessibility of e-services in the PA domain - and beyond. The main advantage of this corpus is that lexical and syntactic simplifications can be analysed and used in isolation. The lexically simplified corpus is also multi-reference (three different simplifications per original sentence). This is an ongoing effort and our final aim is to collect manual simplifications for the entire set of original sentences, with over 10K sentences.
corpus-synthetic-lgpd
huggingface.co
Cite
Marcelo Anselmo de Souza Filho, corpus-synthetic-lgpd [Dataset]. https://huggingface.co/datasets/celiudos/corpus-synthetic-lgpd
Authors
Marcelo Anselmo de Souza Filho
License
https://choosealicense.com/licenses/afl-3.0/
Description
Dataset: 105 samples for validation

This dataset is a sample of 105 documents from the Carolina Corpus, with data annotated in accordance with the LGPD (Brazilian General Data Protection Law). It is part of an academic study comparing legal language models. We used it to validate the model https://huggingface.co/celiudos/legal-bert-lgpd. The data has been modified to preserve privacy while maintaining the structure and content of the documents. The CPF (Brazilian ID Number) had its… See the full description on the dataset page: https://huggingface.co/datasets/celiudos/corpus-synthetic-lgpd.
U.S. Coast Guard (USCG) Sectors
data.wu.ac.at
Updated Jul 3, 2018
+ more versions
Cite
Department of Homeland Security (2018). U.S. Coast Guard (USCG) Sectors [Dataset]. https://data.wu.ac.at/schema/data_gov/NTZmYmE4YmEtNjRkMC00YTRlLWE4NDktMmJlMjkxNDgyMGEx
Dataset updated
Jul 3, 2018
Dataset provided by
U.S. Department of Homeland Security - http://www.dhs.gov/
Description
The Coast Guard Sectors are delineated in the description in the 33 Code of Federal Regulations (CFR) for each Sector Boundary and Area of Responsibility, where latitude and longitude coordinates, as well as county/state/national boundaries, are included to describe the boundaries for each zone. In addition, whenever the Area of Responsibility boundary is over water, the EEZ shapefile is referenced for those occurrences. This layer displays the Coast Guard Sector Boundaries for the following sectors: Anchorage, Baltimore, Boston, Buffalo, Charleston, Columbia River, Corpus Christi, Delaware Bay, Detroit, Guam, Hampton Roads, Honolulu, Houston - Galveston, Humboldt Bay, Jacksonville, Juneau, Key West, Lake Michigan, Long Island Sound, Los Angeles - Long Beach, Lower Mississippi, Miami, Mobile, New Orleans, New York, North Bend, North Carolina, Northern New England, Ohio Valley, Puget Sound, San Diego, San Francisco, San Juan, Sault Ste Marie, Southeastern New England, St. Petersburg, and Upper Mississippi.
Datos Normativos del Sistema Internacional de Sonidos Afectivos (The...
ri.conicet.gov.ar
datosdeinvestigacion.conicet.gov.ar
Updated Feb 19, 2025
Cite
Irrazabal, Natalia Carolina; Tonini, Fernando; Quián, Maria del Rosario; Feldberg, Carolina (2025). Datos Normativos del Sistema Internacional de Sonidos Afectivos (The International Affective Digitized Sounds, IADS-2) en una muestra Argentina [Dataset]. https://ri.conicet.gov.ar/handle/11336/254778
Dataset updated
Feb 19, 2025
Authors
Irrazabal, Natalia Carolina; Tonini, Fernando; Quián, Maria del Rosario; Feldberg, Carolina
License
Attribution-NonCommercial-ShareAlike 2.5 (CC BY-NC-SA 2.5) - https://creativecommons.org/licenses/by-nc-sa/2.5/
License information was derived automatically
Area covered
Argentina
Description
This is a corpus of auditory emotional stimuli compiled by Bradley & Lang (2007). The database comprises 167 emotional sounds grouped into semantic categories (i.e., human content, animals, nature, transport) and characterized along three affective dimensions: valence, arousal, and dominance. Technically, the stimuli are dynamic, each lasting 6 seconds; their peak intensity ranges from 50.4 to 88 dB, they are played over two channels, and they are encoded in .wav format (Waveform Audio File Format). For this validation, the stimuli were divided into three groups: set 1 with 53 stimuli, set 2 with 54 stimuli, and set 3 with 60 stimuli.
negated_carolina
huggingface.co
Updated Apr 27, 2025
Cite
Matheus Westhelle (2025). negated_carolina [Dataset]. https://huggingface.co/datasets/hapaxlegomenon/negated_carolina
Dataset updated
Apr 27, 2025
Authors
Matheus Westhelle
License
Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
NotCarolina

This dataset contains examples of negation in Portuguese across multiple domains. It is derived from the Carolina Corpus, which we segment into sentences and filter for common negation words in Portuguese.
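
The derivation described above (segment into sentences, then keep those containing common negation words) might look roughly like the sketch below. It is only an illustration under stated assumptions: the actual word list and sentence segmenter used for this dataset are not given in the description, so `NEGATION_WORDS`, the regex-based splitter, the function names, and the sample text are all hypothetical.

```python
import re

# Hypothetical negation cues; the dataset's actual word list is not stated.
NEGATION_WORDS = {"não", "nunca", "jamais", "nem", "nenhum", "nenhuma", "ninguém", "nada"}


def split_sentences(text):
    """Naive splitter on sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def has_negation(sentence):
    """True if any lowercase word token is in the negation list."""
    tokens = re.findall(r"\w+", sentence.lower())
    return any(tok in NEGATION_WORDS for tok in tokens)


def negated_sentences(text):
    """Keep only the sentences that contain a negation cue."""
    return [s for s in split_sentences(text) if has_negation(s)]


# Hypothetical sample text, not taken from the Carolina Corpus.
sample = "O tribunal aceitou o recurso. O réu não compareceu. Ninguém foi notificado."
result = negated_sentences(sample)
print(result)
```

Matching on whole tokens rather than substrings avoids false hits such as "nem" inside a longer word.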
corpus-carolina (carolina-c4ai/corpus-carolina): 23 scholarly articles cite this dataset.