15 datasets found
  1. corpus-carolina

    • huggingface.co
    Updated Aug 2, 2024
    Cite
    Carolina C4AI (2024). corpus-carolina [Dataset]. https://huggingface.co/datasets/carolina-c4ai/corpus-carolina
    Dataset updated
    Aug 2, 2024
    Dataset authored and provided by
    Carolina C4AI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Carolina is an Open Corpus for Linguistics and Artificial Intelligence with a robust volume of texts of varied typology in contemporary Brazilian Portuguese (1970-).

  2. corpus-carolina-jud-lgpd

    • huggingface.co
    Cite
    Marcelo Anselmo de Souza Filho, corpus-carolina-jud-lgpd [Dataset]. https://huggingface.co/datasets/celiudos/corpus-carolina-jud-lgpd
    Authors
    Marcelo Anselmo de Souza Filho
    License

    https://choosealicense.com/licenses/afl-3.0/

    Description

    Carolina Corpus with data annotated in accordance with the LGPD (Brazilian General Data Protection Law)

    This dataset is a derivative of the Carolina Corpus. We analyzed and filtered the content in search of personal data for academic purposes. We balanced the dataset to train the model https://huggingface.co/celiudos/legal-bert-lgpd

    Labels     En
    NOME       NAME
    DATA       DATE
    ENDERECO   ADDRESS
    CEP        ZIPCODE
    CPF        CPF
    TELEFONE   PHONE
    EMAIL      EMAIL
    DINHEIRO   MONEY

    … See the full description on the dataset page: https://huggingface.co/datasets/celiudos/corpus-carolina-jud-lgpd.

  3. ASSET Corpus Dataset

    • paperswithcode.com
    Updated Feb 13, 2020
    Cite
    Fernando Alva-Manchego; Louis Martin; Antoine Bordes; Carolina Scarton; Benoît Sagot; Lucia Specia (2020). ASSET Corpus Dataset [Dataset]. https://paperswithcode.com/dataset/asset-corpus
    Dataset updated
    Feb 13, 2020
    Authors
    Fernando Alva-Manchego; Louis Martin; Antoine Bordes; Carolina Scarton; Benoît Sagot; Lucia Specia
    Description

    A crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations.

  4. corpus-carolina-filtered

    • huggingface.co
    Cite
    Miguel M. Carpi, corpus-carolina-filtered [Dataset]. https://huggingface.co/datasets/mmcarpi/corpus-carolina-filtered
    Authors
    Miguel M. Carpi
    Description

    mmcarpi/corpus-carolina-filtered dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. Diachronic Corpus of Mission Statements for NC and FL Community Colleges

    • zenodo.org
    csv
    Updated Sep 17, 2021
    Cite
    D. F. Ayers; M. Hou; W. D. Brooks; K. Cossey (2021). Diachronic Corpus of Mission Statements for NC and FL Community Colleges [Dataset]. http://doi.org/10.5281/zenodo.5504340
    Available download formats: csv
    Dataset updated
    Sep 17, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    D. F. Ayers; M. Hou; W. D. Brooks; K. Cossey
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    This is a diachronic corpus of mission statements, philosophy statements, and purpose statements for community colleges in North Carolina and Florida. Texts date from the mid-1960s to 2020. Texts are indexed to IPEDS unit id. Texts for some years are missing. "OTM" means other than mission (which is typically a statement of purpose but may include statement of goals). Data were retrieved from archived catalogs and archived websites (e.g., Wayback Machine). The highest level of heading was used. For example, if a college published a statement of mission and a statement of purpose, the statement with the most prominent (typically the first) heading was used.

  6. Spanish SciELO Crawled Biomedical Corpus

    • zenodo.org
    txt
    Updated Mar 22, 2022
    Cite
    Fabián Villena; Carolina Chiu; Jocelyn Dunstan (2022). Spanish SciELO Crawled Biomedical Corpus [Dataset]. http://doi.org/10.5281/zenodo.5902835
    Available download formats: txt
    Dataset updated
    Mar 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fabián Villena; Carolina Chiu; Jocelyn Dunstan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a corpus of Spanish medical articles extracted from the SciELO website (https://scielo.cl/). The corpus was constructed using web scraping extraction techniques and consists of 5694 articles published between 2002 and 2020 across 34 journals specialized in health and biology (specified below).

    Two variables of the corpus are presented here: the first contains the text without pre-processing, which makes it possible to adapt the corpus for different purposes. The second and third have been subjected to pre-processing, which is carried out with the NLTK library in Python and removes capital letters, accents, punctuation, and any non-alphanumeric symbols. In addition, the corpus was tokenized by sentences, obtaining a total of 500828 sentences and 13M tokens. In both cases, the articles are grouped by journal, in order of publication from most recent to oldest. These journals and their respective total of issues correspond to:

    1. Acta bioethica (43 issues)

    2. Anales del Instituto de la Patagonia (31 issues)

    3. Andes pediatrica (3 issues)

    4. Biological Research (63 issues)

    5. Ciencia y enfermería - Revista iberoamericana de investigación (45 issues)

    6. Gayana (Concepción) - International Journal of Biodiversity, Oceanology and Conservation (45 issues)

    7. Gayana. Botánica (41 issues)

    8. International Journal of Morphology (79 issues)

    9. International journal of interdisciplinary dentistry (5 issues)

    10. International journal of odontostomatology (39 issues)

    11. Latin american journal of aquatic research (58 issues)

    12. Revista chilena de cardiología (37 issues)

    13. Revista chilena de enfermedades respiratorias (76 issues)

    14. Revista chilena de entomología (5 issues)

    15. Revista chilena de historia natural (64 issues)

    16. Revista chilena de infectología (131 issues)

    17. Revista chilena de neuro-psiquiatría (89 issues)

    18. Revista chilena de nutrición (86 issues)

    19. Revista chilena de obstetricia y ginecología (117 issues)

    20. Revista chilena de radiología (79 issues)

    21. Revista de biología marina y oceanografía (58 issues)

    22. Revista de cirugía (16 issues)

    23. Revista de otorrinolaringología y cirugía de cabeza y cuello (52 issues)

    24. Revista médica de Chile (264 issues)

    Non-current titles

    1. Boletín chileno de parasitología (5 issues) - Jan 2002: Completed; Continued as Parasitología latinoamericana

    2. Ciencia & trabajo (18 issues) - Feb 2020: Indexing interrupted

    3. Electronic Journal of Biotechnology (84 issues) - April 2017: Indexing interrupted

    4. Investigaciones marinas (22 issues) - Nov 2007: Completed; Continued as Latin american journal of aquatic research

    5. Parasitología al día (9 issues) - Jul 2001: Completed; Continued as Parasitología latinoamericana

    6. Parasitología latinoamericana (13 issues) - Dec 2008: Completed

    7. Revista chilena de anatomía (14 issues) - 2002: Completed; Continued as International Journal of Morphology

    8. Revista chilena de cirugía (78 issues) - May 2019: Completed; Continued as Revista de cirugía

    9. Revista chilena de pediatría (476 issues) - March 2021: Completed; Continued as Andes pediatrica

    10. Revista clínica de periodoncia, implantología y rehabilitación oral (30 issues) - April 2020: Completed; Continued as International journal of interdisciplinary dentistry
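
The pre-processing described for the Spanish SciELO corpus (lowercasing, removing accents, punctuation, and non-alphanumeric symbols, then splitting into sentences) could look roughly like the sketch below. The description says the authors used NLTK; this sketch instead uses only the Python standard library as a stand-in, and the function names `normalize` and `split_sentences`, the naive sentence-splitting rule, and the sample text are all assumptions made for illustration.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and drop non-alphanumeric symbols."""
    lowered = text.lower()
    # Decompose accented characters (e.g. "é" -> "e" + combining accent)
    # and discard the combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", no_accents)
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", cleaned).strip()


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Hypothetical sample text, not taken from the corpus.
raw = "La Revista médica de Chile publicó 264 números. ¿Cuántos artículos hay?"
sentences = [normalize(s) for s in split_sentences(raw)]
print(sentences)
```

Sentences are split before normalization because the splitter relies on the punctuation that normalization removes.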

  7. corpus-carolina-100M-superbpe

    • huggingface.co
    Updated Jun 15, 2016
    Cite
    Miguel M. Carpi (2016). corpus-carolina-100M-superbpe [Dataset]. https://huggingface.co/datasets/mmcarpi/corpus-carolina-100M-superbpe
    Dataset updated
    Jun 15, 2016
    Authors
    Miguel M. Carpi
    Description

    mmcarpi/corpus-carolina-100M-superbpe dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. SemClinBr Dataset

    • paperswithcode.com
    Updated May 9, 2022
    Cite
    Lucas Emanuel Silva e Oliveira; Ana Carolina Peters; Adalniza Moura Pucca da Silva; Caroline P. Gebeluca; Yohan Bonescki Gumiel; Lilian Mie Mukai Cintho; Deborah Ribeiro Carvalho; Sadid A. Hasan; Claudia Maria Cabral Moro, SemClinBr Dataset [Dataset]. https://paperswithcode.com/dataset/semclinbr
    Dataset updated
    May 9, 2022
    Authors
    Lucas Emanuel Silva e Oliveira; Ana Carolina Peters; Adalniza Moura Pucca da Silva; Caroline P. Gebeluca; Yohan Bonescki Gumiel; Lilian Mie Mukai Cintho; Deborah Ribeiro Carvalho; Sadid A. Hasan; Claudia Maria Cabral Moro
    Description

    Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field.

    Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations.

    Results: This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (applying strict match) to 0.92 (considering a relaxed match) while accepting partial overlaps and hierarchically related semantic types. The extrinsic evaluation, when applying the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of annotations, with the systems achieving results that were consistent with the agreement scores.

    Conclusion: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.

    Keywords: Natural language processing, Semantic annotation, Clinical narratives, Corpora, Gold standard
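
To illustrate why the SemClinBr agreement score is higher under a relaxed match than under a strict one, the sketch below compares two annotators' entity spans both ways. It is only an illustration under stated assumptions: the `Entity` type, the F1-style `agreement` function, the sample labels, and the simplification that "relaxed" means overlapping spans with the same label (ignoring hierarchically related semantic types) are hypothetical and are not taken from the SemClinBr paper.

```python
from typing import Callable, NamedTuple


class Entity(NamedTuple):
    start: int  # character offset where the span begins
    end: int    # character offset where the span ends (exclusive)
    label: str  # semantic type (hypothetical labels below)


def strict_match(a: Entity, b: Entity) -> bool:
    """Strict: identical boundaries and identical label."""
    return a == b


def relaxed_match(a: Entity, b: Entity) -> bool:
    """Relaxed (assumption): spans overlap at all and labels are equal."""
    overlaps = a.start < b.end and b.start < a.end
    return overlaps and a.label == b.label


def agreement(ann_a: list[Entity], ann_b: list[Entity],
              match: Callable[[Entity, Entity], bool]) -> float:
    """F1-style agreement between two annotators (an assumed measure)."""
    if not ann_a or not ann_b:
        return 0.0
    matched_a = sum(any(match(a, b) for b in ann_b) for a in ann_a)
    matched_b = sum(any(match(a, b) for a in ann_a) for b in ann_b)
    recall = matched_a / len(ann_a)
    precision = matched_b / len(ann_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two annotators agree on the first entity but disagree on where the second starts.
annotator_a = [Entity(0, 5, "Disorder"), Entity(10, 18, "Procedure")]
annotator_b = [Entity(0, 5, "Disorder"), Entity(12, 18, "Procedure")]

print(agreement(annotator_a, annotator_b, strict_match))
print(agreement(annotator_a, annotator_b, relaxed_match))
```

The boundary disagreement on the second entity counts as a miss under the strict rule but as a hit under the relaxed rule, which is the kind of gap the reported 0.71 and 0.92 scores reflect.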

  9. Word embeddings for the Spanish clinical language

    • zenodo.org
    bin
    Updated Sep 12, 2022
    Cite
    Carolina Chiu; Fabián Villena; Kinan Martin; Fredy Núñez; Cecilia Besa; Jocelyn Dunstan (2022). Word embeddings for the Spanish clinical language [Dataset]. http://doi.org/10.5281/zenodo.6647060
    Available download formats: bin
    Dataset updated
    Sep 12, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Carolina Chiu; Fabián Villena; Kinan Martin; Fredy Núñez; Cecilia Besa; Jocelyn Dunstan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Word embeddings for the Spanish clinical language

    Corpora used to compute the embeddings:

  10. Count of documents for each language in the MLIA corpus.

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    Cite
    Iknoor Singh; Carolina Scarton; Kalina Bontcheva (2023). Count of documents for each language in the MLIA corpus. [Dataset]. http://doi.org/10.1371/journal.pone.0256874.t001
    Available download formats: xls
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Iknoor Singh; Carolina Scarton; Kalina Bontcheva
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Count of documents for each language in the MLIA corpus.

  11. Data from: SimPA: A Sentence-Level Simplification Corpus for the Public...

    • zenodo.org
    • live.european-language-grid.eu
    zip
    Updated Jan 24, 2020
    Cite
    Carolina Scarton; Gustavo Henrique Paetzold; Lucia Specia (2020). SimPA: A Sentence-Level Simplification Corpus for the Public Administration Domain [Dataset]. http://doi.org/10.5281/zenodo.2551297
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Carolina Scarton; Gustavo Henrique Paetzold; Lucia Specia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a sentence-level simplification corpus with content from the Public Administration (PA) domain. The corpus contains 1,100 original sentences with manual simplifications collected through a two-stage process. Firstly, annotators were asked to simplify only words and phrases (lexical simplification). Each sentence was simplified by three annotators. Secondly, one lexically simplified version of each original sentence was further simplified at the syntactic level. In its current version there are 3,300 lexically simplified sentences plus 1,100 syntactically simplified sentences. The corpus will be used for evaluation of text simplification approaches in the scope of the EU H2020 SIMPATICO project - which focuses on accessibility of e-services in the PA domain - and beyond. The main advantage of this corpus is that lexical and syntactic simplifications can be analysed and used in isolation. The lexically simplified corpus is also multi-reference (three different simplifications per original sentence). This is an ongoing effort and our final aim is to collect manual simplifications for the entire set of original sentences, with over 10K sentences.

  12. corpus-synthetic-lgpd

    • huggingface.co
    Cite
    Marcelo Anselmo de Souza Filho, corpus-synthetic-lgpd [Dataset]. https://huggingface.co/datasets/celiudos/corpus-synthetic-lgpd
    Authors
    Marcelo Anselmo de Souza Filho
    License

    https://choosealicense.com/licenses/afl-3.0/

    Description

    Dataset: 105 samples for validation

    This dataset is a sample of 105 documents from the Carolina Corpus, with data annotated in accordance with the LGPD (Brazilian General Data Protection Law). It is part of an academic study comparing legal language models. We used it to validate the model https://huggingface.co/celiudos/legal-bert-lgpd. The data has been modified to preserve privacy while maintaining the structure and content of the documents. The CPF (Brazilian ID Number) had its… See the full description on the dataset page: https://huggingface.co/datasets/celiudos/corpus-synthetic-lgpd.

  13. U.S. Coast Guard (USCG) Sectors

    • data.wu.ac.at
    Updated Jul 3, 2018
    Cite
    Department of Homeland Security (2018). U.S. Coast Guard (USCG) Sectors [Dataset]. https://data.wu.ac.at/schema/data_gov/NTZmYmE4YmEtNjRkMC00YTRlLWE4NDktMmJlMjkxNDgyMGEx
    Dataset updated
    Jul 3, 2018
    Dataset provided by
    U.S. Department of Homeland Security (http://www.dhs.gov/)
    Description

    The Coast Guard Sectors are delineated in the description in the 33 Code of Federal Regulations (CFR) for each Sector Boundary and Area of Responsibility, where latitude and longitude coordinates, as well as county/state/national boundaries, are included to describe the boundaries for each zone. In addition, whenever the Area of Responsibility boundary is over water, the EEZ shapefile is referenced for those occurrences. This layer displays the Coast Guard Sector Boundaries for the following sectors: Anchorage, Baltimore, Boston, Buffalo, Charleston, Columbia River, Corpus Christi, Delaware Bay, Detroit, Guam, Hampton Roads, Honolulu, Houston - Galveston, Humboldt Bay, Jacksonville, Juneau, Key West, Lake Michigan, Long Island Sound, Los Angeles - Long Beach, Lower Mississippi, Miami, Mobile, New Orleans, New York, North Bend, North Carolina, Northern New England, Ohio Valley, Puget Sound, San Diego, San Francisco, San Juan, Sault Ste Marie, Southeastern New England, St. Petersburg, and Upper Mississippi.

  14. Datos Normativos del Sistema Internacional de Sonidos Afectivos (The...

    • ri.conicet.gov.ar
    • datosdeinvestigacion.conicet.gov.ar
    Updated Feb 19, 2025
    Cite
    Irrazabal, Natalia Carolina; Tonini, Fernando; Quián, Maria del Rosario; Feldberg, Carolina (2025). Datos Normativos del Sistema Internacional de Sonidos Afectivos (The International Affective Digitized Sounds, IADS-2) en una muestra Argentina [Dataset]. https://ri.conicet.gov.ar/handle/11336/254778
    Dataset updated
    Feb 19, 2025
    Authors
    Irrazabal, Natalia Carolina; Tonini, Fernando; Quián, Maria del Rosario; Feldberg, Carolina
    License

    Attribution-NonCommercial-ShareAlike 2.5 (CC BY-NC-SA 2.5): https://creativecommons.org/licenses/by-nc-sa/2.5/
    License information was derived automatically

    Area covered
    Argentina
    Description

    This is a corpus of auditory emotional stimuli compiled by Bradley & Lang, 2007. The database comprises a total of 167 emotional sounds grouped into different semantic categories (i.e., human content, animals, nature, transport) and characterized along three affective dimensions: valence, arousal, and dominance. In technical terms, the stimuli are dynamic, with a duration of 6 seconds; their peak intensity lies between 50.4 and 88 dB; they are played through two channels and are encoded in .wav format (Waveform Audio File Format). For this validation, the full set of stimuli was divided into three groups: set 1 with 53 stimuli, set 2 with 54 stimuli, and set 3 with 60 stimuli.

  15. negated_carolina

    • huggingface.co
    Updated Apr 27, 2025
    Cite
    Matheus Westhelle (2025). negated_carolina [Dataset]. https://huggingface.co/datasets/hapaxlegomenon/negated_carolina
    Dataset updated
    Apr 27, 2025
    Authors
    Matheus Westhelle
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    NotCarolina

    This dataset contains examples of negation in Portuguese across multiple domains. It is derived from the Carolina Corpus, which we segment into sentences and filter for common negation words in Portuguese.
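
The derivation described for negated_carolina (segment text into sentences, then keep those containing common Portuguese negation words) might be sketched as follows. This is a minimal illustration only: the word list `NEGATION_WORDS`, the naive sentence splitter, and the sample text are assumptions, not the dataset authors' actual list or pipeline.

```python
import re
import unicodedata

# Hypothetical list of common Portuguese negation words (an assumption,
# not the list used to build the dataset).
NEGATION_WORDS = {"não", "nunca", "jamais", "nem", "nenhum", "nenhuma", "ninguém", "nada"}


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def has_negation(sentence: str) -> bool:
    """True if any token of the sentence is in the negation word list."""
    # Normalize to NFC so accented letters compare equal to the list entries.
    normalized = unicodedata.normalize("NFC", sentence.lower())
    tokens = re.findall(r"\w+", normalized)
    return any(tok in NEGATION_WORDS for tok in tokens)


# Hypothetical sample text, not taken from the Carolina Corpus.
text = "O corpus é aberto. Ele não contém dados privados. Ninguém foi identificado!"
negated = [s for s in split_sentences(text) if has_negation(s)]
print(negated)
```

Matching on whole tokens rather than substrings avoids false hits such as "nem" inside a longer word.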

23 scholarly articles cite the carolina-c4ai/corpus-carolina dataset (per Google Scholar).