Disk images, memory dumps, network packet captures, and files for use in digital forensics research and education. All of this information is accessible through the digitalcorpora.org website, and made available at s3://digitalcorpora/. Some of these datasets implement scenarios that were performed by students, faculty, and others acting in persona. As such, the information is synthetic and may be used without prior authorization or IRB approval. Details of these datasets can be found at http://www.simson.net/clips/academic/2009.DFRWS.Corpora.pdf
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
I needed a dataset of PDF files as images for a project and couldn't find another source online, so I decided to make my own. The dataset currently consists of roughly 30k JPG images, but more might be added in the future.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The DECM Corpus is a digital corpus of the texts of the Relaciones Geográficas de Nueva España (the Geographic Reports of New Spain) in several versions, including a machine-ready version, a gold-standard annotated dataset, and an automatically annotated version ready for text mining and machine learning experiments. This is the version of the entire RG corpus automatically annotated using the ML models trained with the DECM Gold Standard Corpus. The files are available in JSON and TSV format, and the download also contains the file for the DECM Ontology. This corpus can be used for quantitative and qualitative research, as well as advanced analyses using text mining techniques, corpus linguistics and other methods such as Geographical Text Analysis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Founded in 1948 as the official magazine of the United Nations Educational, Scientific and Cultural Organization, The UNESCO Courier represents an extraordinary resource for research on global themes in the humanities. The complete archive of the magazine is available in PDF form through UNESCO. These files make it possible for users anywhere to read individual issues, but they do not allow for full-text searching, much less any of the computational text analysis methods that have recently made important advances in humanities research. The Curated Courier 1.0 is a package of digital text corpora, text analysis tools, and supplementary materials that makes the complete archive of The UNESCO Courier from 1948 to 2020 machine-readable, accessible, and reusable for digital text analysis.

Here on Zenodo we publish two Courier corpora. The first corpus (curated_courier_article_corpus) consists of the texts of all articles published in the English-language edition of The UNESCO Courier between 1948 and 2020. For this corpus we have extracted and reconstructed the complete text of all articles, for example by pulling together non-contiguous pages where necessary and by removing non-article text (masthead, photo captions, letters to the editor, and so on). We have linked each article to a comprehensive curated metadata index, included in the download (document_index.csv). The second corpus (curated_issues) compiles the complete text of all Courier issues (English-language edition), 1948-2020. To prepare this corpus we extracted text from the PDFs that UNESCO has made available, used multiple modes of OCR, and rendered each issue as a simple text file. Our test of the OCR quality finds an average error rate of 0.7%, which should be considered good quality. Working data from the process can be found in our GitHub repository "tagged Courier"; the products, text analysis tools, and additional documentation are in the repository "Curated Courier".

The text of The UNESCO Courier is available in Open Access under the Attribution-ShareAlike 3.0 IGO (CC BY-SA 3.0 IGO) license, in the context of UNESCO's open access publications policy. This dataset is published under the most recent version of the same license: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0 Deed). These datasets were developed as part of the research project "International Ideas at UNESCO: Digital Approaches to Global Conceptual History" (INIDUN), led by Benjamin G. Martin at Uppsala University and funded by a grant from the Swedish Research Council (Vetenskapsrådet), 2020-2024. For more information, see https://inidun.github.io, as well as the project repository on GitHub, which includes documentation and files related to the curating process.
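As a minimal sketch of how the article corpus might be explored once the download is unpacked locally, the snippet below loads the metadata index and reads one article text. The directory layout and the "filename" column name are assumptions, not part of the published specification; only document_index.csv is named in the description above.

```python
# Sketch only: directory layout and the "filename" column are assumptions.
from pathlib import Path

import pandas as pd

CORPUS_DIR = Path("curated_courier_article_corpus")  # assumed local folder name

# Load the curated metadata index that links each article to its text file.
index = pd.read_csv(CORPUS_DIR / "document_index.csv")
print(index.columns.tolist())          # inspect which metadata fields are available
print(len(index), "articles indexed")

# Read one article text file listed in the index (column name assumed).
first_file = CORPUS_DIR / str(index.iloc[0]["filename"])
if first_file.exists():
    print(first_file.read_text(encoding="utf-8")[:500])
```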
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project contains all the datasets used in the paper 'Early Slavic dative absolutes in discourse: the value of deeply versus strategically annotated treebanks'.

- 'egda_raw.csv' contains all egda-clauses in the Codex Marianus. The only part which has been manipulated is where two subjects were coordinated by i 'and'. In these cases, an extra row was created, allowing both subjects to appear in the ocs_sub_lemma column. The row containing the second subject was left empty under all but the subject lemma variable. This makes it possible to observe frequencies regarding lexical variation among egda-clauses' subjects, while discarding those rows when dealing with other variables.
- 'egda_manipulated.csv' considers all bystъ-clauses as pre-matrix.
- 'DA_Marianus_raw.csv' contains all dative absolutes in the Codex Marianus, as well as genitive absolutes for which there is an OCS parallel. It lists as separate entries both multiple dative participles with one dative subject, and multiple dative subjects with one dative participle. E.g.:
  1) бꙑвъши же печали и гонению словесе ради абье съблажнѣатъ сѧ
  2) и въшедъши дъштери еѩ иродиѣдѣ. i плѧсавъши и оугождъши иродови
  Both 1) and 2) are listed as multiple entries, although only 2) technically has more than one dative absolute.
- 'DA_Marianus_abridged.csv' is the same as DA_Marianus_raw.csv, but lists as one dative absolute instances with multiple dative subjects and one dative participle. The criterion chosen was to retain only the entry for the subject closest to the participle (the choice can make a difference should one want to consider the properties of a dative absolute with respect to its subjects).
- 'DA_Marianus_manipulated.csv' (starting from DA_Marianus_abridged.csv) treats all dative absolutes in bystъ-clauses as pre-matrix.
- 'DA_nogr_raw.csv' contains all the dative absolutes in the second case study (early Slavic texts without Greek parallels).
- 'DA_nogr_harm.csv' contains the same dative absolutes as DA_nogr_raw.csv, but with harmonized Church Slavonic and Old East Slavic spellings.
- 'harmonize.py' is the script used to harmonize the Church Slavonic and Old East Slavic spellings in the paper's second case study.

Readers interested in reproducing the results of the paper should refer to the 'manipulated' versions of both the egda-clause and the dative absolute datasets. A sketch of the two ways the extra subject rows can be used is given below.
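The following is a minimal sketch, assuming the second-subject rows are empty (NaN) in every column except ocs_sub_lemma, of how to use those rows for subject-lemma frequencies and discard them for everything else. The file and column names are taken from the description above; everything else is an assumption.

```python
# Sketch only: assumes second-subject rows are NaN everywhere except ocs_sub_lemma.
import pandas as pd

egda = pd.read_csv("egda_raw.csv")

# For lexical variation among subjects, keep every row: the extra rows
# contribute the lemma of the second coordinated subject.
subject_freq = egda["ocs_sub_lemma"].value_counts()

# For any other variable, drop the extra rows, which are empty under all
# columns except the subject lemma.
other_cols = [c for c in egda.columns if c != "ocs_sub_lemma"]
full_rows = egda.dropna(subset=other_cols, how="all")
print(len(egda), "rows total;", len(full_rows), "rows after discarding second-subject rows")
```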
Mozilla Public License 2.0 (MPL 2.0) https://choosealicense.com/licenses/mpl-2.0/
Axolotl-Spanish-Nahuatl: Parallel corpus for Spanish-Nahuatl machine translation
Dataset Collection
In order to build a good translator, we collected and cleaned two of the most complete Nahuatl-Spanish parallel corpora available: Axolotl, collected by an expert team at UNAM, and the Bible UEDIN Nahuatl-Spanish corpus, crawled by Christos Christodoulopoulos and Mark Steedman from the Bible Gateway site. After this, we ended up with 12,207 samples from Axolotl due to misalignments and… See the full description on the dataset page: https://huggingface.co/datasets/somosnlp-hackathon-2022/Axolotl-Spanish-Nahuatl.
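A minimal sketch of loading the parallel corpus from the Hugging Face Hub follows; the dataset id is taken from the URL above, while the split name and feature names are assumptions that should be checked against the dataset card.

```python
# Sketch only: split ("train") and feature names are assumptions.
from datasets import load_dataset

ds = load_dataset("somosnlp-hackathon-2022/Axolotl-Spanish-Nahuatl")
print(ds)                 # inspect the available splits and features
example = ds["train"][0]  # expected to contain a Spanish/Nahuatl sentence pair
print(example)
```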
This replication package provides all necessary resources to reproduce the dataset and methodological approach described in the Paraly data paper. The dataset consists of three corpora (full texts and metadata) of French literature from the 18th, 19th, and 20th centuries, containing both figurative and concrete linguistic references (annotations) to the concept of paralysis. The texts originate from the “Les classiques de la littérature” collection maintained on Gallica, the digital library of the Bibliothèque nationale de France (BnF). The replication package includes scripts and documentation for data collection, extraction, processing, annotation, and model training. It contains: scripts for data and metadata collection, original OCR-ed texts with metadata from Gallica, text excerpts containing the character sequence “paraly” and their manual annotations, annotation guidelines detailing the methodology used, a pre-trained multilabel classifier trained on the annotated data using the flair library, a graphical user interface application for automatic annotation, code and workflows for processing text corpora. By providing these resources, the replication package enables researchers to reproduce the dataset creation process, refine the annotation workflow, and extend the methodological approach to other literary corpora.
This dataset contains the data analysed in the article "Quantifying the quantitative (re-)turn in historical linguistics" authored by Barbara McGillivray and Gard Jenset and published in the journal "Humanities and Social Sciences Communications" in 2023. The dataset contains our analysis of 63 articles published in 2018 in six historical linguistics journals (Diachronica, Folia Linguistica Historica, Journal of Historical Linguistics, Language Dynamics and Change, Language Variation and Change, and Transactions of the Philological Society). We recorded the following information: the type of evidence base used in the paper (digital corpora, word lists, examples, etc.) and the statistical techniques used for the analysis, if any (t-tests, regression models, principal component analysis, etc.). We then classified the articles across two dimensions: corpus-based vs. non-corpus-based and quantitative vs. non-quantitative.
NOAA's National Geophysical Data Center (NGDC) is building high-resolution digital elevation models (DEMs) for select U.S. coastal regions. These integrated bathymetric-topographic DEMs are used to support tsunami forecasting and modeling efforts at the NOAA Center for Tsunami Research, Pacific Marine Environmental Laboratory (PMEL). The DEMs are part of the tsunami forecast system SIFT (Short-term Inundation Forecasting for Tsunamis) currently being developed by PMEL for the NOAA Tsunami Warning Centers, and are used in the MOST (Method of Splitting Tsunami) model developed by PMEL to simulate tsunami generation, propagation, and inundation. Bathymetric, topographic, and shoreline data used in DEM compilation are obtained from various sources, including NGDC, the U.S. National Ocean Service (NOS), the U.S. Geological Survey (USGS), the U.S. Army Corps of Engineers (USACE), the Federal Emergency Management Agency (FEMA), and other federal, state, and local government agencies, academic institutions, and private companies. DEMs are referenced to the vertical tidal datum of Mean High Water (MHW) and horizontal datum of World Geodetic System 1984 (WGS84). Grid spacings for the DEMs range from 1/3 arc-second (~10 meters) to 3 arc-seconds (~90 meters).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset containing three subgenre-specific .xlsx files for the exercises in Episode 2 of the Processing Text-Based Corpora for Musical Discourse Analysis lesson of the Accelerating Digital Skills for Music Researchers project. The original data was collected from Boomkat.com with permission.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project contains all the datasets and scripts used for the paper:
Pedrazzini, Nilo. 2022. One question, different annotation depths: A case study in Early Slavic. Journal of Historical Syntax (Special Collection 'Annotating Historical Corpora') 6(7). 1-40. DOI: 10.18148/hs/2022.v6i4-11.96
Content:
- das_marianus.csv: all dative absolutes found through TOROT in the Codex Marianus. Used for case study 1 (on 'deeply-annotated treebanks', Section 1 of the paper).
- xadvs_marianus.csv: all conjunct participles found through TOROT in the Codex Marianus. Used for case study 1 (on 'deeply-annotated treebanks', Section 1 of the paper).
- absdat_nogr.csv: all dative absolutes found through TOROT (except the Codex Marianus), as of June 2020. Used for case study 2 (on 'shallowly-annotated treebanks', Section 2 of the paper).
- bdinski_da.csv: dative absolutes found in the Story of Abraham of Qidun and his niece Mary (Bdinski Sbornik). Used for case study 3 (on 'strategically-annotated treebanks', Section 3 of the paper).
- JHS_Pedrazzini.R: R script for all the frequencies and plots in the paper.
- harmon.py: script used to harmonize the Church Slavonic and Old East Slavic spellings in case study 2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This set of datasets was made to analyze information credibility in general (rumor and disinformation in English and French documents) as it occurs on the social web. Target databases about rumors, hoaxes and disinformation helped to collect clearly identified misinformation, and topics (with keywords) helped us to build corpora from the microblogging platform Twitter, a major source of rumors and disinformation.

The collection comprises: 1 corpus of texts from the web database about rumors and disinformation; 4 corpora from Twitter about specific rumors (2 in English, 2 in French); 4 corpora from Twitter built at random (2 in English, 2 in French); and 4 corpora from Twitter about specific events (2 in English, 2 in French).

Sizes of the different corpora:
- Social Web Rumorous corpus: 1,612
- French Hollande Rumorous corpus (Twitter): 371
- French Lemon Rumorous corpus (Twitter): 270
- English Pin Rumorous corpus (Twitter): 679
- English Swine Rumorous corpus (Twitter): 1,024
- French 1st Random corpus (Twitter): 1,000
- French 2nd Random corpus (Twitter): 1,000
- English 3rd Random corpus (Twitter): 1,000
- English 4th Random corpus (Twitter): 1,000
- French Rihanna Event corpus (Twitter): 543
- English Rihanna Event corpus (Twitter): 1,000
- French Euro2016 Event corpus (Twitter): 1,000
- English Euro2016 Event corpus (Twitter): 1,000

A matrix links each tweet with the 50 most frequent words.

Text data:
- _id: message id
- body text: string text data

Matrix data: 52 columns (the first column is the id, the second column is the rumor indicator, 1 or -1, and the remaining columns are words, with value 1 if the message contains the word and 0 if it does not) and 11,102 lines (each line is a message).
- Hidalgo corpus: lines 1-75
- Lemon corpus: lines 76-467
- Pin rumor: lines 468-656
- Swine: lines 657-1311
- Random messages: lines 1312-11103

The sample contains the French Pin Rumorous corpus (Twitter): 679. Its matrix data has 52 columns (same layout as above: id, rumor indicator 1 or -1, then word columns with values 1 or 0) and 189 lines (each line is a message).
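A minimal sketch of working with the tweet/word matrix is shown below, assuming it is distributed as a delimited text file ("matrix.csv" is a placeholder name). The column layout and the 1-based line ranges follow the description above.

```python
# Sketch only: "matrix.csv" is a placeholder file name.
import pandas as pd

matrix = pd.read_csv("matrix.csv")
ids, labels = matrix.iloc[:, 0], matrix.iloc[:, 1]   # id, rumor indicator (1 / -1)
word_features = matrix.iloc[:, 2:]                   # 50 binary word-presence columns

# Slice the sub-corpora by the line ranges given in the description.
hidalgo     = matrix.iloc[0:75]
lemon       = matrix.iloc[75:467]
pin         = matrix.iloc[467:656]
swine       = matrix.iloc[656:1311]
random_msgs = matrix.iloc[1311:11103]

print((labels == 1).sum(), "rumorous messages,", (labels == -1).sum(), "non-rumorous")
```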
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Overview:
The CorCenCC corpus contains over 11 million words (circa 14.4m tokens) from written, spoken and electronic (online, digital texts) Welsh language sources, taken from a range of genres, language varieties (regional and social) and contexts. The contributors to CorCenCC are representative of the over half a million Welsh speakers in the country. The creation of CorCenCC was a community-driven project, which offered users of Welsh an opportunity to be proactive in contributing to a Welsh language resource that reflects how Welsh is currently used.

To make CorCenCC as representative of contemporary Welsh as possible, the project team designed a bespoke sampling framework. Extracts were collected from sources including, for example, journals, emails, sermons, road signs, TV programmes, meetings, magazines and books. Conversations were recorded by the research team, and a specially designed crowdsourcing app (see: https://www.corcencc.org/app/) enabled Welsh speakers in the community to record and upload samples of their own language use to the corpus. The published corpus therefore contains data from Welsh speakers from all kinds of backgrounds, abilities and contexts, capturing how Welsh is truly used today across the country.

A beta version of some bilingual corpus query tools has also been created as part of the CorCenCC project (see: www.corcencc.org/explore). These include simple query, full query, frequency list, n-gram, keyword and collocation functionalities. The CorCenCC website also contains Y Tiwtiadur, a collection of data-driven teaching and learning tools designed to help supplement Welsh language learning at all ages and levels. Y Tiwtiadur contains four distinct corpus-based exercises: Gap Filling (Cloze), Vocabulary Profiler, Word Identification and Word-in-Context (see: https://www.corcencc.org/y-tiwtiadur/).

The CorCenCC project was led by Dawn Knight (KnightD5@cardiff.ac.uk) at the Centre for Language and Communication Research, Cardiff University. The full project team comprised 1 Principal Investigator (PI – Dawn Knight), 2 Co-Investigators (CIs – Steve Morris and Tess Fitzpatrick), who made up, with the PI, the CorCenCC Management Team, a total of 7 other CIs, and 8 Research Assistants/Associates over the course of the project. In addition, there were 11 advisory board members, 6 consultants (from 4 countries around the world), 2 PhD students, 4 undergraduate summer placement students, 4 professional service support staff, 4 project ambassadors and 2 project volunteers. More information can be found on the project website: www.corcencc.org

Dataset:
The CorCenCC dataset includes 14,338,149 tokens (circa 11.2 million words). The data in CorCenCC represents a wide range of contexts, genres and topics. This data has, as far as possible, been anonymised using a combination of manual and automated techniques, and has been fully tagged in terms of part-of-speech (POS) and semantic categories. The POS and semantic tagging was carried out using the CyTag and SemCyTag tools, available from CorCenCC's GitHub website: https://github.com/CorCenCC

The following files are included in this dataset:
- categorisation_guide: guide to interpreting columns in CorCenCC's corpus tables/files.
- categorization: links individual contribution_ids to specific taxonomy_ids (from the corpus design frame). Refer to the taxonomy file for details.
- complete_corpus: zipped folder containing all individual contribution files (data is fully POS and semantically tagged).
- contrib_links: links specific contributor_ids to individual contributions.
- contribution: list of all contributions in the corpus (linking to specific modes).
- contributor: contributor metadata for the complete corpus.
- corpus_data: fully POS and semantically tagged CorCenCC corpus data.
- electronic: metadata associated with individual contribution_ids (electronic mode).
- spoken: metadata associated with individual contribution_ids (spoken mode).
- taxonomy: metadata taxonomy guide, used as a basis for classifying contributions according to their genre, context, location, target audience, topic, who (i.e. interlocutors), and source.
- written: metadata associated with individual contribution_ids (written mode).

The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and are thus freely available for use by professional communities and individuals with an interest in language.

Funding information:
The research on which this dataset, the accompanying software tools, and the online corpus resource are based was funded by the UK Economic and Social Research Council (ESRC) and Arts and Humanities Research Council (AHRC) as the Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction project (Grant Number ES/M011348/1).
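A minimal sketch of linking contributions to contributor metadata via the contrib_links table follows; the ".csv" extensions and the "contribution_id"/"contributor_id" column names are assumptions that should be checked against the categorisation_guide shipped with the dataset.

```python
# Sketch only: file extensions and id column names are assumptions.
import pandas as pd

contribution  = pd.read_csv("contribution.csv")
contributor   = pd.read_csv("contributor.csv")
contrib_links = pd.read_csv("contrib_links.csv")

# Join contributions to their contributors via the linking table.
linked = (contribution
          .merge(contrib_links, on="contribution_id")
          .merge(contributor, on="contributor_id"))
print(len(linked), "contribution-contributor links")
```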
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The ItAnt lexicon v.1 is a lexicon for the Restsprachen of Ancient Italy. It encodes lexical entries for four such languages, namely Oscan, Venetic, Neo-Faliscan and Cisalpine Celtic. Entries are encoded at the morphosyntactic, semantic and etymological levels. Attestations are also encoded, with links to the textual evidence contained in the digital TEI EpiDoc corpora of inscriptions. For semantic encoding, we have adopted the classification of semantic fields proposed by Buck (1949) and formalised it as a SKOS taxonomy and Ontolex Lexical Concepts. Etymological information includes links to the PIE and PIT roots, for which skeletal lexical entries are created, and to cognate words in sister languages such as Latin, Marrucinian, Sabine, Vestinian, and many others. The lexicon is compliant with Semantic Web standards, as it is modelled according to the Ontolex-lemon model and its extensions. This RDF version in Turtle format is exported from the DigItAnt platform and can be uploaded to any triplestore. The lexicon is interlinked with the ItAnt Bibliographic dataset, the ItAnt digital corpora of inscriptions, the LiLa Knowledge Base (https://lila-erc.eu/data-page/), and the IE Lexicon (https://lrc.la.utexas.edu/lex).
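Since the export is plain Turtle following Ontolex-lemon, it can also be queried locally without a triplestore; the sketch below uses rdflib to list lexical entries. The file name is a placeholder; the ontolex:LexicalEntry class follows from the model named above.

```python
# Sketch only: "itant_lexicon.ttl" is a placeholder file name.
from rdflib import Graph

g = Graph()
g.parse("itant_lexicon.ttl", format="turtle")

# List all Ontolex lexical entries in the export.
query = """
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
SELECT ?entry WHERE { ?entry a ontolex:LexicalEntry . }
"""
for row in g.query(query):
    print(row.entry)
```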
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The minutes of the meetings of the General Conference of UNESCO offer a rich resource for research on global themes in the humanities. UNESCO has published the minutes of these meetings (the “verbatim record”) since 1947 in a series called Records of the General Conference: Proceedings. UNESCO makes a portion of the Proceedings volumes available online in PDF form via the UNESDOC digital library. These files make it possible for users to read selected volumes, but they do not allow for full-text searching, much less any more sophisticated computational text analysis methods.
This corpus assembles the texts of the “verbatim record” section from all issues of Proceedings from 1947 to 2017, in English and/or French, generating a text corpus that is machine-readable, accessible, and reusable for digital text analysis.
Proceedings was published in parallel English and French editions from 1947 to 1962. Since then, it has appeared in a single multilingual volume including interventions in UNESCO’s six official languages, four of which (Arabic, Chinese, Russian and Spanish) are translated into either English or French. We deploy a language-recognition algorithm to isolate the text sections in English and French, thus creating a single bilingual corpus of circa 21 million words that includes all interventions made at these meetings.
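To illustrate the kind of language-recognition step described here, the following is a minimal sketch using the langdetect package as a stand-in; the project's actual implementation (in its GitHub repository) may differ.

```python
# Sketch only: langdetect is used here as a stand-in for the project's own
# language-recognition code.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def split_by_language(paragraphs):
    """Route each paragraph into an English or a French bucket."""
    english, french = [], []
    for para in paragraphs:
        if not para.strip():
            continue
        try:
            lang = detect(para)
        except LangDetectException:
            continue  # too little text to classify reliably
        if lang == "en":
            english.append(para)
        elif lang == "fr":
            french.append(para)
    return english, french

en, fr = split_by_language(["The session was opened.", "La séance est ouverte."])
print(len(en), "English paragraphs,", len(fr), "French paragraphs")
```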
Our Proceedings package on GitHub also includes: (1) the corpus, in both English and French; (2) code written to curate the corpus; (3) metadata files identifying each session and meeting; and (4) supplementary materials, such as documentation and quality control files. Our goal in creating this package has been to make this valuable source accessible for new forms of digital research. This corpus is, naturally, a preliminary version. Much work can still be done to fine-tune the language recognition and improve the quality of the corpus as a whole.
The text of Proceedings is available in Open Access under the Attribution-ShareAlike 3.0 IGO (CC-BY-SA 3.0 IGO) license, in the context of UNESCO's open access publications policy. Our corpus is published under the most recent version of the same license: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0 Deed).
This corpus and related materials were developed as part of the research project "International Ideas at UNESCO: Digital Approaches to Global Conceptual History" (INIDUN), led by Benjamin G. Martin at Uppsala University and funded by a grant from the Swedish Research Council (Vetenskapsrådet dnr. 2019-03278), 2020-2024. For more information, see: inidun.github.io, as well as the project repository on GitHub, which includes documentation and files related to the curating process.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Trial release for Zenodo archiving
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The IMP digital library contains historical Slovene books and other publications, a total of 658 texts with over 45,000 pages from the period 1584-1919. Each text contains extensive metadata, per-page links to facsimiles, and hand-corrected transcriptions with structural and editorial annotations.
These texts were annotated to be used as a language corpus. In the corpus, each word is marked up with its modernised form, lemma, and morphosyntactic description (fine-grained PoS tag). Note that the annotations are automatic, so they contain a fair number of errors.
The digital library is available in source TEI P5 XML and derived HTML. The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers, e.g. CWB and Sketch Engine. Note that the vertical format does not contain all the information from the source TEI.
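A minimal sketch of reading the vertical-format corpus follows. One token per line with tab-separated attributes and structural tags on their own lines is how vertical files generally work, but the exact column order (word, modernised form, lemma, MSD) is an assumption to be checked against the corpus documentation.

```python
# Sketch only: the column order in the vertical file is an assumption.
def read_vertical(path):
    tokens = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line or line.startswith("<"):   # skip structural tags like <s>, <text>
                continue
            cols = line.split("\t")
            if len(cols) >= 4:
                word, modern, lemma, msd = cols[:4]
                tokens.append({"word": word, "modern": modern,
                               "lemma": lemma, "msd": msd})
    return tokens

# Example: tokens = read_vertical("imp-corpus.vert")  # placeholder file name
```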
A 1/3 arc-second Mean Lower Low Water bathymetric DEM of NOS hydrographic survey data in Corpus Christi Bay.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is a repository for the code and datasets for the open-access paper in Linguistik Indonesia, the flagship journal of the Linguistic Society of Indonesia (Masyarakat Linguistik Indonesia [MLI]) (cf. the link in the references below).

To cite the paper (in APA 6th style):
Rajeg, G. P. W., Denistia, K., & Rajeg, I. M. (2018). Working with a linguistic corpus using R: An introductory note with Indonesian negating construction. Linguistik Indonesia, 36(1), 1–36. doi: 10.26499/li.v36i1.71

To cite this repository: click on Cite (the dark-pink button on the top left) and select the citation style through the dropdown button (the default style is the Datacite option, on the right-hand side).

This repository consists of the following files:
1. Source R Markdown Notebook (.Rmd file) used to write the paper and containing the R code to generate the analyses in the paper.
2. Tutorial to download the Leipzig Corpus file used in the paper. It is freely available on the Leipzig Corpora Collection Download page.
3. Accompanying datasets as images and in .rds format so that all code chunks in the R Markdown file can be run.
4. BibLaTeX and .csl files for the referencing and bibliography (in APA 6th style).
5. A snippet of the R session info after running all the code in the R Markdown file.
6. RStudio project file (.Rproj). Double-click on this file to open an RStudio session associated with the content of this repository. See here and here for details on project-based workflow in RStudio.
7. A .docx template file following the basic stylesheet for Linguistik Indonesia.

Put all these files in the same folder (including the downloaded Leipzig corpus file)!

To render the R Markdown into an MS Word document, we use the bookdown R package (Xie, 2018). Make sure this package is installed in R.

Yihui Xie (2018). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.6.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A digital corpus on variation in German (1800-1950)

The German Innsbruck Corpus (GermInnC) 1800-1950 is a digitised corpus built after the fashion of the German Manchester Corpus (GerManC) 1650-1800 (cf. Scheible et al. 2011; Durrell et al. 2012). Hence, the corpus design of the GermInnC is balanced according to period, region and genre.

The GermInnC consists of ca. 840,000 tokens, ca. 120,000 per genre (seven in total: Drama, Humanities, Legal texts, Narrative prose, Newspapers, Scientific texts, Sermons). It is subdivided into three periods, 1800-1850, 1851-1900 and 1901-1950, as well as five regions: North German, West Central German, East Central German, West Upper German (including Switzerland), and East Upper German (including Austria).

The corpus can be retrieved as a raw version, a lemmatised, fully annotated version, or an "all data" file (including metadata annotation of file names and periods) for further import and processing. The Stuttgart Tag Set (STTS) and the POS tagger TreeTagger were used for linguistic annotation.

Two documentation files (Word and Excel, both included in the download package) provide a more detailed description of the corpus and the digitisation.

The corpus may be of interest to all scholars working on the history of the German language, standardisation of German, variation and change, historical sociolinguistics, and Germanic linguistics.

The corpus was generously funded by the early career funding of the University of Innsbruck (October 2018 through September 2019).
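As a minimal sketch of working with the lemmatised, fully annotated version, the snippet below counts STTS tag frequencies, assuming the annotation follows TreeTagger's standard three-column output (token, STTS tag, lemma); the actual file layout should be checked against the documentation files shipped with the corpus.

```python
# Sketch only: assumes TreeTagger-style tab-separated token/tag/lemma lines,
# and "germinnc_annotated.txt" is a placeholder file name.
from collections import Counter

def stts_tag_frequencies(path):
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 3:
                token, tag, lemma = cols
                counts[tag] += 1
    return counts

# Example: print(stts_tag_frequencies("germinnc_annotated.txt").most_common(10))
```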