Disk images, memory dumps, network packet captures, and files for use in digital forensics research and education. All of this information is accessible through the digitalcorpora.org website, and made available at s3://digitalcorpora/. Some of these datasets implement scenarios that were performed by students, faculty, and others acting in persona. As such, the information is synthetic and may be used without prior authorization or IRB approval. Details of these datasets can be found at http://www.simson.net/clips/academic/2009.DFRWS.Corpora.pdf
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
I needed a dataset of PDF files as images for a project and couldn't find another source online, so I decided to make my own. The dataset currently consists of roughly 30k JPG images, but more might be added in the future.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The DECM Corpus is a digital corpus of the texts of the Relaciones Geográficas de Nueva España (the Geographic Reports of New Spain) in several versions, including a machine-ready version, a gold-standard annotated dataset, and an automatically annotated version ready for text mining and machine learning experiments. This is the version of the entire RG corpus automatically annotated using the ML models trained with the DECM Gold Standard Corpus. The files are available in JSON and TSV format, and the download also contains the file for the DECM Ontology. This corpus can be used for quantitative and qualitative research, as well as advanced analyses using text mining techniques, corpus linguistics and other methods such as Geographical Text Analysis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Founded in 1948 as the official magazine of the United Nations Educational, Scientific and Cultural Organization, The UNESCO Courier represents an extraordinary resource for research on global themes in the humanities. The complete archive of the magazine is available in PDF form through UNESCO. These files make it possible for users anywhere to read individual issues, but they do not allow for full-text searching, much less any of the computational text analysis methods that have recently made important advances in humanities research. The Curated Courier 1.0 is a package of digital text corpora, text analysis tools, and supplementary materials that makes the complete archive of The UNESCO Courier from 1948 to 2020 machine-readable, accessible, and reusable for digital text analysis.

Here on Zenodo we publish two Courier corpora. The first corpus (curated_courier_article_corpus) consists of the texts of all articles published in the English-language edition of The UNESCO Courier between 1948 and 2020. For this corpus we have extracted and reconstructed the complete text of all articles, for example by pulling together non-contiguous pages where necessary and by removing non-article text (masthead, photo captions, letters to the editor, and so on). We have linked each article to a comprehensive curated metadata index, included in the download (document_index.csv). The second corpus (curated_issues) compiles the complete text of all Courier issues (English-language edition), 1948-2020. To prepare this corpus we extracted text from the PDFs that UNESCO has made available, used multiple modes of OCR, and rendered each issue as a simple text file. Our test of the OCR quality finds an average error rate of 0.7%, which should be considered good quality. Working data from the process can be found in our GitHub repository "tagged Courier"; the products, text analysis tools, and additional documentation are in the repository "Curated Courier".

The text of The UNESCO Courier is available in Open Access under the Attribution-ShareAlike 3.0 IGO (CC BY-SA 3.0 IGO) license, in the context of UNESCO's open access publications policy. This dataset is published under the most recent version of the same license: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0 Deed). These datasets were developed as part of the research project "International Ideas at UNESCO: Digital Approaches to Global Conceptual History" (INIDUN), led by Benjamin G. Martin at Uppsala University and funded by a grant from the Swedish Research Council (Vetenskapsrådet), 2020-2024. For more information, see https://inidun.github.io, as well as the project repository on GitHub, which includes documentation and files related to the curating process.
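As a minimal sketch of how the article corpus might be explored once the download is unpacked locally, the snippet below loads the metadata index and reads one article text. The directory layout and the "filename" column name are assumptions, not part of the published specification; only document_index.csv is named in the description above.

```python
# Sketch only: directory layout and the "filename" column are assumptions.
from pathlib import Path

import pandas as pd

CORPUS_DIR = Path("curated_courier_article_corpus")  # assumed local folder name

# Load the curated metadata index that links each article to its text file.
index = pd.read_csv(CORPUS_DIR / "document_index.csv")
print(index.columns.tolist())          # inspect which metadata fields are available
print(len(index), "articles indexed")

# Read one article text file listed in the index (column name assumed).
first_file = CORPUS_DIR / str(index.iloc[0]["filename"])
if first_file.exists():
    print(first_file.read_text(encoding="utf-8")[:500])
```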
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project contains all the datasets used in the paper 'Early Slavic dative absolutes in discourse: the value of deeply versus strategically annotated treebanks'.

- 'egda_raw.csv' contains all egda-clauses in the Codex Marianus. The only part which has been manipulated is where two subjects were coordinated by i 'and'. In these cases, an extra row was created, allowing both subjects to appear in the ocs_sub_lemma column. The row containing the second subject was left empty under all but the subject lemma variable. This makes it possible to observe frequencies regarding lexical variation among egda-clauses' subjects, while discarding those rows when dealing with other variables.
- 'egda_manipulated.csv' considers all bystъ-clauses as pre-matrix.
- 'DA_Marianus_raw.csv' contains all dative absolutes in the Codex Marianus, as well as genitive absolutes for which there is an OCS parallel. It lists as separate entries both multiple dative participles with one dative subject, and multiple dative subjects with one dative participle. E.g.:
  1) бꙑвъши же печали и гонению словесе ради абье съблажнѣатъ сѧ
  2) и въшедъши дъштери еѩ иродиѣдѣ. i плѧсавъши и оугождъши иродови
  Both 1) and 2) are listed as multiple entries, although only 2) technically has more than one dative absolute.
- 'DA_Marianus_abridged.csv' is the same as DA_Marianus_raw.csv, but lists as one dative absolute instances with multiple dative subjects and one dative participle. The criterion chosen was to retain only the entry for the subject closest to the participle (the choice can make a difference should one want to consider the properties of a dative absolute with respect to its subjects).
- 'DA_Marianus_manipulated.csv' (starting from DA_Marianus_abridged.csv) treats all dative absolutes in bystъ-clauses as pre-matrix.
- 'DA_nogr_raw.csv' contains all the dative absolutes in the second case study (early Slavic texts without Greek parallels).
- 'DA_nogr_harm.csv' contains the same dative absolutes as DA_nogr_raw.csv, but with harmonized Church Slavonic and Old East Slavic spellings.
- 'harmonize.py' is the script used to harmonize the Church Slavonic and Old East Slavic spellings in the paper's second case study.

Readers interested in reproducing the results of the paper should refer to the 'manipulated' versions of both the egda-clause and the dative absolute datasets. A sketch of the two ways the extra subject rows can be used is given below.
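The following is a minimal sketch, assuming the second-subject rows are empty (NaN) in every column except ocs_sub_lemma, of how to use those rows for subject-lemma frequencies and discard them for everything else. The file and column names are taken from the description above; everything else is an assumption.

```python
# Sketch only: assumes second-subject rows are NaN everywhere except ocs_sub_lemma.
import pandas as pd

egda = pd.read_csv("egda_raw.csv")

# For lexical variation among subjects, keep every row: the extra rows
# contribute the lemma of the second coordinated subject.
subject_freq = egda["ocs_sub_lemma"].value_counts()

# For any other variable, drop the extra rows, which are empty under all
# columns except the subject lemma.
other_cols = [c for c in egda.columns if c != "ocs_sub_lemma"]
full_rows = egda.dropna(subset=other_cols, how="all")
print(len(egda), "rows total;", len(full_rows), "rows after discarding second-subject rows")
```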
Mozilla Public License 2.0 (MPL 2.0) https://choosealicense.com/licenses/mpl-2.0/
Axolotl-Spanish-Nahuatl: Parallel corpus for Spanish-Nahuatl machine translation
Dataset Collection
In order to build a good translator, we collected and cleaned two of the most complete Nahuatl-Spanish parallel corpora available: Axolotl, collected by an expert team at UNAM, and the Bible UEDIN Nahuatl-Spanish corpus, crawled by Christos Christodoulopoulos and Mark Steedman from the Bible Gateway site. After this, we ended up with 12,207 samples from Axolotl due to misalignments and… See the full description on the dataset page: https://huggingface.co/datasets/somosnlp-hackathon-2022/Axolotl-Spanish-Nahuatl.
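A minimal sketch of loading the parallel corpus from the Hugging Face Hub follows; the dataset id is taken from the URL above, while the split name and feature names are assumptions that should be checked against the dataset card.

```python
# Sketch only: split ("train") and feature names are assumptions.
from datasets import load_dataset

ds = load_dataset("somosnlp-hackathon-2022/Axolotl-Spanish-Nahuatl")
print(ds)                 # inspect the available splits and features
example = ds["train"][0]  # expected to contain a Spanish/Nahuatl sentence pair
print(example)
```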
This replication package provides all necessary resources to reproduce the dataset and methodological approach described in the Paraly data paper. The dataset consists of three corpora (full texts and metadata) of French literature from the 18th, 19th, and 20th centuries, containing both figurative and concrete linguistic references (annotations) to the concept of paralysis. The texts originate from the “Les classiques de la littérature” collection maintained on Gallica, the digital library of the Bibliothèque nationale de France (BnF). The replication package includes scripts and documentation for data collection, extraction, processing, annotation, and model training. It contains: scripts for data and metadata collection, original OCR-ed texts with metadata from Gallica, text excerpts containing the character sequence “paraly” and their manual annotations, annotation guidelines detailing the methodology used, a pre-trained multilabel classifier trained on the annotated data using the flair library, a graphical user interface application for automatic annotation, code and workflows for processing text corpora. By providing these resources, the replication package enables researchers to reproduce the dataset creation process, refine the annotation workflow, and extend the methodological approach to other literary corpora.
This dataset contains the data analysed in the article "Quantifying the quantitative (re-)turn in historical linguistics" authored by Barbara McGillivray and Gard Jenset and published in the journal "Humanities and Social Sciences Communications" in 2023. The dataset contains our analysis of 63 articles published in 2018 in six historical linguistics journals (Diachronica, Folia Linguistica Historica, Journal of Historical Linguistics, Language Dynamics and Change, Language Variation and Change, and Transactions of the Philological Society). We recorded the following information: the type of evidence base used in the paper (digital corpora, word lists, examples, etc.) and the statistical techniques used for the analysis, if any (t-tests, regression models, principal component analysis, etc.). We then classified the articles across two dimensions: corpus-based vs. non-corpus-based and quantitative vs. non-quantitative.
NOAA's National Geophysical Data Center (NGDC) is building high-resolution digital elevation models (DEMs) for select U.S. coastal regions. These integrated bathymetric-topographic DEMs are used to support tsunami forecasting and modeling efforts at the NOAA Center for Tsunami Research, Pacific Marine Environmental Laboratory (PMEL). The DEMs are part of the tsunami forecast system SIFT (Short-term Inundation Forecasting for Tsunamis) currently being developed by PMEL for the NOAA Tsunami Warning Centers, and are used in the MOST (Method of Splitting Tsunami) model developed by PMEL to simulate tsunami generation, propagation, and inundation. Bathymetric, topographic, and shoreline data used in DEM compilation are obtained from various sources, including NGDC, the U.S. National Ocean Service (NOS), the U.S. Geological Survey (USGS), the U.S. Army Corps of Engineers (USACE), the Federal Emergency Management Agency (FEMA), and other federal, state, and local government agencies, academic institutions, and private companies. DEMs are referenced to the vertical tidal datum of Mean High Water (MHW) and horizontal datum of World Geodetic System 1984 (WGS84). Grid spacings for the DEMs range from 1/3 arc-second (~10 meters) to 3 arc-seconds (~90 meters).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset containing three subgenre-specific .xlsx files for the exercises in Episode 2 of the Processing Text-Based Corpora for Musical Discourse Analysis lesson of the Accelerating Digital Skills for Music Researchers project. The original data was collected from Boomkat.com with permission.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project contains all the datasets and scripts used for the paper:
Pedrazzini, Nilo. 2022. One question, different annotation depths: A case study in Early Slavic. Journal of Historical Syntax (Special Collection 'Annotating Historical Corpora') 6(7). 1-40. DOI: 10.18148/hs/2022.v6i4-11.96
Content:
- das_marianus.csv: all dative absolutes found through TOROT in the Codex Marianus. Used for case study 1 (on 'deeply-annotated treebanks', Section 1 of the paper).
- xadvs_marianus.csv: all conjunct participles found through TOROT in the Codex Marianus. Used for case study 1 (on 'deeply-annotated treebanks', Section 1 of the paper).
- absdat_nogr.csv: all dative absolutes found through TOROT (except the Codex Marianus), as of June 2020. Used for case study 2 (on 'shallowly-annotated treebanks', Section 2 of the paper).
- bdinski_da.csv: dative absolutes found in the Story of Abraham of Qidun and his niece Mary (Bdinski Sbornik). Used for case study 3 (on 'strategically-annotated treebanks', Section 3 of the paper).
- JHS_Pedrazzini.R: R script for all the frequencies and plots in the paper.
- harmon.py: script used to harmonize the Church Slavonic and Old East Slavic spellings in case study 2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This set of datasets was made to analyze information credibility in general (rumor and disinformation in English and French documents) as it occurs on the social web. Target databases about rumors, hoaxes and disinformation helped to collect clearly identified misinformation, and topics (with keywords) helped us to build corpora from the microblogging platform Twitter, a major source of rumors and disinformation.

The collection comprises: 1 corpus of texts from the web database about rumors and disinformation; 4 corpora from Twitter about specific rumors (2 in English, 2 in French); 4 corpora from Twitter built at random (2 in English, 2 in French); and 4 corpora from Twitter about specific events (2 in English, 2 in French).

Sizes of the different corpora:
- Social Web Rumorous corpus: 1,612
- French Hollande Rumorous corpus (Twitter): 371
- French Lemon Rumorous corpus (Twitter): 270
- English Pin Rumorous corpus (Twitter): 679
- English Swine Rumorous corpus (Twitter): 1,024
- French 1st Random corpus (Twitter): 1,000
- French 2nd Random corpus (Twitter): 1,000
- English 3rd Random corpus (Twitter): 1,000
- English 4th Random corpus (Twitter): 1,000
- French Rihanna Event corpus (Twitter): 543
- English Rihanna Event corpus (Twitter): 1,000
- French Euro2016 Event corpus (Twitter): 1,000
- English Euro2016 Event corpus (Twitter): 1,000

A matrix links each tweet with the 50 most frequent words.

Text data:
- _id: message id
- body text: string text data

Matrix data: 52 columns (the first column is the id, the second column is the rumor indicator, 1 or -1, and the remaining columns are words, with value 1 if the message contains the word and 0 if it does not) and 11,102 lines (each line is a message).
- Hidalgo corpus: lines 1-75
- Lemon corpus: lines 76-467
- Pin rumor: lines 468-656
- Swine: lines 657-1311
- Random messages: lines 1312-11103

The sample contains the French Pin Rumorous corpus (Twitter): 679. Its matrix data has 52 columns (same layout as above: id, rumor indicator 1 or -1, then word columns with values 1 or 0) and 189 lines (each line is a message).
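A minimal sketch of working with the tweet/word matrix is shown below, assuming it is distributed as a delimited text file ("matrix.csv" is a placeholder name). The column layout and the 1-based line ranges follow the description above.

```python
# Sketch only: "matrix.csv" is a placeholder file name.
import pandas as pd

matrix = pd.read_csv("matrix.csv")
ids, labels = matrix.iloc[:, 0], matrix.iloc[:, 1]   # id, rumor indicator (1 / -1)
word_features = matrix.iloc[:, 2:]                   # 50 binary word-presence columns

# Slice the sub-corpora by the line ranges given in the description.
hidalgo     = matrix.iloc[0:75]
lemon       = matrix.iloc[75:467]
pin         = matrix.iloc[467:656]
swine       = matrix.iloc[656:1311]
random_msgs = matrix.iloc[1311:11103]

print((labels == 1).sum(), "rumorous messages,", (labels == -1).sum(), "non-rumorous")
```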
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Overview:
The CorCenCC corpus contains over 11 million words (circa 14.4m tokens) from written, spoken and electronic (online, digital texts) Welsh language sources, taken from a range of genres, language varieties (regional and social) and contexts. The contributors to CorCenCC are representative of the over half a million Welsh speakers in the country. The creation of CorCenCC was a community-driven project, which offered users of Welsh an opportunity to be proactive in contributing to a Welsh language resource that reflects how Welsh is currently used.

To make CorCenCC as representative of contemporary Welsh as possible, the project team designed a bespoke sampling framework. Extracts were collected from sources including, for example, journals, emails, sermons, road signs, TV programmes, meetings, magazines and books. Conversations were recorded by the research team, and a specially designed crowdsourcing app (see: https://www.corcencc.org/app/) enabled Welsh speakers in the community to record and upload samples of their own language use to the corpus. The published corpus therefore contains data from Welsh speakers from all kinds of backgrounds, abilities and contexts, capturing how Welsh is truly used today across the country.

A beta version of some bilingual corpus query tools has also been created as part of the CorCenCC project (see: www.corcencc.org/explore). These include simple query, full query, frequency list, n-gram, keyword and collocation functionalities. The CorCenCC website also contains Y Tiwtiadur, a collection of data-driven teaching and learning tools designed to help supplement Welsh language learning at all ages and levels. Y Tiwtiadur contains four distinct corpus-based exercises: Gap Filling (Cloze), Vocabulary Profiler, Word Identification and Word-in-Context (see: https://www.corcencc.org/y-tiwtiadur/).

The CorCenCC project was led by Dawn Knight (KnightD5@cardiff.ac.uk) at the Centre for Language and Communication Research, Cardiff University. The full project team comprised 1 Principal Investigator (PI – Dawn Knight), 2 Co-Investigators (CIs – Steve Morris and Tess Fitzpatrick), who made up, with the PI, the CorCenCC Management Team, a total of 7 other CIs, and 8 Research Assistants/Associates over the course of the project. In addition, there were 11 advisory board members, 6 consultants (from 4 countries around the world), 2 PhD students, 4 undergraduate summer placement students, 4 professional service support staff, 4 project ambassadors and 2 project volunteers. More information can be found on the project website: www.corcencc.org

Dataset:
The CorCenCC dataset includes 14,338,149 tokens (circa 11.2 million words). The data in CorCenCC represents a wide range of contexts, genres and topics. This data has, as far as possible, been anonymised using a combination of manual and automated techniques, and has been fully tagged in terms of part-of-speech (POS) and semantic categories. The POS and semantic tagging was carried out using the CyTag and SemCyTag tools, available from CorCenCC's GitHub website: https://github.com/CorCenCC

The following files are included in this dataset:
- categorisation_guide: guide to interpreting columns in CorCenCC's corpus tables/files.
- categorization: links individual contribution_ids to specific taxonomy_ids (from the corpus design frame). Refer to the taxonomy file for details.
- complete_corpus: zipped folder containing all individual contribution files (data is fully POS and semantically tagged).
- contrib_links: links specific contributor_ids to individual contributions.
- contribution: list of all contributions in the corpus (linking to specific modes).
- contributor: contributor metadata for the complete corpus.
- corpus_data: fully POS and semantically tagged CorCenCC corpus data.
- electronic: metadata associated with individual contribution_ids (electronic mode).
- spoken: metadata associated with individual contribution_ids (spoken mode).
- taxonomy: metadata taxonomy guide, used as a basis for classifying contributions according to their genre, context, location, target audience, topic, who (i.e. interlocutors), and source.
- written: metadata associated with individual contribution_ids (written mode).

The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and are thus freely available for use by professional communities and individuals with an interest in language.

Funding information:
The research on which this dataset, the accompanying software tools, and the online corpus resource are based was funded by the UK Economic and Social Research Council (ESRC) and Arts and Humanities Research Council (AHRC) as the Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction project (Grant Number ES/M011348/1).
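A minimal sketch of linking contributions to contributor metadata via the contrib_links table follows; the ".csv" extensions and the "contribution_id"/"contributor_id" column names are assumptions that should be checked against the categorisation_guide shipped with the dataset.

```python
# Sketch only: file extensions and id column names are assumptions.
import pandas as pd

contribution  = pd.read_csv("contribution.csv")
contributor   = pd.read_csv("contributor.csv")
contrib_links = pd.read_csv("contrib_links.csv")

# Join contributions to their contributors via the linking table.
linked = (contribution
          .merge(contrib_links, on="contribution_id")
          .merge(contributor, on="contributor_id"))
print(len(linked), "contribution-contributor links")
```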
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The ItAnt lexicon v.1 is a lexicon for the Restsprachen of Ancient Italy. It encodes lexical entries for four such languages, namely Oscan, Venetic, Neo-Faliscan and Cisalpine Celtic. Entries are encoded at the morphosyntactic, semantic and etymological levels. Attestations are also encoded, with links to the textual evidence contained in the digital TEI EpiDoc corpora of inscriptions. For semantic encoding, we have adopted the classification of semantic fields proposed by Buck (1949) and formalised it as a SKOS taxonomy and Ontolex Lexical Concepts. Etymological information includes links to the PIE and PIT roots, for which skeletal lexical entries are created, and to cognate words in sister languages such as Latin, Marrucinian, Sabine, Vestinian, and many others. The lexicon is compliant with Semantic Web standards, as it is modelled according to the Ontolex-lemon model and its extensions. This RDF version in Turtle format is exported from the DigItAnt platform and can be uploaded to any triplestore. The lexicon is interlinked with the ItAnt Bibliographic dataset, the ItAnt digital corpora of inscriptions, the LiLa Knowledge Base (https://lila-erc.eu/data-page/), and the IE Lexicon (https://lrc.la.utexas.edu/lex).
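Since the export is plain Turtle following Ontolex-lemon, it can also be queried locally without a triplestore; the sketch below uses rdflib to list lexical entries. The file name is a placeholder; the ontolex:LexicalEntry class follows from the model named above.

```python
# Sketch only: "itant_lexicon.ttl" is a placeholder file name.
from rdflib import Graph

g = Graph()
g.parse("itant_lexicon.ttl", format="turtle")

# List all Ontolex lexical entries in the export.
query = """
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
SELECT ?entry WHERE { ?entry a ontolex:LexicalEntry . }
"""
for row in g.query(query):
    print(row.entry)
```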
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The minutes of the meetings of the General Conference of UNESCO offer a rich resource for research on global themes in the humanities. UNESCO has published the minutes of these meetings (the “verbatim record”) since 1947 in a series called Records of the General Conference: Proceedings. UNESCO makes a portion of the Proceedings volumes available online in PDF form via the UNESDOC digital library. These files make it possible for users to read selected volumes, but they do not allow for full-text searching, much less any more sophisticated computational text analysis methods.
This corpus assembles the texts of the “verbatim record” section from all issues of Proceedings from 1947 to 2017, in English and/or French, generating a text corpus that is machine-readable, accessible, and reusable for digital text analysis.
Proceedings was published in parallel English and French editions from 1947 to 1962. Since then, it has appeared in a single multilingual volume including interventions in UNESCO’s six official languages, four of which (Arabic, Chinese, Russian and Spanish) are translated into either English or French. We deploy a language-recognition algorithm to isolate the text sections in English and French, thus creating a single bilingual corpus of circa 21 million words that includes all interventions made at these meetings.
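To illustrate the kind of language-recognition step described here, the following is a minimal sketch using the langdetect package as a stand-in; the project's actual implementation (in its GitHub repository) may differ.

```python
# Sketch only: langdetect is used here as a stand-in for the project's own
# language-recognition code.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def split_by_language(paragraphs):
    """Route each paragraph into an English or a French bucket."""
    english, french = [], []
    for para in paragraphs:
        if not para.strip():
            continue
        try:
            lang = detect(para)
        except LangDetectException:
            continue  # too little text to classify reliably
        if lang == "en":
            english.append(para)
        elif lang == "fr":
            french.append(para)
    return english, french

en, fr = split_by_language(["The session was opened.", "La séance est ouverte."])
print(len(en), "English paragraphs,", len(fr), "French paragraphs")
```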
Our Proceedings package on GitHub also includes: (1) the corpus, in both English and French; (2) code written to curate the corpus; (3) metadata files identifying each session and meeting; and (4) supplementary materials, such as documentation and quality control files. Our goal in creating this package has been to make this valuable source accessible for new forms of digital research. This corpus is, naturally, a preliminary version. Much work can still be done to fine-tune the language recognition and improve the quality of the corpus as a whole.
The text of Proceedings is available in Open Access under the Attribution-ShareAlike 3.0 IGO (CC-BY-SA 3.0 IGO) license, in the context of UNESCO's open access publications policy. Our corpus is published under the most recent version of the same license: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0 Deed).
This corpus and related materials were developed as part of the research project "International Ideas at UNESCO: Digital Approaches to Global Conceptual History" (INIDUN), led by Benjamin G. Martin at Uppsala University and funded by a grant from the Swedish Research Council (Vetenskapsrådet dnr. 2019-03278), 2020-2024. For more information, see: inidun.github.io, as well as the project repository on GitHub, which includes documentation and files related to the curating process.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Trial release for Zenodo archiving
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The IMP digital library contains historical Slovene books and other publications, a total of 658 texts with over 45,000 pages from the period 1584-1919. Each text contains extensive metadata, per-page links to facsimiles, and hand-corrected transcriptions with structural and editorial annotations.
These texts were annotated to be used as a language corpus. In the corpus, each word is marked up with its modernised form, lemma, and morphosyntactic description (fine-grained PoS tag). Note that the annotations are automatic, so they contain a fair number of errors.
The digital library is available in source TEI P5 XML and derived HTML. The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers, e.g. CWB and Sketch Engine. Note that the vertical format does not contain all the information from the source TEI.
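A minimal sketch of reading the vertical-format corpus follows. One token per line with tab-separated attributes and structural tags on their own lines is how vertical files generally work, but the exact column order (word, modernised form, lemma, MSD) is an assumption to be checked against the corpus documentation.

```python
# Sketch only: the column order in the vertical file is an assumption.
def read_vertical(path):
    tokens = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line or line.startswith("<"):   # skip structural tags like <s>, <text>
                continue
            cols = line.split("\t")
            if len(cols) >= 4:
                word, modern, lemma, msd = cols[:4]
                tokens.append({"word": word, "modern": modern,
                               "lemma": lemma, "msd": msd})
    return tokens

# Example: tokens = read_vertical("imp-corpus.vert")  # placeholder file name
```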
A 1/3 arc-second Mean Lower Low Water bathymetric DEM of NOS hydrographic survey data in Corpus Christi Bay.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is a repository for the code and datasets for the open-access paper in Linguistik Indonesia, the flagship journal of the Linguistic Society of Indonesia (Masyarakat Linguistik Indonesia [MLI]) (cf. the link in the references below).

To cite the paper (in APA 6th style):
Rajeg, G. P. W., Denistia, K., & Rajeg, I. M. (2018). Working with a linguistic corpus using R: An introductory note with Indonesian negating construction. Linguistik Indonesia, 36(1), 1–36. doi: 10.26499/li.v36i1.71

To cite this repository: click on Cite (the dark-pink button on the top left) and select the citation style through the dropdown button (the default style is the Datacite option, on the right-hand side).

This repository consists of the following files:
1. Source R Markdown Notebook (.Rmd file) used to write the paper and containing the R code to generate the analyses in the paper.
2. Tutorial to download the Leipzig Corpus file used in the paper. It is freely available on the Leipzig Corpora Collection Download page.
3. Accompanying datasets as images and in .rds format so that all code chunks in the R Markdown file can be run.
4. BibLaTeX and .csl files for the referencing and bibliography (in APA 6th style).
5. A snippet of the R session info after running all the code in the R Markdown file.
6. RStudio project file (.Rproj). Double-click on this file to open an RStudio session associated with the content of this repository. See here and here for details on project-based workflow in RStudio.
7. A .docx template file following the basic stylesheet for Linguistik Indonesia.

Put all these files in the same folder (including the downloaded Leipzig corpus file)!

To render the R Markdown into an MS Word document, we use the bookdown R package (Xie, 2018). Make sure this package is installed in R.

Yihui Xie (2018). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.6.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A digital corpus on variation in German (1800-1950)

The German Innsbruck Corpus (GermInnC) 1800-1950 is a digitised corpus built after the fashion of the German Manchester Corpus (GerManC) 1650-1800 (cf. Scheible et al. 2011; Durrell et al. 2012). Hence, the corpus design of the GermInnC is balanced according to period, region and genre.

The GermInnC consists of ca. 840,000 tokens, ca. 120,000 per genre (seven in total: Drama, Humanities, Legal texts, Narrative prose, Newspapers, Scientific texts, Sermons). It is subdivided into three periods, 1800-1850, 1851-1900 and 1901-1950, as well as five regions: North German, West Central German, East Central German, West Upper German (including Switzerland), and East Upper German (including Austria).

The corpus can be retrieved as a raw version, a lemmatised, fully annotated version, or an "all data" file (including metadata annotation of file names and periods) for further import and processing. The Stuttgart Tag Set (STTS) and the POS tagger TreeTagger were used for linguistic annotation.

Two documentation files (Word and Excel, both included in the download package) provide a more detailed description of the corpus and the digitisation.

The corpus may be of interest to all scholars working on the history of the German language, standardisation of German, variation and change, historical sociolinguistics, and Germanic linguistics.

The corpus was generously funded by the early career funding of the University of Innsbruck (October 2018 through September 2019).
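As a minimal sketch of working with the lemmatised, fully annotated version, the snippet below counts STTS tag frequencies, assuming the annotation follows TreeTagger's standard three-column output (token, STTS tag, lemma); the actual file layout should be checked against the documentation files shipped with the corpus.

```python
# Sketch only: assumes TreeTagger-style tab-separated token/tag/lemma lines,
# and "germinnc_annotated.txt" is a placeholder file name.
from collections import Counter

def stts_tag_frequencies(path):
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 3:
                token, tag, lemma = cols
                counts[tag] += 1
    return counts

# Example: print(stts_tag_frequencies("germinnc_annotated.txt").most_common(10))
```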