Dataset description: The General Regionally Annotated Corpus of Ukrainian (GRAC, Shvedova et al. 2017-2024, uacorpus.org) was consulted to collect data for further analysis concerning the distribution of Singular vs. Plural verb forms in the target bahato construction. GRAC is a Sketch Engine corpus of over 1.8 billion words, representing texts from over 30,000 authors created between 1816 and 2023. This corpus is designed to serve as source material for linguistic research on Standard Ukrainian. Our data was collected during the month of February 2024. We extracted and annotated 28,491 examples of the bahato construction. An additional set of examples was collected from the Russian National Corpus (ruscorpora.ru) during the month of August 2024 to provide comparison with the Russian mnogo construction. For this purpose, 6,612 examples were extracted and annotated for word order and Singular vs. Plural verb agreement. Both the Ukrainian and the Russian data are included in this dataset, along with the R scripts used to analyze this data. Article abstract: We reveal an ongoing language change in Ukrainian involving a construction with a subject comprised of the indefinite quantifier багато ‘many’ modifying a noun phrase in the Genitive Plural. Number agreement on the verb varies, allowing both Singular (in 69.1% of attestations) and Plural (in 30.9% of attestations). Based on statistical analysis of corpus data, we investigate the influence of the factors of year of creation, word order of subject and verb, and animacy of the subject on the choice of verb number. We find that, while all combinations of word order and animacy are robustly attested, VS word order and inanimate subjects tend to prefer Singular, whereas SV word order and animate subjects tend to prefer Plural. Since about the 1950s, the proportion of Plural has been increasing, overtaking Singular in the current decade. We propose that this Singular vs. 
Plural variation is motivated by the human embodied experience of construing a group of items as either a homogeneous mass (and therefore Singular) or a multiplicity of individuals (and therefore Plural). This proposal is supported by the identification of micro-constructions that prefer Singular and show reduced individuation of human beings.
https://www.nist.gov/open/license
Fortran and Matlab programs, Matlab mex file of Fortran program, compiled mex file, and sample data files, etc. for computing a partial elastic shape registration of two simple surfaces in 3-dimensional space and the elastic shape distance between them corresponding to the partial registration.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects and is filtered to books that include Singular quadratic forms in perturbation theory. It features 10 columns, including authors, average publication date, book publishers, book subject, and books. The preview is ordered by number of books (descending).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The posterior means for singular values for the BGGE and BGGEE models as a function of the number of bilinear terms (k = 1,2,…, 7).
This is tomography data as acquired using a commercial X-ray tomography instrument. We obtained reconstructions of a graded-index optical fiber with voxels of edge length 1.05 µm at 12 tube voltages. The fiber manufacturer created a graded index in the central region by varying the germanium concentration from a peak value in the center of the core to a very small value at the core-cladding boundary. Operating on 12 tube voltages, we show by a singular value decomposition that there are only two singular vectors with significant weight. Physically, this means scans beyond two tube voltages contain largely redundant information. We concentrate on an analysis of the images associated with these two singular vectors. The first singular vector is dominant, and images of the coefficients of the first singular vector at each voxel look similar to any of the single-energy reconstructions. Images of the coefficients of the second singular vector by itself appear to be noise. However, by averaging the reconstructed voxels in each of several narrow bands of radii, we can obtain values of the second singular vector at each radius. In the core region, where we expect the germanium doping to go from a peak value at the fiber center to zero at the core-cladding boundary, we find that a plot of the two coefficients of the singular vectors forms a line in the two-dimensional space consistent with the dopant decreasing linearly with radial distance from the core center. The coating, made of a polymer rather than silica, is not on this line, indicating that the two-dimensional results are sensitive not only to the density but also to the elemental composition. A stack of reconstructions is given here as tiff files of individual slices. Each zip file corresponds to a tilt series at a given tube voltage, given in the file name. The power is also given in the file name. (For example, file “30kV-2W.zip” was acquired at a tube voltage of 30 kV and a power of 2 W.) 
The power was varied so that the signal-to-noise was approximately equal for the various reconstructions. The experiment is described in: ZH Levine, AP Peskin, EJ Garboczi, and AD Holmgren, Multi-Energy X-Ray Tomography of an Optical Fiber: The Role of Spatial Averaging, Microscopy and Microanalysis 25 (1) 70-76 (2019). https://doi.org/10.1017/S1431927618016136
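The singular value decomposition across tube voltages described above can be illustrated with a minimal sketch. It assumes the reconstructions have already been loaded and flattened into a matrix with one row per tube voltage and one column per voxel; the function name `energy_svd` and the synthetic two-component data are hypothetical stand-ins, not the authors' code or the actual TIFF stacks.

```python
import numpy as np

def energy_svd(recons):
    """Decompose a (n_voltages, n_voxels) matrix of reconstructions.

    Returns the voltage-space singular vectors, the singular values,
    the fraction of total squared weight carried by each singular
    vector, and the per-voxel coefficient of each singular vector.
    """
    u, s, vt = np.linalg.svd(recons, full_matrices=False)
    weights = s**2 / np.sum(s**2)
    coeffs = s[:, None] * vt  # row i: coefficient image of singular vector i
    return u, s, weights, coeffs

# Synthetic stand-in (assumption): 12 voltages, 500 voxels, two underlying
# components plus a small amount of noise.
rng = np.random.default_rng(0)
n_voltages, n_voxels = 12, 500
basis = rng.normal(size=(n_voltages, 2))
maps = rng.normal(size=(2, n_voxels))
recons = basis @ maps + 1e-3 * rng.normal(size=(n_voltages, n_voxels))

u, s, weights, coeffs = energy_svd(recons)
print(np.round(weights[:4], 6))
```

With data of this kind, nearly all of the weight falls on the first two singular vectors, which is the situation the description reports for the measured fiber. Radial averaging of `coeffs[1]` would be a separate step and is not shown.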
A growing body of work in psycholinguistics suggests that morphological relations between word forms affect the processing of complex words. Previous studies have usually focused on a particular type of paradigmatic relation, for example the relation between paradigm members, or the relation between alternative forms filling a particular paradigm cell. However, potential interactions between different types of paradigmatic relations have remained relatively unexplored. The data in this data set were used in two corpus studies of variable plurals in Dutch to test hypotheses about potentially interacting paradigmatic effects. The first study (which uses the s_dist data) shows that generalization across noun paradigms predicts the distribution of plural variants, and that this effect is diminished for paradigms in which the plural variants are more likely to have a strong representation in the mental lexicon. The second study (which uses the s_dur data) demonstrates that the pronunciation of a target plural variant is affected by coactivation of the alternative variant, resulting in shorter segmental durations. This effect is dependent on the representational strength of the alternative plural variant. In sum, the distributional and durational measurements in these data provide evidence that storage of morphologically complex words may affect the role of generalization and coactivation during production. A full description of the data gathering process and the analyses is given in the Methodology file. The Readme file describes how the remaining files relate to the research.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset accompanies a paper to be published in "Morphology" (JOMO, Springer). All data generated for this research and all scripts used are stored under the present DOI. The paper itself is not CC-licensed; refer to Springer's "Morphology" website for details.
Abstract
In this paper, we take a closer theoretical and empirical look at the linking elements in German N1+N2 compounds which are identical to the plural marker of N1 (such as -er with umlaut, as in Häus-er-meer 'sea of houses'). Various perspectives on the actual extent of plural interpretability of these pluralic linking elements are expressed in the literature. We aim to clarify this question by empirically examining to what extent there may be a relationship between plural form and meaning which informs in which sorts of compounds pluralic linking elements appear. Specifically, we investigate whether pluralic linking elements occur especially frequently in compounds where a plural meaning of the first constituent is induced either externally (through plural inflection of the entire compound) or internally (through a relation between the constituents such that N2 forces N1 to be conceptually plural, as in the example above). The results of a corpus study using the DECOW16A corpus and a split-100 experiment show that in the internal but not external plural meaning conditions, a pluralic linking element is preferred over a non-pluralic one, though there is considerable inter-speaker variability, and limitations imposed by other constraints on linking element distribution also play a role. However, we show the overall tendency that German language users do use pluralic linking elements as cues to the plural interpretation of N1+N2 compounds. Our interpretation does not reference a specific morphological framework. Instead, we view our data as strengthening the general approach of probabilistic morphology.
A set of recorded isolated nouns, verbs and image annotations used for testing the word recognition performance of our speech2image model.
We trained a word recognition model on a set of images and utterances. The model should learn to recognise words without ever having seen written transcripts. The word recognition performance is measured as the number of retrieved images out of 10 displaying the correct visual referent.
We furthermore collected new ground truth object and action annotations for the Flickr8k test images for this purpose. These consist of 1000 images, all annotated for the presence of the 50 actions and objects corresponding to the test verbs and nouns.
In order to test the word recognition performance, we took the 50 most common nouns and 50 most common verbs in the training data and confirmed that there were at least 10 images in our test image data that displayed these actions and objects. These nouns and verbs were recorded in singular and plural form (nouns) and in root, third person and progressive form (verbs). We furthermore annotated 1000 images from the Flickr8k test set for the presence of these nouns and verbs. These annotations are included in .CSV format.
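The retrieval measure described above (the number of retrieved images out of 10 that display the correct visual referent) can be sketched as follows. The function name `hits_at_k`, the image identifiers, and the annotation set are hypothetical; the sketch assumes the model returns a ranked list of image IDs for a spoken word and that the annotations give the set of images containing that word's referent.

```python
def hits_at_k(ranked_image_ids, images_with_referent, k=10):
    """Count how many of the top-k retrieved images show the referent."""
    top_k = ranked_image_ids[:k]
    return sum(1 for image_id in top_k if image_id in images_with_referent)

# Illustrative (made-up) ranking of 20 images for one query word.
ranked = [f"img{i}" for i in range(20)]
# Illustrative (made-up) annotation: images that actually show the referent.
relevant = {"img0", "img2", "img5", "img11"}

score = hits_at_k(ranked, relevant)
print(score)  # 3: img0, img2 and img5 fall within the top 10, img11 does not
```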
Table of contents: 1. Frequency of early concepts; 2. Frequency of additional concepts; 3. Use of any early concept; 4. Use of any additional concept; 5. Planning steps; 6. Protocol. The present dataset is part of the published scientific paper entitled “Landscape ecological concepts in planning: review of recent developments” (Hersperger et al., 2021). The goal of this research was to review recent publications to assess the use of landscape ecological concepts in planning. Specifically, we address the following research questions: Q1. Landscape ecological concepts: What are they? How frequently are they mentioned in current research? Q2. How are landscape ecological concepts integrated in landscape planning? We analysed all empirical and overview papers that were published in four key academic journals in the field of landscape ecology and landscape planning in the years 2015–2019 (n = 1918). The four journals selected for the analysis were Landscape Ecology (LE), Landscape Online (LO), Current Landscape Ecology Reports (CLER), and Landscape and Urban Planning (LUP). The title, abstract and keywords of all papers were read in order to identify landscape ecological concepts. Then, all 1918 papers went through a keyword search to identify the use of early and additional concepts. We used the “pdfsearch” package in the R programming language and searched for singular and plural forms and different variations of the concepts (see Supplementary material 1, Table A). As a result, we provided four outputs: 1. Frequency of early concepts. This data provides the total number of times each article used each early concept (Q1). This data was used to produce Figure 2a in the original publication. 2. Frequency of additional concepts. This data provides the total number of times each article used each additional concept (Q1). This data was used to produce Figure 2b in the original publication. 3. Use of any early concept. This data provides the total number of times each article used any early concept (Q1). This data was used to produce Figure 3a in the original publication. 4. Use of any additional concept. This data provides the total number of times each article used any additional concept (Q1). This data was used to produce Figure 3b in the original publication. To address the second question (Q2), the title, abstract and keywords of the papers included in our sample (n = 1918 articles) were screened to identify papers that might show how landscape ecological concepts are integrated into planning. We selected 52 empirical papers (see Supplementary material – 4 Integration of landscape ecological concepts into planning), and we provided two outputs: 5. Planning steps. This data provides the number of times landscape ecological concepts were addressed in each planning step in the 52 empirical papers analysed in detail (Q2). This data was used to produce Figure 4 in the original publication. 6. Protocol for assessing the integration of landscape ecological concepts into planning. To systematically collect the data, we used this protocol, which addressed the following questions: (a) which type of planning is addressed by the paper? (b) to which planning level does the paper refer? (c) which concepts are integrated in any of the planning steps described above?
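The keyword search for singular and plural forms can be illustrated with a short sketch. The original work used the “pdfsearch” package in R; the version below is a plain regular-expression stand-in in Python, not that package's API. It assumes the article text has already been extracted to a string, handles only regular plurals in -s or -es, and uses concept names chosen purely for illustration.

```python
import re

def concept_pattern(concept):
    """Build a case-insensitive pattern matching a concept and its regular plural."""
    return re.compile(r"\b" + re.escape(concept) + r"(?:e?s)?\b", re.IGNORECASE)

def count_concepts(text, concepts):
    """Return the number of mentions of each concept in one article's text."""
    return {c: len(concept_pattern(c).findall(text)) for c in concepts}

# Illustrative text and concept list (assumptions, not taken from the dataset).
text = ("Corridors connect habitat patches. "
        "A corridor and a patch matter; the matrix too.")
counts = count_concepts(text, ["corridor", "patch", "matrix"])
print(counts)  # {'corridor': 2, 'patch': 2, 'matrix': 1}
```

Irregular plurals and spelling variants would need explicit alternatives in the pattern, which is presumably what the variation list in the supplementary material covers.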
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The inflectional data lookup module serves as an optional component within the cordex library (https://github.com/clarinsi/cordex/) that significantly improves the quality of the results. The module consists of a pickled dictionary of 111,660 lemmas, and maps these lemmas to their corresponding word forms.
Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic descriptions, relevant features (custom features extracted from morphosyntactic descriptions with the help of https://gitea.cjvt.si/generic/conversion_utils), and its frequency within the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320), or Gigafida 1.0 when other information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g. find the most frequently used word form of the lemma "centralen" in the singular, i.e. "centralni").
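The selection step described above (pick the most frequent word form of a lemma that satisfies some filtering conditions) might look roughly like the sketch below. The layout of the pickled dictionary is not documented here, so the structure used (lemma mapped to a list of entries with `form`, `features` and `frequency` keys), the function name, and the frequency values are all assumptions made for illustration only.

```python
def most_frequent_form(lexicon, lemma, **required_features):
    """Return the most frequent form of `lemma` whose features match all filters."""
    candidates = [
        entry for entry in lexicon.get(lemma, [])
        if all(entry["features"].get(name) == value
               for name, value in required_features.items())
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry["frequency"])["form"]

# Toy lexicon with invented frequencies (not values from Gigafida).
lexicon = {
    "centralen": [
        {"form": "centralni", "features": {"number": "singular"}, "frequency": 900},
        {"form": "centralen", "features": {"number": "singular"}, "frequency": 100},
        {"form": "centralnih", "features": {"number": "plural"}, "frequency": 500},
    ]
}

print(most_frequent_form(lexicon, "centralen", number="singular"))  # centralni
```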
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This site contains the widefield imaging datasets from the publication Cortical State Fluctuations during Sensory Decision Making, by Jacobs et al. in Current Biology. This data is from the behavioural tasks described in the publication, and is in a compressed SVD format (see Methods in the publication for more details). The companion code is designed to take the data in this format. The datasets provided here contain the top 500 singular values, which is how the data in the publication was analysed, as this was found to sufficiently capture the data. The data containing up to 2000 singular values can be shared on request. The timestamps of the datasets here are not all aligned with the behavioural datasets; the companion code takes care of this. The data is organised by experimental subject; most subjects were recorded from on multiple days, which form subfolders within the subject folder. Within a day, there may have been several experiments, which again form subfolders within the day folder. The companion code expects this data organisation. The companion code is available at: https://github.com/eakjacobs/Jacobs_et_al_CurrentBiology For more information and links to the behavioural and pupil datasets, please follow this link: https://doi.org/10.6084/m9.figshare.13084805 The research article can be found (freely available) at https://www.cell.com/current-biology/fulltext/S0960-9822(20)31437-8
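To illustrate what a compressed SVD format implies for a user, the sketch below reconstructs image frames from truncated spatial and temporal components. It is a generic example, not the companion code: the array names `U` and `SVT`, their shapes, and the helper `reconstruct_frames` are assumptions, and the actual file layout should be taken from the publication's Methods and the repository.

```python
import numpy as np

def reconstruct_frames(U, SVT, frame_idx, n_components=500):
    """Rebuild frames from truncated SVD components.

    U   : (height, width, n_comp) spatial components
    SVT : (n_comp, n_frames) temporal components scaled by singular values
    """
    k = min(n_components, U.shape[2], SVT.shape[0])
    return np.tensordot(U[:, :, :k], SVT[:k, frame_idx], axes=([2], [0]))

# Synthetic rank-5 "movie" standing in for real widefield data (assumption).
rng = np.random.default_rng(1)
h, w, t, rank = 8, 6, 40, 5
movie = rng.normal(size=(h * w, rank)) @ rng.normal(size=(rank, t))

# Compress: keep only the leading components.
u, s, vt = np.linalg.svd(movie, full_matrices=False)
U = u[:, :rank].reshape(h, w, rank)
SVT = s[:rank, None] * vt[:rank]

frames = reconstruct_frames(U, SVT, np.arange(t))
print(frames.shape)  # (8, 6, 40)
```

Storing only the leading components keeps the files small, and many analyses can be run directly on the temporal components without rebuilding full frames.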
This archive contains a unique collection of naturalistic child language data collected between 2017 and 2020 in Southern Senegal. The deposit contains ELAN files of annotated data based on recordings of children's production and child-directed speech in naturalistic settings. The language under investigation is Eegimaa, a Jóola language of southern Senegal, which belongs to the Atlantic branch of the Niger‑Congo phylum. The data was collected as part of a research project which investigates the acquisition of an Atlantic noun class system. Our research looks at the factors underlying children’s learning of nominal class prefixes and syntactic and semantic agreement at the level of the NP.
We focus on questions including the following.
• Which elements of noun class morphology do children begin to use productively?
• What is the role of input frequency, morphological salience, and transparency in children's acquisition of noun class and agreement in Eegimaa?
• Are errors in the production of nominal class prefixes also reflected in children’s use of the corresponding agreement markers?
Theoretical accounts of the strategies used by children to learn the structures of words and grammatical features of languages differ considerably, but our knowledge of what is possible is limited by the existing focus on a relatively small number of languages associated with industrialised nations. Here, we will investigate grammatical features and structures that may be expressed in a variety of different ways. Examples of grammatical features include number, e.g. the distinction between singular and plural, or gender, e.g. distinguishing masculine and feminine in languages like French, features expressed within the shape of the word and associated items. Grammatical structure may be manifested in agreement across the separate words of a noun phrase (e.g. The cat purrs, where the -s on 'purrs' shows agreement with cat, indicating that there is only one cat.) This project investigates the acquisition of inflectional morphology, i.e., grammatical features and structures as reflected in the word forms and associated agreement, in Gújjolaay Eegimaa, a language of the Atlantic family of the Niger Congo phylum spoken in Southern Senegal. This language has a gender system of the type traditionally known as a noun class system. Noun class systems with complex gender agreement are characteristic of the Niger-Congo languages. In Eegimaa nouns use prefixes to form singular and plural. For example ba- is the singular marker for ba-ginh 'chest', but its plural marker is u- as in u-ginh 'chests'. Nouns which have the same singular prefix, e.g. ba-, can form their plural with a different marker (e.g., bá-jur 'young woman', plural sú-jur 'young women'). Eegimaa has a complex morphological system of gender and number marking which is also reflected in its agreement system. Current knowledge as to how children acquire gender/noun class marking and agreement is based entirely on the Bantu languages of the Niger Congo family. 
There are no studies available of Atlantic languages, which, though similar to Bantu in some ways, also have important differences. Here we will investigate the influence of the three factors found to affect children's acquisition of noun class morphology and agreement, namely: i) Input frequency, according to which the forms that children hear the most will tend to be acquired first ii) Perceptual salience, according to which more salient forms such as stressed syllables will tend to be acquired first, and iii) Morphological transparency, according to which forms whose meanings are easily determined will tend to be acquired more easily than those whose meanings are more obscure. Our study will build on findings on the acquisition of Bantu noun class systems, and will aim to answer questions such as the following. What strategies do children rely on to learn complex language structure? What is the role of adult input language in the acquisition of morphology and agreement in Eegimaa? How do children cope with variation in language input from their caregivers? In what order do they learn the different noun class markers? We will carry out a longitudinal study in which we will observe over three years the interactions of five children aged from about 2 to 4 years with their caregivers. Among Eegimaa speakers, caregivers include children's parents, older siblings and other members of the community. Children's daytime activities mostly take place outside their homes. We will record children's output speech on audio and video and compare the data with child-directed speech from adults and with adult-directed speech (interactions between adults), collected as part of a previous project. We will also carry out a cross-sectional study by twice observing the speech of ten additional children at two points, at ages 3 and 4 years. These studies together will provide both an in-depth look and a broader overview of the...
https://dataverse.no/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18710/5KCE4U
Dataset description: This dataset, which is adapted from Jenset and McGillivray (2017), contains tabular files documenting the alternating usage of -(e)th and -(e)s to mark third-person verb inflection in Early Modern English. The data provided by Jenset and McGillivray (2017) are drawn from the PPCEME corpus (Kroch et al. 2004) and cover the period from 1500 to 1700. In total, 13,757 third-person singular tokens (excluding the verb BE) were annotated by these authors for a range of variables. For the purposes of the present methodological study, this dataset was reduced to a subset of 11,645 tokens, and the coding of variables was in some parts revised, completed, or modified. The dataset includes information about the Author and Verb Lemma, as well as a number of predictor variables, including Genre, Year, Frequency (of the verb lemma in the third-person singular), Phonological Context (stem-final sound), and the Gender of the author. Abstract for related publication: Resource constraints often force researchers to down-size the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: Year, Gender, Genre, Frequency, and Phonological Context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 sub-samples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria.
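The two down-sampling strategies in the abstract can be contrasted in a small sketch. The abstract does not spell out how the author- and verb-specific counts are combined, so the weighting below (one over the product of the author's token count and the verb's token count) is only one possible reading, labeled here as an assumption; the function names and toy tokens are hypothetical as well.

```python
from collections import Counter

import numpy as np

def simple_downsample(n_tokens, size, rng):
    """Every token has the same selection probability."""
    return rng.choice(n_tokens, size=size, replace=False)

def structured_downsample(authors, verbs, size, rng):
    """Tokens from prolific authors and frequent verbs are less likely to be drawn.

    Assumption: weight = 1 / (author token count * verb token count).
    """
    author_count = Counter(authors)
    verb_count = Counter(verbs)
    weights = np.array([1.0 / (author_count[a] * verb_count[v])
                        for a, v in zip(authors, verbs)])
    probs = weights / weights.sum()
    return rng.choice(len(authors), size=size, replace=False, p=probs)

# Toy token list (made up): one prolific author, one sparse author.
authors = ["A"] * 90 + ["B"] * 10
verbs = (["say"] * 60 + ["go"] * 30) + (["say"] * 5 + ["go"] * 5)

rng = np.random.default_rng(42)
simple_idx = simple_downsample(len(authors), 20, rng)
structured_idx = structured_downsample(authors, verbs, 20, rng)
print(sorted(structured_idx))
```

Repeating either draw many times with different seeds, as the study does with 500 sub-samples per scheme, allows the variability of downstream regression estimates to be compared.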
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This entry consists of XML files with 96,290 lexical units (nouns, verbs, adjectives, and adverbs) from the Sloleks Morphological Lexicon of Slovene 2.0 (http://hdl.handle.net/11356/1230) that include codes for morphological patterns.
The pattern codes were designed based on a manual analysis of automatically extracted paradigms and were obtained as follows: The lexical units from Sloleks 2.0 were first automatically clustered into groups through a rule-based approach based on (1) a number of predetermined grammatical features from the MULTEXT-East Version 6 morphosyntactic specifications for Slovenian (http://nl.ijs.si/ME/V6/), such as part of speech, gender and properness for nouns, aspect for verbs, and (2) the differentiating characteristics of their morphological paradigms (i.e. their mutable word parts, which are similar to but not always overlapping with the linguistic definition of word endings – for example: čas-Ø; čas-a; čas-om / prijatelj- Ø; prijatelj-a; prijatelj-em / odstot-ek; odstot-ka; odstot-kom).
More than 1,000 automatically extracted pattern candidates were subsequently linguistically analyzed, combined into groups, and hierarchically organized. As a result, every lexical unit in the XML file features a code (listed as
Because the patterns were extracted from Sloleks 2.0, they reflect the decisions that were implemented in its initial compilation, particularly in terms of the degree of morphological variation documented in the lexicon (e.g. not all morphological variants are necessarily included in the lexicon) and paradigm integrity (for instance, some nouns in Sloleks 2.0 only feature singular or plural forms). It should be noted that non-standard word forms were not included in the design of the patterns. In addition, the XML file does not contain lexical units from Sloleks 2.0 that consist of word forms from more than one morphological paradigm (e.g. lesketati – lesketam / leskečem; or lojen – lojenega / lojnega), or other problematic units (such as those with missing or erroneous data).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.
Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.
[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.
[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006
Diachronica models
Training data
Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:
Classical subcorpus
Hellenistic subcorpus
Whole corpus
Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenestic is appended to the name of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, full_ for the whole corpus).
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
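As a rough guide to what the count-based hyperparameters control, the sketch below computes a PPMI matrix and a truncated SVD embedding with NumPy. It is not the LSCDetection code; it assumes the common convention in which k is a shift (log k is subtracted from the PMI), alpha smooths the context distribution, and gamma is the exponent applied to the singular values. The function names and the toy co-occurrence counts are made up.

```python
import numpy as np

def ppmi(counts, k=1, alpha=0.75):
    """Shifted PPMI with context-distribution smoothing (assumed convention)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    c_alpha = counts.sum(axis=0, keepdims=True) ** alpha
    p_c = c_alpha / c_alpha.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c)) - np.log(k)
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts contribute nothing
    return np.maximum(pmi, 0.0)

def svd_embed(matrix, dimensions=300, gamma=0.0):
    """Truncated SVD embedding; gamma=0 drops the singular values entirely."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    d = min(dimensions, len(s))
    return u[:, :d] * (s[:d] ** gamma)

# Toy word-by-context co-occurrence counts (made up).
counts = [[10, 0, 2, 1],
          [0, 8, 1, 3],
          [2, 1, 6, 0],
          [1, 3, 0, 9]]
M = ppmi(counts, k=1, alpha=0.75)
emb = svd_embed(M, dimensions=2, gamma=0.0)
print(emb.shape)  # (4, 2)
```

With gamma=0.0 the embedding consists of the left singular vectors alone, so its columns are orthonormal; larger gamma values weight the dimensions by their singular values.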
Word2Vec
Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.
Syntactic word embeddings
Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.
ALP models
Training data
Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.
Word2Vec
Software used: Gensim library (Řehůřek and Sojka, 2010)
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.
References
Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.
Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.
Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).
Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.
Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.
Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.
Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.
Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3(1). 55-65. https://doi.org/10.1163/24523666-01000013
Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The package implements nonparametric (smooth) regression for spherical data in R, and is freely available from the Comprehensive R Archive Network (CRAN), licensed under the MIT License. It can be used for regression when both the response and the explanatory variables lie on the unit sphere. The model uses a flexible kernel-type regression determined by a rotation that depends on a smoothing parameter as well as on the prediction point. A particular kernel is proposed, and a smoothing parameter selection procedure is also provided. Finally, some examples are included in the package.
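To make the idea of kernel-type regression between spheres concrete, here is a minimal sketch of a much simpler technique: a kernel-weighted mean direction. It is not the rotation-based model the package implements, and the function names, the exponential kernel, and the parameter `kappa` are all assumptions chosen for illustration.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (fails if the vector is all zeros)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def kernel_smooth_sphere(x0, xs, ys, kappa):
    """Predict a unit-vector response at x0 as a kernel-weighted mean direction.

    xs, ys: training predictors and responses, all unit vectors.
    kappa:  smoothing parameter; larger values weight nearby points more heavily.
    """
    total = [0.0] * len(ys[0])
    for x, y in zip(xs, ys):
        # Weight decays as the angle between x0 and x grows (x0 . x falls below 1).
        dot = sum(a * b for a, b in zip(x0, x))
        w = math.exp(kappa * (dot - 1.0))
        total = [t + w * c for t, c in zip(total, y)]
    # Project the weighted sum back onto the unit sphere.
    return _normalize(total)
```

The prediction stays on the sphere because the weighted sum is renormalized. Choosing `kappa` from data, for example by cross-validation, would correspond to the smoothing parameter selection step mentioned above.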
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Chiral medium-sized rings, albeit displaying attractive properties for drug development, suffer from numerous synthetic challenges due to difficult cyclization steps that must take place to form these unusually strained, atropisomeric rings from sterically crowded precursors. In fact, catalytic enantioselective cyclization methods for the formation of chiral seven-membered rings are unknown, and the corresponding eight-membered variants are also sparse. In this work, we present a substrate preorganization-based, enantioselective, organocatalytic strategy to construct seven- and eight-membered rings featuring chirality that is intrinsic to the ring in the absence of singular stereogenic atoms or single bond axes of chirality. The reactions proceed under mild conditions and with high levels of stereocontrol. Notably, the same bifunctional iminophosphorane chiral catalyst orchestrates the cyclization of substrates of two different ring sizes, under two different mechanistic paradigms. We envision that the mechanistic and ring size versatility of this method could guide further applications of asymmetric catalysis to other challenging cyclization reactions.
https://www.educacionyfp.gob.es/comunes/aviso-legal.html
This section presents the main results of the Statistical Exploitation of the National Census of Sports Facilities 2005. The project, prepared by the Consejo Superior de Deportes, relies on the collaboration of the competent bodies of the autonomous communities and cities.