18 datasets found

Replication Data for: Understanding ‘many’ through the lens of Ukrainian...
search-demo.dataone.org
dataverse.no
Updated Sep 25, 2024
Cite
Janda, Laura Alexis (2024). Replication Data for: Understanding ‘many’ through the lens of Ukrainian багато [Dataset]. http://doi.org/10.18710/Y7VGQE
Unique identifier
https://doi.org/10.18710/Y7VGQE
Dataset updated
Sep 25, 2024
Dataset provided by
DataverseNO
Authors
Janda, Laura Alexis
Time period covered
Jan 1, 1742 - Jan 1, 2023
Description
Dataset description: The General Regionally Annotated Corpus of Ukrainian (GRAC, Shvedova et al. 2017-2024, uacorpus.org) was consulted to collect data for further analysis concerning the distribution of Singular vs. Plural verb forms in the target bahato construction. GRAC is a Sketch Engine corpus of over 1.8 billion words, representing texts from over 30,000 authors created between 1816 and 2023. This corpus is designed to serve as source material for linguistic research on Standard Ukrainian. Our data was collected during the month of February 2024. We extracted and annotated 28,491 examples of the bahato construction. An additional set of examples was collected from the Russian National Corpus (ruscorpora.ru) during the month of August 2024 to provide comparison with the Russian mnogo construction. For this purpose, 6,612 examples were extracted and annotated for word order and Singular vs. Plural verb agreement. Both the Ukrainian and the Russian data are included in this dataset, along with the R scripts used to analyze this data.
Article abstract: We reveal an ongoing language change in Ukrainian involving a construction with a subject comprised of the indefinite quantifier багато ‘many’ modifying a noun phrase in the Genitive Plural. Number agreement on the verb varies, allowing both Singular (in 69.1% of attestations) and Plural (in 30.9% of attestations). Based on statistical analysis of corpus data, we investigate the influence of the factors of year of creation, word order of subject and verb, and animacy of the subject on the choice of verb number. We find that, while all combinations of word order and animacy are robustly attested, VS word order and inanimate subjects tend to prefer Singular, whereas SV word order and animate subjects tend to prefer Plural. Since about the 1950s, the proportion of Plural has been increasing, overtaking Singular in the current decade. We propose that this Singular vs. Plural variation is motivated by the human embodied experience of construing a group of items as either a homogeneous mass (and therefore Singular) or a multiplicity of individuals (and therefore Plural). This proposal is supported by the identification of micro-constructions that prefer Singular and show reduced individuation of human beings.
Computing a partial elastic shape registration of 3D surfaces using dynamic...
data.nist.gov
catalog.data.gov
Updated Oct 30, 2023
Cite
National Institute of Standards and Technology (2023). Computing a partial elastic shape registration of 3D surfaces using dynamic programming [Dataset]. http://doi.org/10.18434/mds2-3056
Unique identifier
https://doi.org/10.18434/mds2-3056, https://identifiers.org/ark:/88434/mds2-3056
Dataset updated
Oct 30, 2023
Dataset provided by
National Institute of Standards and Technology (http://www.nist.gov/)
License
https://www.nist.gov/open/license
Description
Fortran and Matlab programs, Matlab mex file of Fortran program, compiled mex file, and sample data files, etc. for computing a partial elastic shape registration of two simple surfaces in 3-dimensional space and the elastic shape distance between them corresponding to the partial registration.
Book subjects where books includes Singular quadratic forms in perturbation...
workwithdata.com
Cite
Work With Data, Book subjects where books includes Singular quadratic forms in perturbation theory [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=includes&fval0=Singular+quadratic+forms+in+perturbation+theory&j=1&j0=books
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about book subjects, filtered to those whose books include Singular quadratic forms in perturbation theory. It features 10 columns, including authors, average publication date, book publishers, book subject, and books. The preview is ordered by number of books (descending).
The posterior means for singular values for the BGGE and BGGEE models as a...
plos.figshare.com
xls
Updated Jun 9, 2023
Cite
Luciano Antonio de Oliveira; Carlos Pereira da Silva; Alessandra Querino da Silva; Cristian Tiago Erazo Mendes; Joel Jorge Nuvunga; Joel Augusto Muniz; Júlio Sílvio de Sousa Bueno Filho; Marcio Balestre (2023). The posterior means for singular values for the BGGE and BGGEE models as a function of the number of bilinear terms (k = 1,2,…, 7). [Dataset]. http://doi.org/10.1371/journal.pone.0256882.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0256882.t001
Dataset updated
Jun 9, 2023
Dataset provided by
PLOS ONE
Authors
Luciano Antonio de Oliveira; Carlos Pereira da Silva; Alessandra Querino da Silva; Cristian Tiago Erazo Mendes; Joel Jorge Nuvunga; Joel Augusto Muniz; Júlio Sílvio de Sousa Bueno Filho; Marcio Balestre
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The posterior means for singular values for the BGGE and BGGEE models as a function of the number of bilinear terms (k = 1,2,…, 7).
GI625 optical fiber data imaged on a Zeiss Versa XRM-500 microCT at 12 tube...
catalog.data.gov
data.nist.gov
Updated Jul 29, 2022
Cite
GI625 optical fiber data imaged on a Zeiss Versa XRM-500 microCT at 12 tube voltages [Dataset]. https://catalog.data.gov/dataset/gi625-optical-fiber-data-imaged-on-a-zeiss-versa-xrm-500-microct-at-12-tube-voltages-c67ab
Dataset updated
Jul 29, 2022
Dataset provided by
National Institute of Standards and Technology (http://www.nist.gov/)
Description
This is tomography data acquired using a commercial X-ray tomography instrument. We obtained reconstructions of a graded-index optical fiber with voxels of edge length 1.05 µm at 12 tube voltages. The fiber manufacturer created a graded index in the central region by varying the germanium concentration from a peak value in the center of the core to a very small value at the core-cladding boundary. Operating on 12 tube voltages, we show by a singular value decomposition that there are only two singular vectors with significant weight. Physically, this means scans beyond two tube voltages contain largely redundant information. We concentrate on an analysis of the images associated with these two singular vectors. The first singular vector is dominant, and images of the coefficients of the first singular vector at each voxel look similar to any of the single-energy reconstructions. Taken by themselves, images of the coefficients of the second singular vector appear to be noise. However, by averaging the reconstructed voxels in each of several narrow bands of radii, we can obtain values of the second singular vector at each radius. In the core region, where we expect the germanium doping to go from a peak value at the fiber center to zero at the core-cladding boundary, we find that a plot of the two coefficients of the singular vectors forms a line in the two-dimensional space consistent with the dopant decreasing linearly with radial distance from the core center. The coating, made of a polymer rather than silica, is not on this line, indicating that the two-dimensional results are sensitive not only to the density but also to the elemental composition.
A stack of reconstructions is given here as tiff files of individual slices. Each zip file corresponds to a tilt series at a given tube voltage, given in the file name. The power is also given in the file name. (For example, the file “30kV-2W.zip” was acquired at a tube voltage of 30 kV and a power of 2 W.) The power was varied so that the signal-to-noise was approximately equal for the various reconstructions.
The experiment is described in: ZH Levine, AP Peskin, EJ Garboczi, and AD Holmgren, Multi-Energy X-Ray Tomography of an Optical Fiber: The Role of Spatial Averaging, Microscopy and Microanalysis 25 (1) 70-76 (2019). https://doi.org/10.1017/S1431927618016136
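The singular value decomposition step described above can be illustrated with a small Python sketch. It builds a synthetic voxel-by-voltage matrix from two underlying components plus weak noise and counts how many singular values carry significant weight. The synthetic data, the array shapes, the 1% threshold, and the function name `singular_value_weights` are all assumptions made for illustration; none of them come from the dataset or the paper.

```python
import numpy as np

def singular_value_weights(data):
    """Return the singular values of a (voxels x voltages) matrix, normalized to sum to 1."""
    s = np.linalg.svd(data, compute_uv=False)
    return s / s.sum()

rng = np.random.default_rng(0)
n_voxels, n_voltages = 1000, 12  # 12 matches the number of tube voltages; 1000 is arbitrary

# Synthetic stand-in: two per-voxel coefficients mixed by two voltage-dependent profiles
coeffs = rng.normal(size=(n_voxels, 2))
profiles = rng.normal(size=(2, n_voltages))
data = coeffs @ profiles + 1e-3 * rng.normal(size=(n_voxels, n_voltages))

weights = singular_value_weights(data)
# Hypothetical significance rule: a component matters if it holds more than 1% of the weight
n_significant = int(np.sum(weights > 0.01))
print(n_significant)
```

With real reconstructions, each row would hold one voxel's values across the 12 tube voltages, and the choice of significance threshold would need to be justified separately.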
Data from: Production of Dutch variable plurals in language corpora
datacatalogue.cessda.eu
ssh.datastations.nl
Updated Apr 11, 2023
Cite
T.J. Zee; L.F.M. ten Bosch; I. Plag; M.T.C. Ernestus (2023). Production of Dutch variable plurals in language corpora [Dataset]. http://doi.org/10.17026/dans-xvr-qscf
Unique identifier
https://doi.org/10.17026/dans-xvr-qscf
Dataset updated
Apr 11, 2023
Dataset provided by
Radboud University
Authors
T.J. Zee; L.F.M. ten Bosch; I. Plag; M.T.C. Ernestus
Description
A growing body of work in psycholinguistics suggests that morphological relations between word forms affect the processing of complex words. Previous studies have usually focused on a particular type of paradigmatic relation, for example the relation between paradigm members, or the relation between alternative forms filling a particular paradigm cell. However, potential interactions between different types of paradigmatic relations have remained relatively unexplored. The data in this data set were used in two corpus studies of variable plurals in Dutch to test hypotheses about potentially interacting paradigmatic effects. The first study (which uses the s_dist data) shows that generalization across noun paradigms predicts the distribution of plural variants, and that this effect is diminished for paradigms in which the plural variants are more likely to have a strong representation in the mental lexicon. The second study (which uses the s_dur data) demonstrates that the pronunciation of a target plural variant is affected by coactivation of the alternative variant, resulting in shorter segmental durations. This effect is dependent on the representational strength of the alternative plural variant. In sum, the distributional and durational measurements in these data provide evidence that storage of morphologically complex words may affect the role of generalization and coactivation during production. A full description of the data gathering process and the analyses is given in the Methodology file. The Readme file describes how the remaining files relate to the research.
Dataset: The plural interpretability of German linking elements...
data.niaid.nih.gov
live.european-language-grid.eu
Updated Jan 24, 2020
Cite
Schäfer, Roland (2020). Dataset: The plural interpretability of German linking elements ("Morphology") [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1322790
Dataset updated
Jan 24, 2020
Dataset provided by
Pankratz, Elizabeth
Schäfer, Roland
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset accompanies a paper to be published in "Morphology" (JOMO, Springer). Under the present DOI, all data generated for this research as well as all scripts used are stored. The paper itself is not CC-licensed, refer to Springer's "Morphology" website for details!

Abstract

In this paper, we take a closer theoretical and empirical look at the linking elements in German N1+N2 compounds which are identical to the plural marker of N1 (such as -er with umlaut, as in Häus-er-meer 'sea of houses'). Various perspectives on the actual extent of plural interpretability of these pluralic linking elements are expressed in the literature. We aim to clarify this question by empirically examining to what extent there may be a relationship between plural form and meaning which informs in which sorts of compounds pluralic linking elements appear. Specifically, we investigate whether pluralic linking elements occur especially frequently in compounds where a plural meaning of the first constituent is induced either externally (through plural inflection of the entire compound) or internally (through a relation between the constituents such that N2 forces N1 to be conceptually plural, as in the example above). The results of a corpus study using the DECOW16A corpus and a split-100 experiment show that in the internal but not external plural meaning conditions, a pluralic linking element is preferred over a non-pluralic one, though there is considerable inter-speaker variability, and limitations imposed by other constraints on linking element distribution also play a role. However, we show the overall tendency that German language users do use pluralic linking elements as cues to the plural interpretation of N1+N2 compounds. Our interpretation does not reference a specific morphological framework. Instead, we view our data as strengthening the general approach of probabilistic morphology.
Modelling word learning and recognition using visually grounded speech
datacatalogue.cessda.eu
ssh.datastations.nl
Updated Jan 17, 2025
Cite
D.G.M. Merkx; S.L. Frank; O.E. Scharenborg; M.T.C. Ernestus; S. Scholten (2025). Modelling word learning and recognition using visually grounded speech [Dataset]. http://doi.org/10.17026/dans-22n-xh47
Unique identifier
https://doi.org/10.17026/dans-22n-xh47
Dataset updated
Jan 17, 2025
Dataset provided by
Radboud University
Authors
D.G.M. Merkx; S.L. Frank; O.E. Scharenborg; M.T.C. Ernestus; S. Scholten
Description
A set of recorded isolated nouns, verbs and image annotations used for testing the word recognition performance of our speech2image model.
We trained a word recognition model on a set of images and utterances. The model should learn to recognise words without ever having seen written transcripts. The word recognition performance is measured as the number of retrieved images out of 10 displaying the correct visual referent.
We furthermore collected new ground truth object and action annotations for the Flickr8k test images for this purpose. This consists of 1000 images, all annotated for the presence of the 50 actions and objects corresponding to the test verbs and nouns.
In order to test the word recognition performance, we took the 50 most common nouns and 50 most common verbs in the training data and confirmed that there were at least 10 images in our test image data that displayed these actions and objects. These nouns and verbs were recorded in singular and plural form (nouns) and in root, third person and progressive form (verbs). We furthermore annotated 1000 images from the Flickr8k test set for the presence of these nouns and verbs. These annotations are included in .CSV format.
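The performance measure described above (the number of retrieved images out of 10 that display the correct visual referent) could be computed along the following lines. This is a minimal Python sketch: the function name `hits_at_10`, the annotation layout, and the example ranking are hypothetical and do not reflect the actual format of the deposited .CSV files or the speech2image model's output.

```python
def hits_at_10(ranked_image_ids, annotations, word):
    """Count how many of the 10 top-ranked images are annotated with `word`."""
    top10 = ranked_image_ids[:10]
    return sum(1 for image_id in top10 if word in annotations.get(image_id, set()))

# Hypothetical annotations: image id -> set of annotated nouns and verbs
annotations = {
    "img1": {"dog", "run"},
    "img2": {"dog"},
    "img3": {"ball"},
    "img4": {"dog", "jump"},
}

# Hypothetical ranking returned by a model for the spoken word "dog"
ranking = ["img2", "img3", "img1", "img4"]
score = hits_at_10(ranking, annotations, "dog")
print(score)  # 3
```

The requirement that each test word has at least 10 matching test images means a perfect score of 10 is attainable for every word.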
Data from: The usage of landscape ecological concepts in the planning...
access.earthdata.nasa.gov
recerca.uoc.edu
Updated Oct 25, 2021
Cite
(2021). The usage of landscape ecological concepts in the planning literature [Dataset]. http://doi.org/10.16904/envidat.254
Unique identifier
https://doi.org/10.16904/envidat.254
Dataset updated
Oct 25, 2021
Time period covered
Jan 1, 2021
Description
Table of contents: 1. Frequency of early concepts; 2. Frequency of additional concepts; 3. Use of any early concept; 4. Use of any additional concept; 5. Planning steps; 6. Protocol.
The present dataset is part of the published scientific paper entitled “Landscape ecological concepts in planning: review of recent developments” (Hersperger et al., 2021). The goal of this research was to review recent publications to assess the use of landscape ecological concepts in planning. Specifically, we address the following research questions: Q1. Landscape ecological concepts: What are they? How frequently are they mentioned in current research? Q2. How are landscape ecological concepts integrated in landscape planning?
We analysed all empirical and overview papers that have been published in four key academic journals in the field of landscape ecology and landscape planning in the years 2015–2019 (n = 1918). The four journals selected for the analysis were Landscape Ecology (LE), Landscape Online (LO), Current Landscape Ecology Reports (CLER), and Landscape and Urban Planning (LUP). The title, abstract and keywords of all papers were read in order to identify landscape ecological concepts. Then, all 1918 papers went through a keyword search to identify the use of early and additional concepts. We used the “pdfsearch” package in the R programming language and searched for singular and plural forms and different variations of the concepts (see Supplementary material 1, Table A). As a result, we provided four outputs:
1. Frequency of early concepts. This data provides the total number of times each article used each early concept (Q1). This data was used to produce Figure 2a in the original publication.
2. Frequency of additional concepts. This data provides the total number of times each article used each additional concept (Q1). This data was used to produce Figure 2b in the original publication.
3. Use of any early concept. This data provides the total number of times each article used any early concept (Q1). This data was used to produce Figure 3a in the original publication.
4. Use of any additional concept. This data provides the total number of times each article used any additional concept (Q1). This data was used to produce Figure 3b in the original publication.
To address the second question (Q2), the title, abstract and keywords of the papers included in our sample (n=1918 articles) were screened to identify papers that might show how landscape ecological concepts are integrated into planning. We selected 52 empirical papers (see Supplementary material – 4 Integration of landscape ecological concepts into planning), and we provided two outputs:
5. Planning steps. This data provides the number of times landscape ecological concepts were addressed in each planning step in the 52 empirical papers analysed in detail (Q2). This data was used to produce Figure 4 in the original publication.
6. Protocol for assessing the integration of landscape ecological concepts into planning. To systematically collect the data, we used this protocol, which addressed the following questions: (a) which type of planning is addressed by the paper? (b) to which planning level does the paper refer? (c) which concepts are integrated in any of the planning steps described above?
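The keyword search for singular and plural forms can be sketched as follows. The authors used the “pdfsearch” package in R; this Python stand-in only illustrates the counting idea on plain text. The function name `count_concept`, the sample sentence, and the two example concepts are hypothetical and are not taken from the dataset or its supplementary tables.

```python
import re

def count_concept(text, singular, plural=None):
    """Count whole-word, case-insensitive matches of a concept's singular and plural forms."""
    plural = plural or singular + "s"  # assumption: regular plural unless one is supplied
    pattern = re.compile(
        r"\b(?:%s|%s)\b" % (re.escape(singular), re.escape(plural)),
        re.IGNORECASE,
    )
    return len(pattern.findall(text))

# Hypothetical article text and concept list
text = "Corridors connect habitat patches. A corridor may cross a patch boundary."
counts = {
    "corridor": count_concept(text, "corridor"),
    "patch": count_concept(text, "patch", "patches"),
}
print(counts)  # {'corridor': 2, 'patch': 2}
```

Summing such counts per article over a list of concepts would give a per-article total comparable in spirit to outputs 3 and 4, although the original analysis may have handled spelling variants differently.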
CORDEX inflectional lookup data 1.0
live.european-language-grid.eu
marketplace.sshopencloud.eu
Updated Sep 7, 2023
Cite
(2023). CORDEX inflectional lookup data 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/22996
Dataset updated
Sep 7, 2023
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The inflectional data lookup module serves as an optional component within the cordex library (https://github.com/clarinsi/cordex/) that significantly improves the quality of the results. The module consists of a pickled dictionary of 111,660 lemmas, and maps these lemmas to their corresponding word forms.

Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic descriptions, relevant features (custom features extracted from morphosyntactic descriptions with the help of https://gitea.cjvt.si/generic/conversion_utils), and its frequency within the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320), or Gigafida 1.0 when other information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g. find the most frequently used word form of the lemma "centralen" in the singular, i.e. "centralni").
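The selection step described above (pick the most frequent word form of a lemma that satisfies some filtering conditions) might look like the Python sketch below. The dictionary layout, the feature names, the frequency values, and the function `most_frequent_form` are all hypothetical; the structure of the actual pickled dictionary and the cordex API are not documented here and may differ.

```python
def most_frequent_form(lookup, lemma, **required):
    """Return the most frequent form of `lemma` whose features match all `required` items."""
    candidates = [
        entry for entry in lookup.get(lemma, [])
        if all(entry["features"].get(key) == value for key, value in required.items())
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry["frequency"])["form"]

# Hypothetical layout with made-up frequencies: lemma -> list of word-form entries
lookup = {
    "centralen": [
        {"form": "centralni", "features": {"number": "singular"}, "frequency": 900},
        {"form": "centralen", "features": {"number": "singular"}, "frequency": 50},
        {"form": "centralnih", "features": {"number": "plural"}, "frequency": 400},
    ]
}

print(most_frequent_form(lookup, "centralen", number="singular"))  # centralni
```

Returning `None` for an unknown lemma or an unsatisfiable filter is a design choice of this sketch, so that a caller can fall back to the lemma itself.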
Widefield imaging data from the publication, Cortical State Fluctuations...
rdr.ucl.ac.uk
zip
Updated Jun 1, 2023
Cite
Elina Jacobs (2023). Widefield imaging data from the publication, Cortical State Fluctuations during Sensory Decision Making [Dataset]. http://doi.org/10.5522/04/13194452.v1
Available download formats: zip
Unique identifier
https://doi.org/10.5522/04/13194452.v1
Dataset updated
Jun 1, 2023
Dataset provided by
University College London
Authors
Elina Jacobs
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This site contains the widefield imaging datasets from the publication Cortical State Fluctuations during Sensory Decision Making, by Jacobs et al. in Current Biology. This data is from the behavioural tasks described in the publication, and is in a compressed SVD format (see Methods in the publication for more details). The companion code is designed to take the data in this format. The datasets provided here contain the top 500 singular values, which is how the data in the publication was analysed, as this was found to sufficiently capture the data. The data containing up to 2000 singular values can be shared on request. The timestamps of the datasets here are not all aligned with the behavioural datasets; the companion code takes care of this. The data is organised by experimental subject; most subjects were recorded on multiple days, which form subfolders within the subject folder. Within a day, there may have been several experiments, which again form subfolders within the day folder. The companion code expects this data organisation. The companion code is available at: https://github.com/eakjacobs/Jacobs_et_al_CurrentBiology For more information and links to the behavioural and pupil datasets, please follow this link: https://doi.org/10.6084/m9.figshare.13084805 The research article can be found (freely available) at https://www.cell.com/current-biology/fulltext/S0960-9822(20)31437-8
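To illustrate what a compressed SVD format means in general, the following Python sketch keeps the top k singular components of a pixels-by-time matrix and rebuilds the matrix from them. The synthetic data, the matrix sizes, and the function names `compress` and `reconstruct` are assumptions for illustration; the actual file layout and variable names in this dataset are defined by the companion code and are not reproduced here.

```python
import numpy as np

def compress(frames, k):
    """Keep the top-k singular components of a (pixels x time) matrix."""
    U, s, Vt = np.linalg.svd(frames, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def reconstruct(U, s, Vt):
    """Rebuild the (pixels x time) matrix from its stored components."""
    return (U * s) @ Vt

rng = np.random.default_rng(1)
# Synthetic low-rank "movie": 200 pixels x 300 time points with rank 5
frames = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 300))

U, s, Vt = compress(frames, k=5)
approx = reconstruct(U, s, Vt)
relative_error = np.linalg.norm(frames - approx) / np.linalg.norm(frames)
print(relative_error < 1e-10)
```

Storing `U`, `s`, and `Vt` takes 200·5 + 5 + 5·300 numbers instead of 200·300, which is the kind of saving that motivates truncating to a fixed number of components.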
Matches and Mismatches in Nominal Morphology and Agreement: Learning from...
datacatalogue.cessda.eu
beta.ukdataservice.ac.uk
Updated Mar 5, 2025
Cite
Brown, D; Sagna, S (2025). Matches and Mismatches in Nominal Morphology and Agreement: Learning from the Acquisition of Eegimaa, 2017-2020 [Dataset]. http://doi.org/10.5255/UKDA-SN-855042
Unique identifier
https://doi.org/10.5255/UKDA-SN-855042
Dataset updated
Mar 5, 2025
Dataset provided by
University of York
Authors
Brown, D; Sagna, S
Time period covered
Apr 1, 2017 - Nov 30, 2020
Area covered
Senegal
Variables measured
Family, Family: Household family
Measurement technique
Naturalistic data collection, video and recording of child interactions in the Mof Ávvi villages, near Ziguinchor. Our study is composed of a longitudinal and a cross-sectional study. In the longitudinal study, we follow 6 children from age 1;10 to 4;0, recording them every 15 days. Most research in child language acquisition has been carried out in the Global North, and longitudinal studies tend to follow one or two children. For our cross-sectional research, we record 10 children once at 3;0 and then at 4;0, with the aim of comparing their language production to that of the longitudinal group at the same age points. In addition to these target children, recordings contain a variety of participants including multiple caregivers, multiple playmates and members of the communities who interact with children on a regular basis. Children studied in this research learn to speak in a polyadic environment.
Description
This archive contains a unique collection of naturalistic child language data collected between 2017 and 2020 in Southern Senegal. The deposit contains ELAN files of annotated data based on recordings of children's production and child-directed speech in naturalistic settings. The language under investigation is Eegimaa, a Jóola language of southern Senegal. This is part of the Atlantic branch of the Niger‑Congo Phylum. The data was collected as part of a research project which investigates the acquisition of an Atlantic noun class system. Our research looks at the factors underlying children’s learning of nominal class prefixes and syntactic and semantic agreement at the level of the NP.

We focus on questions including the following.

• Which elements of noun class morphology do children begin to use productively?

• What is the role of input frequency, morphological salience, and transparency in children’s acquisition of noun class and agreement in Eegimaa?

• Are errors in the production of nominal class prefixes also reflected in children’s use of the corresponding agreement markers?
Theoretical accounts of the strategies used by children to learn the structures of words and grammatical features of languages differ considerably, but our knowledge of what is possible is limited by the existing focus on a relatively small number of languages associated with industrialised nations. Here, we will investigate grammatical features and structures that may be expressed in a variety of different ways. Examples of grammatical features include number, e.g. the distinction between singular and plural, or gender, e.g. distinguishing masculine and feminine in languages like French, features expressed within the shape of the word and associated items. Grammatical structure may be manifested in agreement across the separate words of a noun phrase (e.g. The cat purrs, where the -s on 'purrs' shows agreement with cat, indicating that there is only one cat.) This project investigates the acquisition of inflectional morphology, i.e., grammatical features and structures as reflected in the word forms and associated agreement, in Gújjolaay Eegimaa, a language of the Atlantic family of the Niger Congo phylum spoken in Southern Senegal. This language has a gender system of the type traditionally known as a noun class system. Noun class systems with complex gender agreement are characteristic of the Niger-Congo languages. In Eegimaa nouns use prefixes to form singular and plural. For example ba- is the singular marker for ba-ginh 'chest', but its plural marker is u- as in u-ginh 'chests'. Nouns which have the same singular prefix, e.g. ba-, can form their plural with a different marker (e.g., bá-jur 'young woman', plural sú-jur 'young women'). Eegimaa has a complex morphological system of gender and number marking which is also reflected in its agreement system. Current knowledge as to how children acquire gender/noun class marking and agreement is based entirely on the Bantu languages of the Niger Congo family. 
There are no studies available of Atlantic languages, which, though similar to Bantu in some ways, also have important differences. Here we will investigate the influence of the three factors found to affect children's acquisition of noun class morphology and agreement, namely: i) Input frequency, according to which the forms that children hear the most will tend to be acquired first ii) Perceptual salience, according to which more salient forms such as stressed syllables will tend to be acquired first, and iii) Morphological transparency, according to which forms whose meanings are easily determined will tend to be acquired more easily than those whose meanings are more obscure. Our study will build on findings on the acquisition of Bantu noun class systems, and will aim to answer questions such as the following. What strategies do children rely on to learn complex language structure? What is the role of adult input language in the acquisition of morphology and agreement in Eegimaa? How do children cope with variation in language input from their caregivers? In what order do they learn the different noun class markers? We will carry out a longitudinal study in which we will observe over three years the interactions of five children aged from about 2 to 4 years with their caregivers. Among Eegimaa speakers, caregivers include children's parents, older siblings and other members of the community. Children's daytime activities mostly take place outside their homes. We will record children's output speech on audio and video and compare the data with child-directed speech from adults and with adult-directed speech (interactions between adults), collected as part of a previous project. We will also carry out a cross-sectional study by twice observing the speech of ten additional children at two points, at ages 3 and 4 years. These studies together will provide both an in-depth look and a broader overview of the...
Background data (adapted from Jenset & McGillivray 2017) for: Down-sampling...
dataverse.no
dataverse.azure.uit.no
bin, text/tsv, txt
Updated Oct 24, 2023
Cite
Lukas Sönning (2023). Background data (adapted from Jenset & McGillivray 2017) for: Down-sampling from hierarchically structured corpus data [Dataset]. http://doi.org/10.18710/5KCE4U
Available download formats: bin (13462), text/tsv (2120816), txt (12381)
Unique identifier
https://doi.org/10.18710/5KCE4U
Dataset updated
Oct 24, 2023
Dataset provided by
DataverseNO
Authors
Lukas Sönning
License
https://dataverse.no/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18710/5KCE4U
Time period covered
Jan 1, 1500 - Dec 31, 1707
Area covered
United Kingdom
Description
Dataset description: This dataset, which is adapted from Jenset and McGillivray (2017), contains tabular files documenting the alternating usage of -(e)th and -(e)s to mark third-person verb inflection in Early Modern English. The data provided by Jenset and McGillivray (2017) are drawn from the PPCEME corpus (Kroch et al. 2004) and cover the period from 1500 to 1700. In total, 13,757 third-person singular tokens (excluding the verb BE) were annotated by these authors for a range of variables. For the purposes of the present methodological study, this dataset was reduced to a subset of 11,645 tokens, and the coding of variables was in some parts revised, completed, or modified. The dataset includes information about the Author and Verb Lemma, as well as a number of predictor variables, including Genre, Year, Frequency (of the verb lemma in the third-person singular), Phonological Context (stem-final sound), and the Gender of the author.
Abstract for related publication: Resource constraints often force researchers to down-size the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: Year, Gender, Genre, Frequency, and Phonological Context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 sub-samples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria.
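The idea of structured down-sampling can be sketched in Python as below. The abstract does not spell out the exact weighting, so this sketch assumes a weight of 1 / (author token count × verb token count) and uses Efraimidis-Spirakis weighted sampling without replacement as a stand-in for the paper's procedure. The function name `structured_downsample`, the token layout, and the toy data are hypothetical.

```python
import random
from collections import Counter

def structured_downsample(tokens, k, seed=0):
    """Draw k tokens without replacement, weighting each by 1 / (author count * verb count)."""
    rng = random.Random(seed)
    by_author = Counter(t["author"] for t in tokens)
    by_verb = Counter(t["verb"] for t in tokens)
    keyed = []
    for t in tokens:
        # Assumed weighting: tokens from prolific authors and frequent verbs get small weights
        weight = 1.0 / (by_author[t["author"]] * by_verb[t["verb"]])
        # Efraimidis-Spirakis key: tokens with larger weights tend to receive larger keys
        keyed.append((rng.random() ** (1.0 / weight), t))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in keyed[:k]]

# Toy data: author A dominates the corpus, authors B and C contribute few tokens
tokens = (
    [{"author": "A", "verb": "have"} for _ in range(20)]
    + [{"author": "B", "verb": "say"} for _ in range(5)]
    + [{"author": "C", "verb": "come"} for _ in range(5)]
)
sample = structured_downsample(tokens, k=10)
print(Counter(t["author"] for t in sample))
```

Simple down-sampling would correspond to giving every token the same weight, in which case the dominant author would be expected to fill most of the sample.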
Data from: Morphological patterns from the Sloleks 2.0 lexicon 1.0
clarin.si
live.european-language-grid.eu
Updated Oct 26, 2022
Cite
Špela Arhar Holdt; Jaka Čibej; Cyprian Laskowski; Simon Krek (2022). Morphological patterns from the Sloleks 2.0 lexicon 1.0 [Dataset]. https://www.clarin.si/repository/xmlui/handle/11356/1411?show=full
Explore at:
Dataset updated
Oct 26, 2022
Authors
Špela Arhar Holdt; Jaka Čibej; Cyprian Laskowski; Simon Krek
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This entry consists of XML files with 96,290 lexical units (nouns, verbs, adjectives, and adverbs) from the Sloleks Morphological Lexicon of Slovene 2.0 (http://hdl.handle.net/11356/1230) that include codes for morphological patterns.

The pattern codes were designed based on a manual analysis of automatically extracted paradigms and were obtained as follows: The lexical units from Sloleks 2.0 were first automatically clustered into groups through a rule-based approach based on (1) a number of predetermined grammatical features from the MULTEXT-East Version 6 morphosyntactic specifications for Slovenian (http://nl.ijs.si/ME/V6/), such as part of speech, gender and properness for nouns, aspect for verbs, and (2) the differentiating characteristics of their morphological paradigms (i.e. their mutable word parts, which are similar to but not always overlapping with the linguistic definition of word endings – for example: čas-Ø; čas-a; čas-om / prijatelj-Ø; prijatelj-a; prijatelj-em / odstot-ek; odstot-ka; odstot-kom).
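The clustering idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to build the resource: the mutable word part is approximated here as whatever follows the longest common prefix of a unit's word forms, the feature tuples are toy values, and the lemma brat is an illustrative addition that does not come from the description.

```python
import os
from collections import defaultdict


def mutable_parts(forms):
    """Split off the longest common prefix and return the remaining,
    mutable part of each form ("Ø" marks an empty remainder)."""
    stem = os.path.commonprefix(forms)  # character-wise common prefix
    return tuple(form[len(stem):] or "Ø" for form in forms)


def cluster(lexicon):
    """Group lemmas that share both grammatical features and mutable parts."""
    groups = defaultdict(list)
    for lemma, (features, forms) in lexicon.items():
        groups[(features, mutable_parts(forms))].append(lemma)
    return dict(groups)


# Toy lexicon: three paradigms from the description plus one assumed lemma.
NOUN_M = ("noun", "masculine", "common")  # hypothetical feature tuple
lexicon = {
    "čas": (NOUN_M, ["čas", "časa", "časom"]),
    "prijatelj": (NOUN_M, ["prijatelj", "prijatelja", "prijateljem"]),
    "odstotek": (NOUN_M, ["odstotek", "odstotka", "odstotkom"]),
    "brat": (NOUN_M, ["brat", "brata", "bratom"]),  # illustrative addition
}

for (features, parts), lemmas in cluster(lexicon).items():
    print(features, parts, lemmas)
```

In this sketch čas and brat fall into the same candidate group (Ø, a, om), while prijatelj (Ø, a, em) and odstotek (ek, ka, kom) each form their own, which mirrors the segmentation given in the example above.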

More than 1,000 automatically extracted pattern candidates were subsequently linguistically analyzed, combined into groups, and hierarchically organized. As a result, every lexical unit in the XML file features a code (listed as

Because the patterns were extracted from Sloleks 2.0, they reflect the decisions that were implemented in its initial compilation, particularly in terms of the degree of morphological variation documented in the lexicon (e.g. not all morphological variants are necessarily included in the lexicon) and paradigm integrity (for instance, some nouns in Sloleks 2.0 only feature singular or plural forms). It should be noted that non-standard word forms were not included in the design of the patterns. In addition, the XML file does not contain lexical units from Sloleks 2.0 that consist of word forms from more than one morphological paradigm (e.g. lesketati – lesketam / leskečem; or lojen – lojenega / lojnega), or other problematic units (such as those with missing or erroneous data).
Data from: Ancient Greek language models
data.niaid.nih.gov
zenodo.org
Updated Apr 29, 2024
Cite
Stopponi (2024). Ancient Greek language models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8369515
Explore at:
Dataset updated
Apr 29, 2024
Dataset provided by
McGillivray
Peels-Matthey
Pedrazzini
Stopponi
Nissim
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.

Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.

[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.

[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006

Diachronica models

Training data

Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:

Classical subcorpus

Hellenistic subcorpus

Whole corpus

Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenestic is appended to the name of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, full_ for the whole corpus).

Models

Count-based

Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.

b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
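As a rough illustration of what the hyperparameters window, k, and alpha control in a PPMI space, the following pure-Python sketch computes shifted, context-smoothed PPMI values from co-occurrence counts. It is not the LSCDetection implementation: the symmetric window, the formula max(0, PMI_alpha(w, c) - log k) with context counts raised to the power alpha, and the function names are assumptions chosen to match a common formulation of these parameters.

```python
import math
from collections import Counter


def cooccurrences(sentences, window=5):
    """Count (word, context) pairs within `window` tokens on either side."""
    pairs = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs[(word, sent[j])] += 1
    return pairs


def ppmi(pairs, k=1, alpha=0.75):
    """Return a sparse dict of positive values of PMI_alpha(w, c) - log(k)."""
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pairs.items():
        word_counts[word] += n
        context_counts[context] += n
    # Normaliser of the smoothed context distribution: sum of count ** alpha.
    context_norm = sum(n ** alpha for n in context_counts.values())
    matrix = {}
    for (word, context), n in pairs.items():
        pmi = math.log(
            n * context_norm / (word_counts[word] * context_counts[context] ** alpha)
        )
        value = pmi - math.log(k)  # shift by log k (k=1 leaves PMI unchanged)
        if value > 0:  # keep only positive entries
            matrix[(word, context)] = value
    return matrix


# Toy corpus with placeholder tokens (hypothetical data).
toy_pairs = cooccurrences([["x", "y"], ["x", "z"]], window=5)
toy_ppmi = ppmi(toy_pairs, k=1, alpha=0.75)
```

In this formulation, alpha below 1 dampens the PMI bias towards rare contexts, and k greater than 1 subtracts a constant from every cell, which makes the matrix sparser.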

Word2Vec

Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).

a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.

b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.

Syntactic word embeddings

Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.

ALP models

Training data

Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.

Models

Count-based

Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.

b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.

Word2Vec

Software used: Gensim library (Řehůřek and Sojka, 2010)

a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.

b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.

References

Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.

Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.

Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).

Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.

Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.

Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.

Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.

Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3(1). 55-65. https://doi.org/10.1163/24523666-01000013

Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.
Data from: The package: nonparametric regression using local rotation...
tandf.figshare.com
text/x-tex
Updated May 31, 2023
Cite
Marco Di Marzio; Stefania Fensore; Giovanni Lafratta; Charles C. Taylor (2023). The package: nonparametric regression using local rotation matrices in [Dataset]. http://doi.org/10.6084/m9.figshare.14387361.v1
Explore at:
Available download formats: text/x-tex
Unique identifier
https://doi.org/10.6084/m9.figshare.14387361.v1
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Marco Di Marzio; Stefania Fensore; Giovanni Lafratta; Charles C. Taylor
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The package implements nonparametric (smooth) regression for spherical data in R, and is freely available from the Comprehensive R Archive Network (CRAN), licensed under the MIT License. It can be used for regression when both the response and explanatory variables lie on the unit sphere. The model uses a flexible kernel-type regression determined by a rotation which depends on a smoothing parameter as well as the prediction point. A particular kernel is proposed and a smoothing parameter selection procedure is also provided. Finally, some examples are included in the package.
Data from: Enantiocontrolled Cyclization to Form Chiral 7- and 8‑Membered...
acs.figshare.com
zip
Updated Jan 23, 2025
Cite
Nicolò Tampellini; Brandon Q. Mercado; Scott J. Miller (2025). Enantiocontrolled Cyclization to Form Chiral 7- and 8‑Membered Rings Unified by the Same Catalyst Operating with Different Mechanisms [Dataset]. http://doi.org/10.1021/jacs.4c17080.s002
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1021/jacs.4c17080.s002
Dataset updated
Jan 23, 2025
Dataset provided by
ACS Publications
Authors
Nicolò Tampellini; Brandon Q. Mercado; Scott J. Miller
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Chiral medium-sized rings, albeit displaying attractive properties for drug development, suffer from numerous synthetic challenges due to difficult cyclization steps that must take place to form these unusually strained, atropisomeric rings from sterically crowded precursors. In fact, catalytic enantioselective cyclization methods for the formation of chiral seven-membered rings are unknown, and the corresponding eight-membered variants are also sparse. In this work, we present a substrate preorganization-based, enantioselective, organocatalytic strategy to construct seven- and eight-membered rings featuring chirality that is intrinsic to the ring in the absence of singular stereogenic atoms or single bond axes of chirality. The reactions proceed under mild conditions and with high levels of stereocontrol. Notably, the same bifunctional iminophosphorane chiral catalyst orchestrates the cyclization of substrates of two different ring sizes, under two different mechanistic paradigms. We envision that the mechanistic and ring size versatility of this method could guide further applications of asymmetric catalysis to other challenging cyclization reactions.
Singular sports venues by type, by autonomous community
data.europa.eu
unknown
Updated Feb 1, 2025
Cite
Ministerio de Educación y Formación Profesional (2025). Singular sports venues by type, by autonomous community [Dataset]. https://data.europa.eu/data/datasets/https-datos-gob-es-catalogo-e05024101-espacios_deportivos_singulares_tipologia_segun_comunidad_autonoma?locale=bg
Explore at:
Available download formats: unknown
Dataset updated
Feb 1, 2025
Dataset authored and provided by
Ministerio de Educación y Formación Profesional
License
https://www.educacionyfp.gob.es/comunes/aviso-legal.html
Description
This section offers the main results obtained from the Statistical Exploitation of the National Census of Sports Facilities 2005. The project, prepared by the Consejo Superior de Deportes, relies on the collaboration of the competent bodies of the autonomous communities and cities.