Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many complex diseases are caused by a variety of both genetic and environmental factors acting in conjunction. To help understand these relationships, nonparametric methods that use aggregate learning have been developed such as random forests and conditional forests. Molinaro et al. (2010) described a powerful, single model approach called partDSA that has the advantage of producing interpretable models. We propose two extensions to the partDSA algorithm called bagged partDSA and boosted partDSA. These algorithms achieve higher prediction accuracies than individual partDSA objects through aggregating over a set of partDSA objects. Further, by using partDSA objects in the ensemble, each base learner creates decision rules using both “and” and “or” statements, which allows for natural logical constructs. We also provide four variable ranking techniques that aid in identifying the most important individual factors in the models. In the regression context, we compared bagged partDSA and boosted partDSA to random forests and conditional forests. Using simulated and real data, we found that bagged partDSA had lower prediction error than the other methods if the data were generated by a simple logic model, and that it performed similarly for other generating mechanisms. We also found that boosted partDSA was effective for a particularly complex case. Taken together these results suggest that the new methods are useful additions to the ensemble learning toolbox. We implement these algorithms as part of the partDSA R package. Supplementary materials for this article are available online.
https://www.archivemarketresearch.com/privacy-policy
The global partition management software market is experiencing robust growth, driven by the increasing adoption of cloud computing, virtualization, and the expanding need for efficient data management across diverse organizational structures. The market, estimated at $2.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 12% from 2025 to 2033. This growth is fueled by several key factors. The rise of large-scale data centers necessitates sophisticated partition management tools for optimal resource allocation and performance. Simultaneously, the growing prevalence of hybrid cloud environments and the need for seamless data migration across platforms are creating significant demand for versatile and reliable software solutions. Furthermore, small and medium-sized enterprises (SMEs) are increasingly adopting these tools to improve data organization and simplify IT management tasks, contributing to market expansion. The web-based segment is currently the leading contributor to market revenue, owing to its accessibility and cost-effectiveness, while the cloud-based segment is anticipated to demonstrate the highest growth rate during the forecast period due to its scalability and enhanced security features. Geographic expansion into rapidly developing economies in Asia Pacific and the Middle East & Africa also contributes to the overall market expansion. However, certain restraints affect market growth. These include the high initial investment costs for advanced software solutions, the availability of free open-source alternatives, and the complexity of managing partitions in heterogeneous environments. Despite these challenges, the ongoing digital transformation across industries and the increasing reliance on data-driven decision-making ensure that the partition management software market will maintain a positive trajectory in the coming years. The market is witnessing a trend towards AI-powered automation in partition management tasks, improving efficiency and reducing the need for specialized IT personnel. Furthermore, vendors are focusing on enhanced user interfaces and improved integration capabilities with other enterprise software, furthering the market's evolution. The continued development of innovative features and solutions will drive substantial growth in the global partition management software market, creating considerable opportunities for industry players.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predictive analytics involves the use of statistical models to make predictions; however, the power of these techniques is hindered by ever-increasing quantities of data. The richness and sheer volume of big data can have a profound effect on computation time and/or numerical stability. In the current study, we develop a novel approach to subsampling with the aim of overcoming this issue when dealing with regression problems in a supervised learning framework. The proposed method integrates stratified sampling and is model-independent. We assess the theoretical underpinnings of the proposed subsampling scheme, and demonstrate its efficacy in yielding reliable predictions with desirable robustness when applied to different statistical models. Supplementary materials for this article are available online.
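To illustrate the general idea of subsampling stratified on the response in a regression setting, the following minimal Python sketch bins the response into quantile strata and samples proportionally from each stratum. It is only a generic illustration of stratified subsampling and is not the model-independent scheme proposed in the article; the function and parameter names are placeholders.

# Generic stratified subsampling sketch (illustrative only; not the article's method).
import numpy as np
import pandas as pd

def stratified_subsample(df: pd.DataFrame, response: str, frac: float,
                         n_strata: int = 10, seed: int = 0) -> pd.DataFrame:
    """Return a subsample of df that roughly preserves the distribution of the response."""
    strata = pd.qcut(df[response], q=n_strata, duplicates="drop")  # quantile-based strata
    return (df.groupby(strata, group_keys=False, observed=True)
              .apply(lambda g: g.sample(frac=frac, random_state=seed)))

# Example with synthetic data: keep 10% of one million rows.
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.normal(size=1_000_000)})
data["y"] = 2.0 * data["x"] + rng.normal(size=len(data))
sub = stratified_subsample(data, response="y", frac=0.10)
print(len(sub))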
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Acyclic digraphs are the underlying representation of Bayesian networks, a widely used class of probabilistic graphical models. Learning the underlying graph from data is a way of gaining insights about the structural properties of a domain. Structure learning forms one of the inference challenges of statistical graphical models. Markov chain Monte Carlo (MCMC) methods, notably structure MCMC, which sample graphs from the posterior distribution given the data, are probably the only viable option for Bayesian model averaging. Score modularity and restrictions on the number of parents of each node allow the graphs to be grouped into larger collections, which can be scored as a whole to improve the chain’s convergence. Current examples of algorithms taking advantage of grouping are the biased order MCMC, which acts on the alternative space of permuted triangular matrices, and nonergodic edge reversal moves. Here, we propose a novel algorithm, which employs the underlying combinatorial structure of DAGs to define a new grouping. As a result, convergence is improved compared to structure MCMC, while still retaining the property of producing an unbiased sample. Finally, the method can be combined with edge reversal moves to improve the sampler further. Supplementary materials for this article are available online.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data files for a research paper, "Ethnicity-Based Name Partitioning for Author Name Disambiguation Using Supervised Machine Learning," published in the Journal of the Association for Information Science and Technology.
Four zipped files are uploaded. Each zipped file contains five data files: signatures_train.txt, signatures_test.txt, records.txt, clusters_train.txt, and clusters_test.txt.
1. 'Signatures' files contain lists of name instances. Each name instance (a row) is associated with the following information.
- 1st column: instance id (numeric): unique id assigned to a name instance
- 2nd column: paper id (numeric): unique id assigned to a paper in which the name instance appears as an author name
- 3rd column: byline position (numeric): integer indicating the position of the name instance in the authorship byline of the paper
- 4th column: author name (string): name string formatted as surname, comma, and forename(s)
- 5th column: ethnic name group (string): name ethnicity assigned by Ethnea to the name instance
- 6th column: affiliation (string): affiliation associated with the name instance, if available in the original data
- 7th column: block (string): simplified name string of the name instance indicating its block membership (surname and first forename initial)
- 8th column: author id (string): unique author id (i.e., author label) assigned by the creators of the original data
2. 'Records' files contain lists of papers. Each paper is associated with the following information.
- 1st column: paper id (numeric): unique paper id; this is the paper id (2nd column) in Signatures files
- 2nd column: year (numeric): year of publication. Note that some papers may have wrong publication years due to incorrect indexing or delayed updates in the original data.
- 3rd column: venue (string): name of the journal or conference in which the paper was published. Venue names can be in full or in a shortened format, following the formats in the original data.
- 4th column: authors (string; separated by vertical bar): list of author names that appear in the paper's byline. Author names are formatted as surname, comma, and forename(s).
- 5th column: title words (string; separated by space): words in the title of the paper. Note that common words are stop-listed and each remaining word is stemmed using Porter's stemmer.
3. 'Clusters' files contain lists of clusters. Each cluster is associated with the following information.
- 1st column: cluster id (numeric): unique id of a cluster
- 2nd column: list of name instance ids (Signatures, 1st column) that belong to the same unique author id (Signatures, 8th column)
Signatures and Clusters files consist of two subsets (train and test files) of the original labeled data, randomly split 50%-50% by the authors of this study.
Original labeled data for AMiner.zip, KISTI.zip, and GESIS.zip came from the studies cited below. If you use one of the uploaded data files, please cite them accordingly.
[AMiner.zip]
Tang, J., Fong, A. C. M., Wang, B., & Zhang, J. (2012). A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6), 975-987. doi:10.1109/Tkde.2011.13
Wang, X., Tang, J., Cheng, H., & Yu, P. S. (2011). ADANA: Active Name Disambiguation. Paper presented at the 2011 IEEE 11th International Conference on Data Mining.
[KISTI.zip]
Kang, I. S., Kim, P., Lee, S., Jung, H., & You, B. J. (2011). Construction of a Large-Scale Test Set for Author Disambiguation. Information Processing & Management, 47(3), 452-465. doi:10.1016/j.ipm.2010.10.001
Note that the original KISTI data contain errors and duplicates. This study reuses the revised version of KISTI reported in the study below.
Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867-1886. doi:10.1007/s11192-018-2824-5
[GESIS.zip]
Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. Paper presented at the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), Hannover, Germany.
Note that this study reuses the 'Evaluation Set' of the original GESIS data, to which titles were added by the study below.
Kim, J., & Kim, J. (2020). Effect of forename string on author name disambiguation. Journal of the Association for Information Science and Technology, 71(7), 839-855. doi:10.1002/asi.24298
[UM-IRIS.zip]
This labeled dataset was created for this study. For a description of the labeling method, please see 'Method' in the paper below.
Kim, J., Kim, J., & Owen-Smith, J. (In press). Ethnicity-based name partitioning for author name disambiguation using supervised machine learning. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24459
For details on the labeling method and limitations, see the paper below.
Kim, J., & Owen-Smith, J. (2021). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
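As an illustration, a 'Signatures' file with the eight columns listed above could be parsed as follows. The column delimiter is not stated in this description, so the tab separator used here is an assumption; adjust it to match the actual files.

# Illustrative parsing sketch for a 'Signatures' file (assumed tab-separated).
import csv

SIGNATURE_FIELDS = ["instance_id", "paper_id", "byline_position", "author_name",
                    "ethnic_name_group", "affiliation", "block", "author_id"]

def read_signatures(path, delimiter="\t"):
    """Return one dict per name instance, keyed by the column names above."""
    with open(path, encoding="utf-8") as fh:
        return [dict(zip(SIGNATURE_FIELDS, row)) for row in csv.reader(fh, delimiter=delimiter)]

# Example: group name-instance ids by block (surname plus first forename initial).
# signatures = read_signatures("signatures_train.txt")
# blocks = {}
# for s in signatures:
#     blocks.setdefault(s["block"], []).append(s["instance_id"])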
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Constructing the Dunhuang mural dataset is the cornerstone of this study, laying a solid data foundation for subsequent image description generation tasks. The entire process systematically encompasses four key steps: data acquisition, data augmentation, data annotation, and data partitioning. While pursuing data breadth and diversity, it also optimizes data quality and application potential through technical means.
In the data acquisition phase, we innovatively integrated both online and offline methods. Online, we utilized web scraping technology to efficiently collect a large number of high-resolution mural images from sources such as the official website of the Dunhuang Research Academy, major art exhibition platforms, digital museums, and professional image libraries. Offline, we actively established collaborations with museums, art institutions, and researchers. After obtaining formal permissions, we employed high-resolution photography equipment and scanners to meticulously capture physical murals, ensuring high image quality and rich detail. These two approaches complement each other, jointly creating an image data pool with broad representativeness and high diversity.
During the data augmentation stage, to expand the dataset's scale and enhance the model's generalization ability, we introduced a series of advanced image processing techniques. These included geometric transformations (such as rotation, mirroring, and scaling to simulate different perspectives), color adjustments (modifying brightness, contrast, and saturation to adapt to varying lighting conditions), random noise addition (mimicking interference factors in real-world photography), and image synthesis (integrating mural elements into different backgrounds or scenes). These techniques not only effectively simulated the various states in which murals might appear in the real world, significantly enriching the dataset's diversity, but also made the image data encountered during model training more aligned with real-world application scenarios, thereby improving its robustness.
Data annotation, as the core component of dataset construction, directly impacts the performance of subsequent models. We specifically invited senior experts in the field of Dunhuang murals to provide professional and detailed textual descriptions for each mural image in the dataset. These descriptions were meticulous, covering not only the specific content, themes, and artistic style characteristics of the murals but also delving into the symbolic meanings of symbols and elements within them. For example, a description like "The flying celestial being on the left wears an Indian-style crown, with the left leg bent forward and the right leg extended backward, holding a scarf in the left hand and a flower tray in the right" accurately records the posture and details of the celestial being. To ensure annotation consistency and high quality, we developed detailed annotation guidelines and conducted multiple rounds of strict quality review and verification for all annotated results.
Ultimately, through this complete and closed-loop process of data acquisition, augmentation, annotation, and partitioning, we successfully constructed a high-quality Dunhuang mural dataset. This dataset not only contains rich and diverse image information paired with precise textual descriptions but also provides a solid and reliable data foundation for the training, optimization, and evaluation of deep learning-based image description generation models.
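To make the augmentation step concrete, the sketch below applies the same families of transformations described above (geometric transformations, color adjustments, and random noise) using Pillow and NumPy. It is purely illustrative: the parameter ranges and library choice are assumptions rather than the authors' pipeline, and the image synthesis step is omitted.

# Illustrative augmentation sketch (not the authors' pipeline).
import numpy as np
from PIL import Image, ImageEnhance, ImageOps

def augment(img: Image.Image, rng: np.random.Generator) -> Image.Image:
    # Geometric transformations: small rotation, optional mirroring, rescaling
    img = img.rotate(rng.uniform(-15, 15), expand=True)
    if rng.random() < 0.5:
        img = ImageOps.mirror(img)
    scale = rng.uniform(0.8, 1.2)
    img = img.resize((int(img.width * scale), int(img.height * scale)))

    # Color adjustments: brightness, contrast, saturation
    img = ImageEnhance.Brightness(img).enhance(rng.uniform(0.8, 1.2))
    img = ImageEnhance.Contrast(img).enhance(rng.uniform(0.8, 1.2))
    img = ImageEnhance.Color(img).enhance(rng.uniform(0.8, 1.2))

    # Additive Gaussian noise, mimicking interference in real-world photography
    arr = np.asarray(img).astype(np.float32)
    arr += rng.normal(0.0, 5.0, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# Usage: augmented = augment(Image.open("mural.jpg").convert("RGB"), np.random.default_rng(0))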
Classic evolutionary theory suggests that sexual dimorphism evolves primarily via sexual and fecundity selection. However, theory and evidence are beginning to accumulate suggesting that resource competition can drive the evolution of sexual dimorphism, via ecological character displacement between sexes. A key prediction of this hypothesis is that the extent of ecological divergence between sexes will be associated with the extent of sexual dimorphism.
As the stable isotope ratios of animal tissues provide a quantitative measure of various aspects of ecology, we carried out a meta-analysis examining associations between the extent of isotopic divergence between sexes and the extent of body size dimorphism. Our models demonstrate that large amounts of between-study variation in isotopic (ecological) divergence between sexes are non-random and may be associated with the traits of study subjects. We therefore completed meta-regressions to examine whether the extent of isotopic divergen..., We collated peer-reviewed literature available in the Web of Science Core Collection. The stable isotope literature is large, with the search term “stable isotope” returning ~76 500 studies at the time of writing. To constrain the search, we combined the following specific terms, using the default publication year range of 1900-2020, on 10/11/2020: Isotop* Nich; Isotop Nich* Male; Isotop* Nich* Female; Isotop* Nich* Male Female; Isotop* Nich* Sex Diff*; Isotop Nich* Dimorph; Isotop Dimorph*. Our searches returned 3489 studies, which we placed into a spreadsheet to highlight duplicates for manual removal. Removing duplicates resulted in 2807 studies for title and abstract screening. At this stage, we made the decision to constrain our analysis to the nitrogen and carbon stable isotope systems, due to the relatively small number of studies using other systems that were returned by our search terms. We also rejected studies during title and abstract screening if they did not use bulk stabl..., Microsoft Excel
Data Availability: All data supporting the findings of this study, including methodology examples, raw images and z-stack scans, statistical assessments as well as insect species information are all available through Edmond, the Open Access Data Repository of the Max Planck Society, or via the online version of the publication.
https://www.datainsightsmarket.com/privacy-policy
The global office modular partition systems market, valued at $640 million in 2025, is projected to experience robust growth, driven by a Compound Annual Growth Rate (CAGR) of 5.5% from 2025 to 2033. This expansion is fueled by several key factors. The increasing demand for flexible and adaptable workspace solutions in modern offices is a primary driver. Companies are increasingly prioritizing efficient space utilization and the ability to quickly reconfigure layouts to meet evolving business needs. Furthermore, the growing adoption of modular systems in various sectors, including healthcare (hospitals, clinics) and education (schools, universities), is significantly contributing to market growth. The ease of installation, cost-effectiveness compared to traditional construction, and sustainability benefits of modular partitions are also key selling points. While challenges such as the initial investment costs and potential limitations in design flexibility compared to traditional partitions exist, the overall market outlook remains positive. The market segmentation, with a diverse range of applications (office buildings, hospitals, schools) and material types (glass, metal), indicates a broad appeal and multiple avenues for future growth. Geographical expansion, particularly in developing economies experiencing rapid urbanization and infrastructure development, will likely fuel further market expansion in the coming years. The presence of established players like Steelcase and emerging companies such as Avanti Systems signifies a competitive yet dynamic market landscape. The market's growth trajectory indicates strong potential for investors and businesses involved in the manufacturing, distribution, and installation of office modular partition systems. Further growth is expected to be fueled by technological advancements leading to more sustainable and aesthetically pleasing partition solutions. The focus on creating healthier and more productive workspaces will drive adoption within the office sector, while the increasing demand for flexible learning environments will boost demand from the education sector. This sustained growth will likely see a shift towards more specialized and technologically advanced modular partition systems, offering features like improved acoustics, enhanced security, and integrated technology solutions. Competitive pressures will necessitate continuous innovation and adaptation to meet the evolving needs of a diverse customer base. This comprehensive report provides an in-depth analysis of the global office modular partition systems market, encompassing historical data (2019-2024), current estimates (2025), and future projections (2025-2033). The market is projected to reach several million units by 2033, driven by a range of factors explored within this report. This study offers crucial insights for businesses involved in manufacturing, distribution, and installation of modular partitions, as well as for investors seeking opportunities in this rapidly evolving sector. Key search terms like "modular office partitions," "office partition systems," "glass office partitions," "metal office partitions," and "modular wall systems" are strategically integrated throughout the report to maximize its online visibility.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Marine communities undergo rapid changes because of human-induced ecosystem pressures. The Baltic Sea pelagic food web has experienced several regime shifts during the past century, resulting in a system where competition between planktivorous mesopredators is assumed to be high. While the two clupeids sprat and herring reveal signs of competition, the stickleback population has increased drastically during the past decades. Here, we investigate diet overlap between the three dominating planktivorous fish in the Baltic Sea, utilizing DNA metabarcoding on the 18S rRNA gene and the COI gene, targeted qPCR, and microscopy. Our results show niche differentiation between clupeids and stickleback, and that rotifers play an important function in the niche partitioning of stickleback, as a resource that is used neither by the clupeids nor by other zooplanktivores. We further show that all the diet assessment methods used in this study are consistent, but DNA metabarcoding describes the plankton-fish link at the highest taxonomic resolution. This study suggests that rotifers and other understudied soft-bodied prey may have an important function in the pelagic food web and that the growing population of pelagic stickleback is supported by the unutilized feeding niche offered by the rotifers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Imputation of well log data is a common task in the field. However, a quick review of the literature reveals a lack of standardization when evaluating methods for this problem. The goal of this benchmark is to introduce a standard evaluation protocol for any imputation method for well log data.
In the proposed benchmark, three public datasets are used:
Here you can download all three datasets already preprocessed to be used with our implementation, found here.
There are six files for each fold partition for each dataset.
datasetname_fold_k_well_log_metadata_train.json: JSON file with general information about the slices of the training partition of fold k. Contains the total number of slices and the number of slices per well.
datasetname_fold_k_well_log_metadata_val.json: JSON file with general information about the slices of the validation partition of fold k. Contains the total number of slices and the number of slices per well.
datasetname_fold_k_well_log_slices_train.npy: .npy (NumPy) file, ready to be loaded, with the already-processed training slices of fold k. When loaded, it should have shape (total_slices, 256, number_of_logs).
datasetname_fold_k_well_log_slices_val.npy: .npy (NumPy) file, ready to be loaded, with the already-processed validation slices of fold k.
datasetname_fold_k_well_log_slices_meta_train.json: JSON file with per-slice information for all slices in the training partition of fold k. Seven data points are provided for each slice; the last four are discarded (they contain other information that was not used). The first three are, in order: the origin well name, the starting position of the slice in that well, and the end position of the slice in that well.
datasetname_fold_k_well_log_slices_meta_val.json: JSON file with per-slice information for all slices in the validation partition of fold k.
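For orientation, the minimal Python sketch below shows how one fold of a dataset could be loaded; "datasetname" and the fold index are placeholders, and the exact structure of the JSON metadata is assumed to be directly loadable with the standard json module. This is not part of the benchmark implementation.

# Minimal loading sketch for one fold (illustrative only).
import json
import numpy as np

dataset, k = "datasetname", 0  # placeholders for the actual dataset name and fold index

# Slice arrays: shape (total_slices, 256, number_of_logs)
train_slices = np.load(f"{dataset}_fold_{k}_well_log_slices_train.npy")
val_slices = np.load(f"{dataset}_fold_{k}_well_log_slices_val.npy")

# Per-slice metadata: the first three entries of each record are the origin
# well name and the start/end positions of the slice in that well.
with open(f"{dataset}_fold_{k}_well_log_slices_meta_train.json") as fh:
    train_meta = json.load(fh)

print(train_slices.shape, val_slices.shape)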
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate estimation of the change in crime over time is a critical first step toward better understanding of public safety in large urban environments. Bayesian hierarchical modeling is a natural way to study spatial variation in urban crime dynamics at the neighborhood level, since it facilitates principled “sharing of information” between spatially adjacent neighborhoods. Typically, however, cities contain many physical and social boundaries that may manifest as spatial discontinuities in crime patterns. In this situation, standard prior choices often yield overly smooth parameter estimates, which can ultimately produce mis-calibrated forecasts. To prevent potential over-smoothing, we introduce a prior that partitions the set of neighborhoods into several clusters and encourages spatial smoothness within each cluster. In terms of model implementation, conventional stochastic search techniques are computationally prohibitive, as they must traverse a combinatorially vast space of partitions. We introduce an ensemble optimization procedure that simultaneously identifies several high probability partitions by solving one optimization problem using a new local search strategy. We then use the identified partitions to estimate crime trends in Philadelphia between 2006 and 2017. On simulated and real data, our proposed method demonstrates good estimation and partition selection performance. Supplementary materials for this article are available online.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Warning: the ground truth is missing in certain of these datasets. This was fixed in version 1.0.1, which you should use instead.
Description: this corpus was designed as an experimental benchmark for a task of signed graph classification. It is composed of three datasets derived from external sources and adapted to our needs:
These data were used in [4] to train and assess various representation learning methods. The authors proposed Signed Graph2vec, a signed variant of Graph2vec, and WSGCN, a whole-graph variant of Signed Graph Convolutional Networks (SGCN), and used an aggregated version of Signed Network Embeddings (SiNE) as a baseline. The article provides more information regarding the properties of the datasets and how they were constructed.
Software: the software used to train the representation learning methods and classifiers is publicly available online: SWGE.
References:
Funding: part of this work was funded by a grant from the Provence-Alpes-Côte-d'Azur region (PACA, France) and the Nectar de Code company.
Citation: If you use this data or the associated source code, please cite article [4]:
@Article{Cecillon2024,
author = {Cécillon, Noé and Labatut, Vincent and Dufour, Richard and Arınık, Nejat},
title = {Whole-Graph Representation Learning For the Classification of Signed Networks},
journal = {IEEE Access},
year = {2024},
volume = {12},
pages = {151303-151316},
doi = {10.1109/ACCESS.2024.3472474},
}
This dataset consists of NetFlow records that capture the outbound network traffic of 8 commercial IoT devices and 5 non-IoT devices, collected over a period of 37 days in a lab at Ben-Gurion University of the Negev. The dataset was collected in order to develop a method for telecommunication providers to detect vulnerable IoT models behind home NATs. Each NetFlow record is labeled with the device model that produced it; for research reproducibility, each NetFlow is also allocated to either the "training" or "test" set, in accordance with the partitioning described in: Y. Meidan, V. Sachidananda, H. Peng, R. Sagron, Y. Elovici, and A. Shabtai, A novel approach for detecting vulnerable IoT devices connected behind a home NAT, Computers & Security, Volume 97, 2020, 101968, ISSN 0167-4048, https://doi.org/10.1016/j.cose.2020.101968 (http://www.sciencedirect.com/science/article/pii/S0167404820302418).
Please note: The dataset itself is free to use; however, users are requested to cite the above-mentioned paper, which describes in detail the research objectives as well as the data collection, preparation, and analysis.
Following is a brief description of the features used in this dataset.
# NetFlow features, used in the related paper for analysis
'FIRST_SWITCHED': System uptime at which the first packet of this flow was switched
'IN_BYTES': Incoming counter for the number of bytes associated with an IP flow
'IN_PKTS': Incoming counter for the number of packets associated with an IP flow
'IPV4_DST_ADDR': IPv4 destination address
'L4_DST_PORT': TCP/UDP destination port number
'L4_SRC_PORT': TCP/UDP source port number
'LAST_SWITCHED': System uptime at which the last packet of this flow was switched
'PROTOCOL': IP protocol byte (6: TCP, 17: UDP)
'TCP_FLAGS': Cumulative of all the TCP flags seen for this flow
'SRC_TOS': Type of Service byte setting when there is an incoming interface
# Features added by the authors
'IP': Prefix of the destination IP address, representing the network (without the host)
'DURATION': Time (seconds) between first/last packet switching
# Label
'device_model': The model of the device that produced the NetFlow record
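As a rough illustration, the sketch below selects the features listed above and splits the records by the provided train/test allocation. The file name ("netflows.csv") and the name of the allocation column ("partition") are assumptions for illustration only and should be replaced with the actual file and column names.

# Illustrative sketch: feature selection and train/test split (names are assumed).
import pandas as pd

FEATURES = ["FIRST_SWITCHED", "IN_BYTES", "IN_PKTS", "IPV4_DST_ADDR",
            "L4_DST_PORT", "L4_SRC_PORT", "LAST_SWITCHED", "PROTOCOL",
            "TCP_FLAGS", "SRC_TOS", "IP", "DURATION"]
LABEL = "device_model"

df = pd.read_csv("netflows.csv")  # hypothetical file name

train = df[df["partition"] == "training"]  # hypothetical allocation column
test = df[df["partition"] == "test"]

X_train, y_train = train[FEATURES], train[LABEL]
X_test, y_test = test[FEATURES], test[LABEL]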
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The HapMap (haplotype map) projects have produced valuable genetic resources in life science research communities, allowing researchers to investigate sequence variations and conduct genome-wide association study (GWAS) analyses. A typical HapMap project may require sequencing hundreds, even thousands, of individual lines or accessions within a species. Due to limitations in current sequencing technology, the genotype values for some accessions cannot be clearly called. Additionally, allelic heterozygosity can be very high in some lines, causing genetic and sometimes phenotypic segregation in their descendants. Genetic and phenotypic segregation degrades the original accession’s specificity and makes it difficult to distinguish one accession from another. Therefore, it is vitally important to determine and validate HapMap accessions before one conducts a GWAS analysis. However, to the best of our knowledge, there are no prior methodologies or tools that can readily distinguish or validate multiple accessions in a HapMap population. We devised a bioinformatics approach to distinguish multiple HapMap accessions using only a minimum number of genetic markers. First, we assign each candidate marker with a distinguishing score (DS), which measures its capability in distinguishing accessions. The DS score prioritizes those markers with higher percentages of homozygous genotypes (allele combinations), as they can be stably passed on to offspring. Next, we apply the “set-partitioning” concept to select optimal markers by recursively partitioning accession sets. Subsequently, we build a hierarchical decision tree in which a specific path represents the selected markers and the homogenous genotypes that can be used to distinguish one accession from others in the HapMap population. Based on these algorithms, we developed a web tool named MAD-HiDTree (Multiple Accession Distinguishment-Hierarchical Decision Tree), designed to analyze a user-input genotype matrix and construct a hierarchical decision tree. Using genetic marker data extracted from the Medicago truncatula HapMap population, we successfully constructed hierarchical decision trees by which the original 262 M. truncatula accessions could be efficiently distinguished. PCR experiments verified our proposed method, confirming that MAD-HiDTree can be used for the identification of a specific accession. MAD-HiDTree was developed in C/C++ in Linux. Both the source code and test data are publicly available at https://bioinfo.noble.org/MAD-HiDTree/.
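As a rough illustration of the intuition behind the distinguishing score described above, the toy sketch below ranks markers by their share of homozygous genotype calls across accessions. The published DS also quantifies how well a marker separates accessions, so this is neither the MAD-HiDTree formula nor its implementation, and the genotype data shown are hypothetical.

# Toy stand-in for the idea that markers with more homozygous calls are preferred,
# because homozygous allele combinations are stably passed on to offspring.
def homozygous_share(calls):
    known = [c for c in calls if c and "N" not in c]      # drop missing calls
    if not known:
        return 0.0
    return sum(c[0] == c[1] for c in known) / len(known)  # fraction of homozygous calls

# Hypothetical genotype matrix: marker -> calls across accessions
genotypes = {
    "marker_1": ["AA", "AA", "TT", "AT"],
    "marker_2": ["CG", "CC", "GG", "CG"],
}
ranked = sorted(genotypes, key=lambda m: homozygous_share(genotypes[m]), reverse=True)
print(ranked)  # markers ordered by this simple stand-in score, higher first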
We combined the COI sequence data with legacy multigene sequence data to create a new, taxon-rich phylogeny for the Amaurobioidinae. We used sequences for four loci that have been used in previous studies on the subfamily: two mitochondrial loci, COI (658bp) and ribosomal subunit 16S (16S, 410bp); and two nuclear loci, Histone H3 (H3, 327bp) and ribosomal subunit 28S (28S, 839bp). We complemented the Amaurobioidinae data with sequences from several non-amaurobioidine anyphaenids and two clubionids as outgroups. Sequence alignment was performed using the MAFFT (ver. 7.308) plugin in Geneious, allowing MAFFT to automatically select an appropriate alignment strategy based on the properties of each locus, or with the online MAFFT server (https://mafft.cbrc.jp), which consistently selected the L-INS-i algorithm. Finally, alignments of the four loci were concatenated to construct a 2234 bp multigene sequence matrix containing 692 taxa, with about 55% missing/gap data (“full” matrix henceforth). To ensure that excessive missing data did not affect the resulting topology, we also constructed a reduced matrix by removing additional COI-only specimens so that each species and morphotype was represented by just one or two specimens for which all loci were available (where possible). After realignment, this reduced matrix was 2235 bp long, included 167 taxa, and had about 22% missing/gap data (“reduced” matrix henceforth). Phylogenetic analyses under maximum likelihood, including model selection, were then conducted with IQ-TREE 2. We performed phylogenetic analyses on both concatenated matrices (the full matrix and the reduced matrix) and on each individual locus. For model selection, we provided an initial scheme that partitioned the matrix by locus, and further partitioned the protein-coding loci (COI and H3) by codon position. We used ModelFinder and searched for the best partition scheme, all in IQ-TREE. The best models (partitions) for the full dataset were: GTR+F+I+G4 (16S), GTR+F+I+I+R4 (28S), TVM+F+I+I+R2 (COI-1), TIM2+F+R4 (COI-2), GTR+F+R5 (COI-3), TVMe+G4 (H3-1-H3-2), SYM+G4 (H3-3); and for the reduced dataset: GTR+F+I+G4 (16S), GTR+F+I+G4: (28S), GTR+F+I+G4: (COI-2), GTR+F+I+G4: (COI-3), TVM+F+I+G4: (COI-1, H3-2), GTR+F+I+G4: (H3-1), GTR+F+I+G4: (H3-3). For each dataset, once the best models and partitions were defined, we executed 10 independent replicates of tree calculations followed by 1000 ultrafast bootstrap replicates, and the replicate reaching the maximum likelihood was chosen. Phylogenetic analyses under parsimony were made with TNT, under equal weights, using the “new technology” search with default values, asking for 10 independent hits to the minimal length, and submitting the resulting trees to a round of TBR branch swapping.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Phylogenetic analysis of the spiders of the genus Cybaeolus, with outgroups in the marronoid clade. Data from six DNA markers, analyzed with maximum likelihood and parsimony.
PHYLOGENETIC ANALYSIS
We obtained sequences from 26 samples of the three known species of Cybaeolus, and of five additional species of Hahniidae. To these, we added legacy sequences of Cybaeolus and of other genera of Hahniidae, as well as representatives of the remaining families in the marronoid clade. For the new sequences, the extraction and amplification of DNA was made in the Laboratory of Molecular Tools at Museo Argentino de Ciencias Naturales (MACN), from tissues preserved in absolute alcohol at -18ºC. We targeted the markers histone H3 (H3), cytochrome oxidase subunit I (CO1), 28S ribosomal RNA (28S) and 16S ribosomal RNA (16S), previously used to estimate relationships of marronoid spiders (Wheeler et al., 2017). Details of extraction, primers and PCR protocols are the same as in Magalhaes & Ramírez (2022). Sequencing was outsourced to Macrogen Inc., South Korea. The resulting chromatograms were analyzed individually to detect contaminated sequences or ambiguous portions. In addition to these sequences obtained in the laboratory, we combined our data with additional sequences from previous work (Wheeler et al., 2017; Rivera-Quiroz et al., 2020), using the markers mentioned above plus 12S ribosomal RNA (12S) and 18S ribosomal RNA (18S). For the CO1 marker, additional sequences obtained by the Arachnology Division at MACN and deposited in the BOLDSYSTEMS platform (https://www.boldsystems.org/) were also used. Sequences were aligned with MAFFT Online v.7.463 (Katoh & Standley, 2013), using the L-INS-I algorithm. See Table 1 for list of vouchers and sequence identifiers.
Maximum likelihood
For the maximum likelihood analyses we used the program IQ-TREE 2.2.0 (Minh et al., 2020), partitioning the data by marker, and selecting the best combination of partitions and evolution models by Bayesian information criterion (best fitting models were TPM2+I+G4 for H3, GTR+F+I+G4 for 18S, GTR+F+I+G4 for 16S and 12S together, GTR+F+I+G4 for CO1, and GTR+F+I+G4 for 28S). Since the relationships of outgroup taxa in the resulting trees were slightly different to that found in recent phylogenomic studies, we used the study of Gorneau et al. (2023) based on ultraconserved elements as a backbone topology to constrain our tree search, considering only the taxa in common with our analysis (see supplementary Fig. S1); this means that all the rest of the taxa are free to move anywhere during tree search. Support for groups (branches) was estimated by 1000 cycles of ultrafast bootstrapping. Ten independent runs were performed; of those, six converged into nearly identical log likelihood values (-57417.7725 to -57417.9604) and identical topologies; the tree with top-ranking log likelihood is presented in Results, after collapsing branches with bootstrap below 0.5. To estimate the support of an alternative topology with Cybaeolus as sister to the rest of the hahniids, we used TNT 1.6 (Goloboff & Morales, 2023) to modify the optimal tree placing Cybaeolus in such position, and asked for the frequency of the branch of interest (all hahniids except Cybaeolus) in the 1000 bootstrapped trees previously saved by IQTREE.
Ancestral character states for the arrangement of spinnerets (grouped; separated in a transversal line) were estimated by maximum likelihood on the optimal tree, using the R packages phytools and ape, under the models ER and ARD, and the best fitting model selected by the Akaike information criterion.
Parsimony
For the parsimony analyses we used TNT 1.6. For the equal weights analysis, a heuristic search was made using a driven search with the default parameters of the “new technologies”, aiming for 10 independent hits to minimum length. The resulting trees were then submitted to an additional round of tree-bisection reconnection (TBR) branch swapping. These results were compared to a simpler search strategy of 300 random addition sequences, each followed by TBR, which produced 20 hits to minimal length. As both strategies reached the same trees with multiple independent hits, it is likely that the optimal trees were found. Finally, the strict consensus of all the optimal trees was obtained, and on this consensus the support values were calculated by means of 1000 bootstrap pseudoreplicates.
Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado’s Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums. AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).
Data
The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program.
The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:
Dataset                    Training    Dev    Test    Totals
BOLT DF MT                     1061    133     133      1327
Broadcast conversation          214      0       0       214
Weblog and WSJ                    0    100     100       200
BOLT DF English                6455    210     229      6894
DEFT DF English               19558      0       0     19558
Guidelines AMRs                 819      0       0       819
2009 Open MT                    204      0       0       204
Proxy reports                  6603    826     823      8252
Weblog                          866      0       0       866
Xinhua MT                       741     99      86       926
Totals                        36521   1368    1371     39260
For those interested in utilizing a standard/community partition for AMR research (for instance, in development of semantic parsers), data in the “split” directory contains 39,260 AMRs split roughly 93%/3.5%/3.5% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The “unsplit” directory contains the same 39,260 AMRs with no train/dev/test partition.
This data publication includes results and code from a systematic review of near-term ecological forecasting literature. The study had two primary goals: (1) analyze the state of near-term ecological forecasting literature, and (2) compare forecast skill across ecosystems and variables. We began by conducting a Web of Science search for “forecast*” in the title, abstract, and keywords of all papers published in ecological journals, then screened all papers from this search to identify near-term ecological forecasts. We defined a near-term ecological forecast as future predictions of community, population, or biogeochemical variables ≤ 10 years from the forecast date. To more broadly survey the literature, we then searched all papers that cited or were cited by the near-term ecological forecasts we identified. We performed an in-depth review of all near-term ecological forecasting papers identified through this search process, and recorded forecast skill data for all papers that reported R or R2. Our results indicate that the rate of publication of near-term ecological forecasts is increasing over time and the field is becoming increasingly open and automated. Across published forecasts, we find that forecast skill decreases in predictable patterns and these patterns differ between forecast variables. This data publication includes three products from this analysis: (1) a database of all papers identified in the two searches, including our assessment of whether they included an ecological focal variable, included a forecast, and whether the forecast was near-term (≤10 years), (2) a matrix of all data collected on the near-term ecological forecasts we identified, and (3) a database of R2 values for papers that reported R or R2.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
While the escalating impacts of climate change and other anthropogenic pressures on coral reefs are well documented at the level of the coral community, studies of species-specific trends are less common, owing mostly to the difficulties and uncertainties in delineating coral species. It has also become clear that traditional coral taxonomy that is largely based on skeletal morphology has underestimated the diversity of many coral families. Here we use targeted enrichment methods to sequence 2476 ultraconserved and exonic loci to investigate the relationship between populations of Fungia fungites from Okinawa, Japan, where this species reproduces by brooding (i.e., internal fertilization), and Papua New Guinea and Australia, where it reproduces by broadcast-spawning (i.e., external fertilization). We also analyzed the relationships between populations of additional fungiid species (Herpolitha limax and Ctenactis spp.) that reproduce only by broadcast-spawning. Our phylogenetic and species delimitation analyses reveal strong biogeographic structuring in both Fungia fungites and Herpolitha limax, consistent with cryptic speciation in Okinawa in both species and additionally in the Red Sea for H. limax. Using both ultraconserved elements and exon data, alongside mitochondrial data captured in off-target reads, we demonstrate that Ctenactis, a genus consisting of three nominal morphospecies, is not a natural group. Our results highlight the need for a comprehensive taxonomic and systematic revision of the coral family Fungiidae. The work presented here demonstrates that sequence data generated by the application of targeted capture methods can provide objective criteria by which to test hypotheses based on morphological and/or life history data.
Software/equipment used to create/collect the data: De-multiplexed Illumina reads were trimmed using the illumiprocessor wrapper program for trimmomatic with default values and assembled into contigs using SPAdes v. 3.10. The trimmed reads were processed using the PHYLUCE program workflow as outlined in the online tutorial http://phyluce.readthedocs.io/en/latest/tutorial-one.html with slight modifications.
Software/equipment used to manipulate/analyse the data: For UCE loci the Sliding-Window Site Characteristics (SWSC) method was used for partitioning within loci between UCE ‘core’ and ‘flanking’ regions to account for differences in site variability. Exon loci were assigned a separate partition for each locus. The SWSC-UCE and exon partitioning schemes were combined using Geneious Prime V2019.2.1. The best fitting partitioning scheme for SWSC-UCE/exon partitions was defined using PartitionFinder 2 (PF2) with the RAxML option. The 75% matrix alignment was analyzed with maximum likelihood (ML) using IQtree v2.0 and with Bayesian inference using Exabayes.
Species tree inference was conducted using ASTRAL III. Separate IQtree analyses with 1000 ultrafast bootstrap replicates were run for each of the 220 loci from the 75% matrix alignment. The resulting gene trees with bootstrap support were concatenated into a single file, and branches with low support (<30%) were removed using the nw_ed function in the Newick utilities program. Unexpectedly long branches were also removed using TreeShrink, as they are likely to be erroneous.
The COI tree was obtained from the off-target reads of our target capture samples using phyluce_assembly_match_contigs_to_barcodes. The ITS sequence from Ctenactis crassa (NCBI GenBank accession: EU149814) and the COI gene from a complete Favites abdita mitogenome (KY094479, 1542 bp), the closest complete mitogenome currently available, were used as template barcodes.
All available COI and ITS genes for Fungiidae were downloaded from GenBank. A new alignment was created by concatenating and aligning the UCE/exon dataset, the extracted mitochondrial and nuclear ITS regions from off-target reads, and the GenBank sequences using Geneious Prime. An ML tree was generated from all data combined using IQtree, partitioning the mitochondrial data and ITS regions separately from the UCE/exon partitions.