Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Blockchain data query: Parallel NFT AMM Volume Competition
In the past, the majority of data analysis use cases were addressed by aggregating relational data. Over the past few years, a trend called “Big Data” has emerged, with several implications for the field of data analysis. Compared to previous applications, much larger data sets are analyzed using more elaborate and diverse analysis methods such as information extraction techniques, data mining algorithms, and machine learning methods. At the same time, analysis applications include data sets with little or no structure at all. This evolution changes the requirements on data processing systems. Due to the growing size of data sets and the increasing computational complexity of advanced analysis methods, data must be processed in a massively parallel fashion. The large number and diversity of data analysis techniques, as well as the lack of data structure, necessitate the use of user-defined functions and data types. Many traditional database systems are not flexible enough to satisfy these requirements. Hence, there is a need for programming abstractions to define and efficiently execute complex parallel data analysis programs that support custom user-defined operations.
The success of the SQL query language has shown the advantages of declarative query specification, such as potential for optimization and ease of use. Today, most relational database management systems feature a query optimizer that compiles declarative queries into physical execution plans. Cost-based optimizers choose the plan with the least estimated cost from billions of plan candidates. However, traditional optimization techniques cannot be readily integrated into systems that aim to support novel data analysis use cases. For example, the use of user-defined functions (UDFs) can significantly limit the optimization potential of data analysis programs. Furthermore, a lack of detailed data statistics is common when large amounts of unstructured data are analyzed. This leads to imprecise optimizer cost estimates, which can cause sub-optimal plan choices.
In this thesis we address three challenges that arise in the context of specifying and optimizing data analysis programs. First, we propose a parallel programming model with declarative properties to specify data analysis tasks as data flow programs. In this model, data processing operators are composed of a system-provided second-order function and a user-defined first-order function. A cost-based optimizer compiles data flow programs specified in this abstraction into parallel data flows. The optimizer borrows techniques from relational optimizers and ports them to the domain of general-purpose parallel programming models.
Second, we propose an approach to enhance the optimization of data flow programs that include UDF operators with unknown semantics. We identify operator properties and conditions to reorder neighboring UDF operators without changing the semantics of the program. We show how to automatically extract these properties from UDF operators by leveraging static code analysis techniques. Our approach is able to emulate relational optimizations such as filter and join reordering and holistic aggregation push-down while not being limited to relational operators.
Finally, we analyze the impact of changing execution conditions, such as varying predicate selectivities and memory budgets, on the performance of relational query plans. We identify plan patterns that cause significantly varying execution performance under changing execution conditions. Plans that include such risky patterns are prone to cause problems in the presence of imprecise optimizer estimates. Based on our findings, we introduce an approach to avoid risky plan choices. Moreover, we present a method to assess the risk of a query execution plan using a machine-learned prediction model. Experiments show that the prediction model outperforms risk predictions that are computed from optimizer estimates.
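The operator model described in this abstract, a system-provided second-order function paired with a user-defined first-order function, can be illustrated with a minimal single-process sketch. This is not the thesis's actual API: the names `map_op`, `reduce_op`, and all UDFs and record fields below are hypothetical, and no parallel execution or optimizer is modeled. The sketch also shows one case where two neighboring UDF operators can be swapped without changing the result, because one only reads a field that the other never writes.

```python
from collections import defaultdict

def map_op(udf, records):
    """Second-order Map: calls the first-order UDF once per record.
    The UDF returns a list of zero or more output records."""
    out = []
    for record in records:
        out.extend(udf(record))
    return out

def reduce_op(udf, key, records):
    """Second-order Reduce: groups records by key, then calls the UDF once per group."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return [udf(k, group) for k, group in groups.items()]

# Hypothetical first-order UDFs.
def keep_large(record):
    # Reads only "amount"; emits the record unchanged or drops it.
    return [record] if record["amount"] >= 10 else []

def add_fee(record):
    # Writes only the new field "fee"; never modifies "amount".
    return [{**record, "fee": record["amount"] // 10}]

def total_per_customer(customer, group):
    return {"customer": customer, "total": sum(r["amount"] for r in group)}

orders = [
    {"customer": "a", "amount": 5},
    {"customer": "a", "amount": 20},
    {"customer": "b", "amount": 30},
    {"customer": "a", "amount": 15},
]

# Two orderings of the same neighboring Map operators.
plan_a = map_op(add_fee, map_op(keep_large, orders))
plan_b = map_op(keep_large, map_op(add_fee, orders))

totals = reduce_op(total_per_customer, lambda r: r["customer"], plan_a)
```

Both plans produce the same records, but `plan_a` applies the filter first and so calls `add_fee` on fewer records. An optimizer that knows the fields each UDF reads and writes could make that choice automatically.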
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Blockchain data query: Parallel-prime-token-holder
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the English-Dutch Bilingual Parallel Corpora dataset for the Management domain. This comprehensive dataset contains a large collection of bilingual sentence pairs, carefully translated between English and Dutch, designed to support the development of management-specific language models, natural language processing systems, and machine translation engines.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the English-Bengali Bilingual Parallel Corpora dataset for the Management domain. This comprehensive dataset contains a large collection of bilingual sentence pairs, carefully translated between English and Bengali, designed to support the development of management-specific language models, natural language processing systems, and machine translation engines.
These files can be used to re-create the results in the thesis manuscript "Polyhedral Optimizations of RNA-RNA Interaction Computations". The AlphaZ tool (http://www.cs.colostate.edu/AlphaZ/wiki/doku.php) is needed to produce results from the .ab and .cs files. Users unfamiliar with AlphaZ can use the C code instead.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the English-Bulgarian Bilingual Parallel Corpora dataset for the Management domain. This comprehensive dataset contains a large collection of bilingual sentence pairs, carefully translated between English and Bulgarian, designed to support the development of management-specific language models, natural language processing systems, and machine translation engines.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SauLTC (Saudi Learner Translator Corpus) is a uni-directional POS-tagged English-Arabic parallel and sentence-aligned learner corpus. This multi-version corpus features linguistic annotation, complemented with an interface for monolingual or bilingual querying of the data.
SauLTC was initiated as a source of data for translation studies and learner corpora research. The corpus can be utilized in the examination of linguistic properties of translations, translation quality assessment, variation in translation, translational competence, and cross-linguistic transference. For example, one of SauLTC’s functionalities allows the examination of what a trainee translator produces on her own (draft translation) and the effect of an expert translator’s feedback (final translation submission).
The translation program at PNU includes a four-credit hour graduation project course, where students, at the end of the program, are required to translate a booklet or book chapter from English into Arabic. The course is designed to help students demonstrate their translation competence and to apply all the skills they have acquired to translating longer texts (6000+). The typical arrangement of the course is as follows: The student selects a text she prefers and obtains her instructor’s approval. The source texts of SauLTC include chapters or booklet extracts from fiction, self-help, biography, history, health, psychology, religion, culture, management, or science. Then she submits a draft translation to her instructor, who reads the translation and meets with the student to discuss her output. The discussion highlights both the strengths and weaknesses of the translation and gives the student the chance to justify her linguistic choices when translating the text. Based on the instructor’s feedback, the student makes the necessary changes in the translation and once again submits it to the instructor (final translation). The translation students and instructors give their consent to include their translation projects (the source texts, the first drafts, and the final drafts of their translations) in SauLTC. All three documents are collected in one folder under the student’s name in addition to the student’s profile information.
The corpus is currently in its first version of two million words with a proposed plan to include more translated texts from students at other universities in Saudi Arabia. Each student’s contribution includes a learner profile, the source text, the draft translation, and the post-instructor feedback final submission, all of which are enriched with searchable profile metadata.
The Auto-aligner at WordFast Anywhere was utilized for the automatic parallelization of the source text, draft text, and final submission text. This automated process was followed by a manual verification conducted by professional translators.
The query interface supports lexical and PoS search for both sources and targets and returns sentences with the query item along with their targets/sources. The query results can be filtered by several metadata fields, including the translator’s age, translation assessment grade, year, and source text genre.
https://spdx.org/licenses/CC0-1.0.html
The bowerbirds in New Guinea and Australia include species that build the largest and perhaps most elaborately decorated constructions outside of humans. The males use these courtship bowers, along with their displays, to attract females. In these species, the mating system is polygynous and the females alone incubate and feed the nestlings. The bowerbirds also include 10 species of the socially monogamous catbirds in which the male participates in most aspects of raising the young. How the bower-building behavior evolved has remained poorly understood, as no comprehensive phylogeny exists for the family. It has been assumed that the monogamous catbird clade is sister to all polygynous species. We here test this hypothesis using a newly developed pipeline for obtaining homologous alignments of thousands of exonic and intronic regions from genomic data to build a phylogeny. Our well-supported species tree shows that the polygynous, bower-building species are not monophyletic. The result suggests either that bower-building behavior is an ancestral condition in the family that was secondarily lost in the catbirds, or that it has arisen in parallel in two lineages of bowerbirds. We favor the latter hypothesis based on an ancestral character reconstruction showing that polygyny but not bower-building is ancestral in bowerbirds, and on the observation that Scenopoeetes dentirostris, the sister species to one of the bower-building clades, does not build a proper bower but constructs a court for male display. This species is also sexually monomorphic in plumage despite having a polygynous mating system. We argue that the relatively stable tropical and subtropical forest environment in combination with low predator pressure and rich food access (mostly fruit) facilitated the evolution of these unique life-history traits.
Methods
This is supplementary material to the manuscript "Parallel evolution of bower-building behavior and polygyny in two groups of bowerbirds suggested by phylogenomics". We used the Birdscanner pipeline (available at github.com/Naturhistoriska/birdscanner.git) to obtain homologous alignments of 5653 exonic and 7020 intronic regions from whole-genome sequence data. The pipeline uses probabilistic queries based on hidden Markov models to probe the mapped bowerbird genomes and find where each query has its best fit. For each query and taxon we obtained genomic coordinates for the best hits, which were then ranked according to their “sequence E-values”, i.e. the expected number of false positives (non-homologous sequences) that scored this well or better. For each query and taxon the sequences for the hits with the lowest values were parsed out using the genomic coordinates. These were then aligned in separate files for exonic and intronic loci. Poorly aligned sequences were identified, based on a calculated distance matrix using OD-Seq (github.com/PeterJehl/OD-Seq), and excluded from further analyses. We also checked the alignments manually and removed those that included non-homologous sequences for some taxa (indicated by an extreme proportion of variable positions in the alignment) and those that contained no phylogenetic information. Individual trees were constructed using IQ-TREE, which automatically selects the best substitution model for each locus alignment. We used ASTRAL-III to construct species trees from the gene trees, both for the exonic and intronic loci separately and for all loci combined. ASTRAL estimates a species tree given a set of unrooted gene trees, and branch support is calculated using local posterior probabilities. We assembled mitochondrial genomes from the resequenced data for each individual using MITObim, and used 12 of the 13 protein-coding genes to infer the phylogenetic tree. The aligned mitochondrial data set used in the analyses consists of 10,560 bp (3,520 codons). The phylogenetic analysis of the mitogenomic data set was performed with MEGA X. We estimated the maximum-likelihood tree for the mitochondrial data using 100 bootstrap replicates to assess the reliability of the branches. The data set was analyzed both with all codon positions present and with the third codon positions excluded.
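The hit-selection step described in the methods (for each query and taxon, keep the hit with the lowest sequence E-value and use its genomic coordinates) can be sketched as follows. This is not the Birdscanner implementation; the function name, the record layout, and every value in the example are hypothetical placeholders chosen only to illustrate the ranking rule.

```python
def best_hits(hits):
    """Return, for each (query, taxon) pair, the hit with the lowest E-value."""
    best = {}
    for hit in hits:
        key = (hit["query"], hit["taxon"])
        # A lower E-value means fewer expected false positives at this score.
        if key not in best or hit["evalue"] < best[key]["evalue"]:
            best[key] = hit
    return best

# Hypothetical hits: two candidate regions in taxon_A, one in taxon_B.
hits = [
    {"query": "exon_0001", "taxon": "taxon_A", "start": 9000, "end": 9300, "evalue": 1e-5},
    {"query": "exon_0001", "taxon": "taxon_A", "start": 100, "end": 400, "evalue": 1e-30},
    {"query": "exon_0001", "taxon": "taxon_B", "start": 50, "end": 350, "evalue": 1e-12},
]

selected = best_hits(hits)
```

The coordinates of each selected hit (`start`, `end`) would then be used to extract the sequence for alignment.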
https://www.futurebeeai.com/policies/ai-data-license-agreement
The English-Portuguese Gaming Parallel Corpora is a curated bilingual dataset designed to support game localization, machine translation, and language model training for the Gaming industry. It consists of over 50,000 sentence pairs, professionally translated between English and Portuguese, capturing the linguistic and cultural depth of gaming content.
EuroPIRQ: European Parallel Information Retrieval Queries
Dataset Details
The EuroPIRQ retrieval dataset is a multilingual collection designed for evaluating retrieval and cross-lingual retrieval tasks. The dataset contains 10,000 parallel passages and 100 parallel (synthetic) queries in three languages: English🇬🇧, Portuguese🇵🇹, and Finnish🇫🇮, constructed from the European Union's DGT-Acquis corpus.
Languages: English (en), Portuguese (pt), Finnish (fi)
Format: JSONL… See the full description on the dataset page: https://huggingface.co/datasets/eherra/EuroPIRQ-retrieval.
This is a test of my DPO dataset for Irish-English translations. Raw data origin: https://www.gaois.ie/en/corpora/parallel?Query=Apple&Language=en&SearchMode=exact&PerPage=50. The reference-free COMET XL model Unbabel/wmt23-cometkiwi-da-xl (which has been trained to assess Irish) was used to score accepted/rejected translations. GPT-4 was used to generate translations to compare with human translations of Irish legislation (which by law must have both an Irish and an English copy).
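One way the accepted/rejected labels could be derived from quality scores is sketched below. It assumes each source sentence has two candidate translations (human and GPT-4) that already carry a score from a reference-free metric. The scoring model itself is not called here, and the function name, the field names `prompt`, `chosen`, and `rejected`, and the scores are all hypothetical placeholders, not the dataset's actual schema or values.

```python
def build_preference_pair(source, candidates):
    """Build one DPO-style record from (translation, score) candidates.

    The highest-scoring candidate becomes "chosen" and the lowest-scoring
    one becomes "rejected"; higher scores are assumed to mean better quality.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return {
        "prompt": source,
        "chosen": ranked[0][0],
        "rejected": ranked[-1][0],
    }

# Placeholder texts and illustrative scores only.
pair = build_preference_pair(
    "<source sentence>",
    [("<GPT-4 translation>", 0.64), ("<human translation>", 0.81)],
)
```

Pairs whose two scores are nearly equal carry little preference signal, so a minimum score gap could be applied before keeping a pair; whether the dataset does this is not stated.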
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Parameters used in performance evaluation for synthetic data.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the English-German Bilingual Parallel Corpora dataset for the Management domain. This comprehensive dataset contains a large collection of bilingual sentence pairs, carefully translated between English and German, designed to support the development of management-specific language models, natural language processing systems, and machine translation engines.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains a Nepali-English parallel speech corpus designed to support research in speech recognition, speech translation, and multilingual language processing.
It includes audio pairs and their corresponding transcriptions in both Nepali and English, carefully aligned.
Total Nepali Audio Duration: ~125 minutes
Total English Audio Duration: ~112 minutes
Number of Speakers: 4 (covering both male and female voices)
Content Type: Conversational and general-purpose sentences covering common expressions when one is travelling through the country, short queries, and contextually varied speech samples.
Format: Each audio file is accompanied by a corresponding text transcription and metadata linking the Nepali and English versions.
Purpose: Created to assist researchers, developers, and linguists working on Nepali-English speech technologies, such as automatic speech recognition (ASR), speech-to-text (STT), and speech translation systems.
All data has been manually curated and verified to ensure clarity, alignment accuracy, and quality balance across both languages.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the English-Malayalam Bilingual Parallel Corpora dataset for the Management domain. This comprehensive dataset contains a large collection of bilingual sentence pairs, carefully translated between English and Malayalam, designed to support the development of management-specific language models, natural language processing systems, and machine translation engines.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the English-Arabic Bilingual Parallel Corpora dataset for the Management domain. This comprehensive dataset contains a large collection of bilingual sentence pairs, carefully translated between English and Arabic, designed to support the development of management-specific language models, natural language processing systems, and machine translation engines.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Definitions of common notations.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As large-scale deep learning models become integral to scientific discovery and engineering applications, it is increasingly important to teach students how to implement them efficiently and at scale. This section presents a coding assignment that focuses on optimizing the Softmax function, a central component of many deep learning models, including attention mechanisms in transformer models. The assignment is designed for an undergraduate-level Distributed Computing course and tailored to students with little or no prior experience in machine learning. By integrating modern AI workloads into an HPC curriculum, this work equips students with both the conceptual understanding and practical experience needed to build scalable solutions in scientific computing.
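For readers unfamiliar with the function, the sketch below gives a serial baseline of Softmax in its standard numerically stable form, where the maximum is subtracted before exponentiating. The description does not say which optimizations or parallelization strategies the assignment covers, so this is only an assumed starting point, not the assignment's reference solution.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a non-empty list of floats.

    Subtracting the maximum leaves the result unchanged mathematically
    but keeps every exponent at or below zero, which avoids overflow.
    """
    m = max(xs)                            # pass 1: global maximum
    exps = [math.exp(x - m) for x in xs]   # pass 2: shifted exponentials
    total = sum(exps)                      # pass 3: normalizer
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
large = softmax([1000.0, 1000.0])  # math.exp(1000.0) alone would overflow
```

The maximum and the sum are both reductions over the whole input, which is what makes Softmax a natural exercise for parallel and distributed implementations.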
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
(:unav)...........................................