100+ datasets found

DNA sequence alignment datasets based on NW algorithm
ieee-dataport.org
Updated May 18, 2022
Cite
Amr Rashed (2022). DNA sequence alignment datasets based on NW algorithm [Dataset]. https://ieee-dataport.org/documents/dna-sequence-alignment-datasets-based-nw-algorithm
Dataset updated
May 18, 2022
Authors
Amr Rashed
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
namely
Application of de Novo Sequencing to Large-Scale Complex Proteomics Data...
acs.figshare.com
xlsx
Updated Jun 4, 2023
Cite
Arun Devabhaktuni; Joshua E. Elias (2023). Application of de Novo Sequencing to Large-Scale Complex Proteomics Data Sets [Dataset]. http://doi.org/10.1021/acs.jproteome.5b00861.s003
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jproteome.5b00861.s003
Dataset updated
Jun 4, 2023
Dataset provided by
ACS Publications
Authors
Arun Devabhaktuni; Joshua E. Elias
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dependent on concise, predefined protein sequence databases, traditional search algorithms perform poorly when analyzing mass spectra derived from wholly uncharacterized protein products. Conversely, de novo peptide sequencing algorithms can interpret mass spectra without relying on reference databases. However, such algorithms have been difficult to apply to complex protein mixtures, in part due to a lack of methods for automatically validating de novo sequencing results. Here, we present novel metrics for benchmarking de novo sequencing algorithm performance on large-scale proteomics data sets and present a method for accurately calibrating false discovery rates on de novo results. We also present a novel algorithm (LADS) that leverages experimentally disambiguated fragmentation spectra to boost sequencing accuracy and sensitivity. LADS improves sequencing accuracy on longer peptides relative to that of other algorithms and improves discriminability of correct and incorrect sequences. Using these advancements, we demonstrate accurate de novo identification of peptide sequences not identifiable using database search-based approaches.
Data_Sheet_1_SLDMS: A Tool for Calculating the Overlapping Regions of...
frontiersin.figshare.com
zip
Updated Jun 4, 2023
Cite
Yu Chen; DongLiang You; TianJiao Zhang; GuoHua Wang (2023). Data_Sheet_1_SLDMS: A Tool for Calculating the Overlapping Regions of Sequences.zip [Dataset]. http://doi.org/10.3389/fpls.2021.813036.s001
Available download formats: zip
Unique identifier
https://doi.org/10.3389/fpls.2021.813036.s001
Dataset updated
Jun 4, 2023
Dataset provided by
Frontiers
Authors
Yu Chen; DongLiang You; TianJiao Zhang; GuoHua Wang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In genome assembly, contig assembly is one of the most important steps. It requires processing the overlapping regions of a large number of DNA sequences, and this calculation usually takes a lot of time. The running time of a contig assembly algorithm is therefore an important indicator of its quality. Existing methods for processing overlapping regions of sequences are too slow. We propose SLDMS, a method for processing sequence overlapping regions based on a suffix array and a monotonic stack, which effectively improves the efficiency of this step. The running time of SLDMS is much less than that of Canu and Flye when computing sequence overlap intervals, and on some data where most sequencing errors occur at both ends of the reads, the running time of SLDMS is only about one-tenth that of the other two methods.
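To make "overlapping region" concrete, the following is a brute-force sketch of the exact suffix-prefix overlap between two reads. It is not SLDMS: the description above says SLDMS relies on a suffix array and a monotonic stack, whereas this illustration simply tries every candidate length. The function name `longest_overlap`, the `min_len` parameter, and the example reads are assumptions made for illustration.

```python
def longest_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Brute force: tries every candidate length from longest to shortest.
    Returns 0 when no exact overlap of at least `min_len` characters exists.
    """
    max_possible = min(len(a), len(b))
    for k in range(max_possible, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


# Hypothetical reads, chosen so that consecutive reads overlap.
reads = ["ACGTTGCA", "TGCAGGTC", "GGTCAAAT"]
for left, right in zip(reads, reads[1:]):
    print(left, right, longest_overlap(left, right))
```

Comparing every pair of reads this way scales quadratically with the number of reads, which is the kind of cost that index-based methods aim to avoid.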
sequenceMiner algorithm
catalog.data.gov
s.cnmilf.com
+1more
Updated Apr 11, 2025
Cite
Dashlink (2025). sequenceMiner algorithm [Dataset]. https://catalog.data.gov/dataset/sequenceminer-algorithm
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Detecting and describing anomalies in large repositories of discrete symbol sequences. sequenceMiner has been open-sourced! Download the file below to try it out. sequenceMiner was developed to address the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. sequenceMiner works by performing unsupervised clustering (grouping) of sequences using the normalized longest common subsequence (LCS) as a similarity measure, followed by a detailed analysis of outliers to detect anomalies. sequenceMiner utilizes a new hybrid algorithm for computing the LCS that has been shown to outperform existing algorithms by a factor of five. sequenceMiner also includes new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed to be an outlier. This provides analysts with a coherent description of the anomalies identified in the sequence, and why they differ from more “normal” sequences. sequenceMiner was developed with funding from the NASA Aviation Safety Program. In the commercial aviation domain, sequenceMiner can be used to discover atypical behavior in airline performance data that may have possible operational significance for safety analysts. Because the sequenceMiner approach is general and not restricted to any one domain, these algorithms can also be applied in other fields where anomaly detection and event mining would be useful.
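As a rough illustration of the similarity measure mentioned above, the sketch below computes a normalized LCS score with the textbook dynamic programme. It is not sequenceMiner's hybrid LCS algorithm, and the normalization (LCS length divided by the geometric mean of the two sequence lengths) is an assumption, since the description does not state the exact formula. The names `lcs_length` and `nlcs` are illustrative.

```python
import math
from typing import Sequence


def lcs_length(x: Sequence, y: Sequence) -> int:
    """Length of the longest common subsequence (textbook DP, two rows)."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs(x: Sequence, y: Sequence) -> float:
    """LCS length over the geometric mean of the lengths (assumed normalization)."""
    if not x or not y:
        return 0.0
    return lcs_length(x, y) / math.sqrt(len(x) * len(y))


# Hypothetical switch-event sequences encoded as symbols.
print(nlcs("ABCBDAB", "BDCABA"))
```

A score of 1.0 means the sequences are identical under this measure and 0.0 means they share no symbols in order; clustering would then operate on a distance such as `1 - nlcs(x, y)`.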
Tiling Results of 4 AAV sequencing datasets produced on a Pacific...
zenodo.org
tar
Updated Jul 23, 2025
Cite
Robert Bruccoleri (2025). Tiling Results of 4 AAV sequencing datasets produced on a Pacific Biosciences Sequel II Sequencer. [Dataset]. http://doi.org/10.5281/zenodo.16064305
Available download formats: tar
Unique identifier
https://doi.org/10.5281/zenodo.16064305
Dataset updated
Jul 23, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Robert Bruccoleri
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This record contains the tiling algorithm output for the four PacBio AAV datasets analyzed in our paper, "The Tiling Algorithm -- A general method for structural characterization of accurate long DNA sequence reads: application to AAV genome sequences", to be submitted to PLOS ONE.
Data from: THE INTEGER SEQUENCE A348960
zenodo.org
data.niaid.nih.gov
pdf
Updated Jul 17, 2024
Cite
Paul F. Marrero Romero (2024). THE INTEGER SEQUENCE A348960 [Dataset]. http://doi.org/10.5281/zenodo.5722327
Available download formats: pdf
Unique identifier
https://doi.org/10.5281/zenodo.5722327
Dataset updated
Jul 17, 2024
Dataset provided by
On-Line Encyclopedia of Integer Sequences (oeis.org)
Authors
Paul F. Marrero Romero
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a sequence of integers that I registered in the On-Line Encyclopedia of Integer Sequences (OEIS) under the code A348960. In this report, the sequence is evaluated for integers n in the range [0,70]. An algorithm in MATHEMATICA for the sequence under study is also presented.

The empirically discovered sequence is as follows: \(a(n) = \lfloor \log(\pi n!) \rfloor\)
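The formula can be evaluated directly. The sketch below is a Python stand-in for the MATHEMATICA algorithm mentioned in the description (which is not reproduced here), and it assumes that log denotes the natural logarithm. The sum of logarithms is used instead of the product so that large factorials never have to be converted to floating point.

```python
import math


def a(n: int) -> int:
    """floor(log(pi * n!)), with log assumed to be the natural logarithm."""
    # log(pi * n!) = log(pi) + log(n!); math.log accepts arbitrarily large ints.
    return math.floor(math.log(math.pi) + math.log(math.factorial(n)))


# Evaluate over the range used in the report, n = 0..70.
terms = [a(n) for n in range(71)]
print(terms[:10])
```

If the registered sequence uses a different logarithm base, only the body of `a` needs to change.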
Anomaly Detection in Sequences
s.cnmilf.com
datasets.ai
+3more
Updated Apr 11, 2025
Cite
Dashlink (2025). Anomaly Detection in Sequences [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/anomaly-detection-in-sequences
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
We present a set of novel algorithms, which we call sequenceMiner, that detect and characterize anomalies in large sets of high-dimensional symbol sequences that arise from recordings of switch sensors in the cockpits of commercial airliners. While the algorithms we present are general and domain-independent, we focus on a specific problem that is critical to determining system-wide health of a fleet of aircraft. The approach taken uses unsupervised clustering of sequences using the normalized length of the longest common subsequence (nLCS) as a similarity measure, followed by a detailed analysis of outliers to detect anomalies. In this method, an outlier sequence is defined as a sequence that is far away from a cluster. We present new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence is deemed to be an outlier. The algorithm provides a coherent description to an analyst of the anomalies in the sequence when compared to more normal sequences. The final section of the paper demonstrates the effectiveness of sequenceMiner for anomaly detection on a real set of discrete sequence data from a fleet of commercial airliners. We show that sequenceMiner discovers actionable and operationally significant safety events. We also compare our innovations with standard Hidden Markov Models, and show that our methods are superior.
Edge-AI Pathogen Sequencer Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 4, 2025
Cite
Growth Market Reports (2025). Edge-AI Pathogen Sequencer Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/edge-ai-pathogen-sequencer-market
Available download formats: pdf, pptx, csv
Dataset updated
Aug 4, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Edge-AI Pathogen Sequencer Market Outlook

According to our latest research, the global Edge-AI Pathogen Sequencer market size reached USD 1.43 billion in 2024, reflecting the rapid adoption of advanced sequencing technologies integrated with artificial intelligence at the edge. The market is projected to expand at a robust CAGR of 17.8% from 2025 to 2033, reaching a forecasted value of USD 6.19 billion by 2033. This impressive growth is primarily fueled by the urgent need for real-time, decentralized pathogen detection, especially in clinical diagnostics and public health surveillance, where rapid and accurate identification of pathogens is critical for mitigating outbreaks and ensuring safety.
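The figures in this paragraph can be cross-checked with the standard compound annual growth rate formula. The sketch below assumes nine compounding years (2024 base value to 2033 forecast); the report does not state its exact base-year convention, so that count is an assumption, as are the function names.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Constant annual growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding `rate` for `years` years."""
    return start_value * (1 + rate) ** years


years = 2033 - 2024  # assumed compounding period
print(f"implied CAGR: {implied_cagr(1.43, 6.19, years):.2%}")
print(f"USD 1.43 bn at 17.8% for {years} years: {project(1.43, 0.178, years):.2f} bn")
```

Under this assumption the implied rate comes out slightly below the quoted 17.8%, and compounding 17.8% slightly overshoots USD 6.19 billion. The gap is small and could plausibly come from rounding or from a different base year.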

One of the primary growth drivers for the Edge-AI Pathogen Sequencer market is the escalating demand for rapid, point-of-care diagnostics, particularly in the wake of recent global health crises such as the COVID-19 pandemic. Traditional centralized sequencing approaches often suffer from delays due to sample transport and processing times, which can hamper timely responses to infectious disease outbreaks. Edge-AI enabled sequencers, by contrast, provide near-instantaneous results directly at the source, whether in remote clinics, field hospitals, or environmental monitoring stations. This capability is revolutionizing the approach to pathogen detection, allowing healthcare providers and public health authorities to make data-driven decisions faster, thereby improving patient outcomes and reducing the spread of infectious diseases.

Another significant growth factor is the integration of advanced AI algorithms with sequencing hardware, which enables on-device data processing and analysis. This reduces dependency on high-bandwidth internet connections and centralized data centers, making sequencing more accessible in resource-limited settings. Edge-AI Pathogen Sequencers leverage machine learning models to filter, analyze, and interpret sequencing data in real time, drastically reducing turnaround times and operational costs. As the sophistication of these AI models increases, so does the accuracy and reliability of pathogen identification, which is particularly important for detecting emerging or rare pathogens. The convergence of AI and edge computing with next-generation sequencing technologies is thus opening new frontiers in both clinical and non-clinical applications, such as food safety and environmental monitoring.

Moreover, the growing prevalence of antimicrobial resistance (AMR) and the need for robust surveillance systems are catalyzing the adoption of edge-AI sequencing solutions. Governments and international organizations are investing heavily in advanced pathogen surveillance infrastructure to track and contain AMR and other infectious threats. Edge-AI Pathogen Sequencers, with their ability to provide granular, location-specific data, are becoming indispensable tools in these efforts. Additionally, the miniaturization and cost reduction of sequencing devices are making them more viable for widespread deployment, from large hospital networks to small field labs, further accelerating market growth.

Regionally, North America continues to dominate the Edge-AI Pathogen Sequencer market, accounting for approximately 38% of the global revenue in 2024. This leadership is attributed to strong R&D investments, the presence of major industry players, and early adoption of advanced healthcare technologies. Europe follows closely, driven by robust regulatory frameworks and increasing emphasis on public health preparedness. The Asia Pacific region is emerging as the fastest-growing market, propelled by expanding healthcare infrastructure, rising disease burden, and government initiatives to modernize diagnostic capabilities. Latin America and the Middle East & Africa are also witnessing steady adoption, particularly in public health and food safety applications, though market penetration remains comparatively lower due to budgetary constraints and infrastructural challenges.

Product Type Analysis
Supplementary Data for "Mapper: fast and accurate sequence alignment via...
figshare.com
txt
Updated Nov 13, 2024
Cite
Anni Zhang (2024). Supplementary Data for "Mapper: fast and accurate sequence alignment via gapped x-mers" [Dataset]. http://doi.org/10.6084/m9.figshare.25976434.v4
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.25976434.v4
Dataset updated
Nov 13, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Anni Zhang
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
Evaluation Results:

Suboptimal alignment rates: The folders "alignment_accuracy/" and "alignment_accuracy_ANI" contain the following:
* Alignment results for the same read compared across different aligners. Files ending with withtag.txt represent all alignment results, while files ending with withtagclean.txt exclude middle soft clips reported by BWA and Minimap2.
* A summary of suboptimal alignment rates.
* Potential causes of suboptimal alignments.
Code to visualize the results (Figures 4, 5, S1, and S3) is available at https://github.com/caozhichongchong/Mapper_eva (Alignment_accuracy.ipynb and Alignment_accuracy_ANI.ipynb).

Alignment inconsistency rate: The folder "alignment_consistency/" contains a summary of alignment inconsistency rates and suboptimal alignment rates. Code to visualize the results (Figures 6 and S4) is available at https://github.com/caozhichongchong/Mapper_eva (Alignment_consistency.ipynb).

Speed: The folder "speed_evaluation" contains runtime and memory usage data for all aligners during library indexing and read alignment. Code to visualize the results (Figures 7, S5, and S6) can be found at https://github.com/caozhichongchong/Mapper_eva (Alignment_accuracy.ipynb).

Alignment example for Figure S2: The folder "alignment_example" contains the code and SAM files supporting the alignment results for an example read shown in Figure S2.
ADDA - Automatic Domain Decomposition Algorithm
neuinfo.org
Cite
ADDA - Automatic Domain Decomposition Algorithm [Dataset]. http://identifiers.org/RRID:SCR_007546
Unique identifier
https://identifiers.org/RRID:SCR_007546
Description
This is a web interface for ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. ADDA is downloadable. There are three ways in which you can retrieve a protein sequence and its domains from ADDA. Sequences can be located using sequence identifiers and/or accession numbers, using an identical fragment lookup, or by running BLAST against all sequences in ADDA. ADDA is a protein sequence clustering algorithm. It takes a set of sequences and returns domain families. ADDA has two steps corresponding to the two aspects of the protein sequence clustering problem. First, ADDA splits protein sequences into domains. The idea behind ADDA is in principle the application of Occam's razor; the goal is to describe the diversity of protein sequences with a minimal set of protein domains. The algorithm behind ADDA approximates this minimal set. In practice ADDA works by looking at where BLAST alignments are located on the sequence and splits the sequences so that as few alignments as possible are cut by domain boundaries and as many alignments as possible stretch over complete domains. Second, ADDA takes all the domains and arranges them in a minimum spanning tree, where the similarity between two domains is determined by their relative overlap given a BLAST alignment. Each link in the tree is then checked by a pairwise profile-profile comparison and links below a threshold are removed. The remaining connected components are then taken to represent protein domain families.
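The second step described above (spanning tree, link check, connected components) can be sketched generically. This is not ADDA's code: the domain indices, the overlap similarities, the `link_score` callable standing in for the profile-profile comparison, and the threshold are all hypothetical. The tree is built Kruskal-style by keeping the most similar links first, which is equivalent to a minimum spanning tree on a distance defined as negative similarity.

```python
from collections import defaultdict
from typing import Callable, Iterable


class UnionFind:
    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, u: int) -> int:
        while self.parent[u] != u:
            self.parent[u] = self.parent[self.parent[u]]  # path halving
            u = self.parent[u]
        return u

    def union(self, u: int, v: int) -> bool:
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return False
        self.parent[ru] = rv
        return True


def domain_families(
    n_domains: int,
    links: Iterable[tuple[float, int, int]],
    link_score: Callable[[int, int], float],
    threshold: float,
) -> list[list[int]]:
    # Step 1: spanning tree over the overlap similarities, best links first.
    uf = UnionFind(n_domains)
    tree = [(i, j) for _, i, j in sorted(links, reverse=True) if uf.union(i, j)]
    # Step 2: keep only tree links whose check score reaches the threshold.
    kept = UnionFind(n_domains)
    for i, j in tree:
        if link_score(i, j) >= threshold:
            kept.union(i, j)
    # Step 3: the remaining connected components are the families.
    families = defaultdict(list)
    for d in range(n_domains):
        families[kept.find(d)].append(d)
    return sorted(families.values())


# Hypothetical data: five domains, overlap similarities reused as check scores.
links = [(0.9, 0, 1), (0.8, 1, 2), (0.2, 2, 3), (0.7, 3, 4), (0.1, 0, 2)]
scores = {(i, j): s for s, i, j in links}
print(domain_families(5, links, lambda i, j: scores[(i, j)], threshold=0.5))
```

Checking only the tree links, rather than every pair of domains, keeps the number of expensive pairwise comparisons linear in the number of domains.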
Data from: In vivo functional phenotypes from a computational epistatic...
explore.openaire.eu
search.dataone.org
+2more
Updated Jan 17, 2024
Cite
Faruck Morcos; Sophia Alvarez; Charisse Nartey; Nicholas Mercado; Alberto de la Paz; Tea Huseinbegovic (2024). Data from: In vivo functional phenotypes from a computational epistatic model of evolution [Dataset]. http://doi.org/10.5061/dryad.n5tb2rc1c
Unique identifier
https://doi.org/10.5061/dryad.n5tb2rc1c
Dataset updated
Jan 17, 2024
Authors
Faruck Morcos; Sophia Alvarez; Charisse Nartey; Nicholas Mercado; Alberto de la Paz; Tea Huseinbegovic
Description
Data from: In vivo functional phenotypes from a computational epistatic model of evolution

This dataset includes sequence data, model parameters, simulation trajectories and experimental Sanger sequencing data related to a model of sequence evolution called Sequence Evolution with Epistatic Contributions (SEEC), applied to beta-lactamase TEM-1.

## Description of the data and file structure

SI Dataset S4 (PhaseI_MSA.fasta): Phase I multiple sequence alignment used for SEEC-AA mfDCA and bmDCA statistical inference. It was obtained from Pfam and pre-processed to remove sequences with more than 5% consecutive gaps.

SI Dataset S5 (PhaseII_MSA.fasta): Phase II multiple sequence alignment used for SEEC-NT mfDCA and bmDCA statistical inference. It was generated using HMMTools with the TEM-1 sequence as seed (excluding the signal peptide) and default parameters.

SI Dataset S6 (PhaseII_mfDCA_Parameters.mat): Coupling and local field matrices inferred using mean field DCA with the PhaseII MSA as the input. Objects: 1. PhaseII_mfDCA_eij (size=5523x5523) 2. PhaseII_mfDCA_hi (size=21x263). This file is a .mat readable in Matlab and compatible with the code found at Github (github.com/morcoslab/SEEC-NT).

SI Dataset S7 (PhaseII_bmDCA_Parameters.mat): Coupling (eij) and local field (hi) matrices inferred using Boltzmann machine learning DCA with the PhaseII MSA as the input. Objects: 1. eij (size=5523x5523) 2. hi (size=21x263). The eij matrices have been converted into the format that matches the output of mfDCA. This file is a .mat readable in Matlab and compatible with the code found at Github (github.com/morcoslab/SEEC-NT).

SI Dataset S8 (SEEC_nt_sequence_trajectories.mat): Sequences output from SEEC-nt used for variant selection. Objects: 1. SEEC_nt_bmDCA_Trajectory_amino_T0_75_3 (size=5000x263) 2. SEEC_nt_mfDCA_Trajectory_amino_T1_5_1 (size=5000x263). A .mat file readable in Matlab.

SI Dataset S9 (Figure_3_Sanger_Sequencing_Data.zip), SI Dataset S10 (Figure_4_Sanger_Sequencing_Data.zip): Raw Sanger sequencing chromatograms collected from plasmid samples isolated from assay cultures. Naming of chromatograms is as follows: the first number refers to the batch of sequencing. The second number is the sample run within that batch. "For" or "rev" refers to the forward or reverse sequencing direction, respectively. The rest of the name comes from the variant name as used in the manuscript, where Beg, Mid or Late refer to positions in the simulation trajectory, bm or mf refer to the DCA implementation used to infer the coupling and local field parameters, the number is the variant number, and NT indicates that the algorithm used was SEEC-nucleotide. Sanger sequencing chromatograms can be viewed using free software such as 4peaks, Benchling, or a number of other platforms.

## Code/Software

The direct coupling analysis methods used were mean field (https://github.com/morcoslab/SEEC-nt) or Boltzmann machine learning (bmDCA; https://github.com/matteofigliuzzi/bmDCA).
SABmark
dknet.org
scicrunch.org
+3more
Updated Jul 27, 2025
Cite
(2025). SABmark [Dataset]. http://identifiers.org/RRID:SCR_011817
Unique identifier
https://identifiers.org/RRID:SCR_011817 https://identifiers.org/RRID:SCR_011817/resolver
Dataset updated
Jul 27, 2025
Description
Downloadable data set designed to assess the performance of both multiple and pairwise (protein) sequence alignment algorithms; it is extremely easy to use. Currently, the database contains 2 sets, each consisting of a number of subsets with related sequences. Its main features are:
* Covers the entire known fold space (SCOP classification), with subsets provided by the ASTRAL compendium
* All structures have high quality, with 100% resolved residues
* Structure alignments have been derived carefully, using both SOFI and CE, and Relaxed Transitive Alignment
* At most 25 sequences in each subset to avoid overrepresentation of large folds
* Automated running, archiving and scoring of programs through a few Perl scripts

The Twilight Zone set is divided into sequence groups that each represent a SCOP fold. All sequences within a group share a pairwise Blast e-value of at least 1, for a theoretical database size of 100 million residues. Sequence similarity is thus very low, between 0-25% identity, and a (traceable) common evolutionary origin cannot be established between most pairs even though their structures are (distantly) similar. This set therefore represents the worst case scenario for sequence alignment, which unfortunately is also the most frequent one, as most related sequences share less than 25% identity.

The Superfamilies set consists of groups that each represent a SCOP superfamily, and therefore contain sequences with a (putative) common evolutionary origin. However, they share at most 50% identity, which is still challenging for any sequence alignment algorithm.

Frequently, alignments are performed to establish whether or not sequences are related. To benchmark this, a second version of both the Twilight Zone and the Superfamilies set is provided, in which a number of false positives, i.e. sequences not related to the original set, are added to each alignment problem.

Database specifications:
* Current version: 1.65 (concurrent with PDB, SCOP and ASTRAL)
* Twilight Zone set (with false positives): 209 groups, 1740 (3280) sequences, 10667 (44056) related pairs
* Superfamilies set (with false positives): 425 groups, 3280 (6526) sequences, 19092 (79095) related pairs
Audio outputs of the first note sequence algorithm: LEMorpheus software.
researchdatafinder.qut.edu.au
researchdata.edu.au
Updated Jun 7, 2010
Cite
Rene Wooller (2010). Audio outputs of the first note sequence algorithm: LEMorpheus software. [Dataset]. https://researchdatafinder.qut.edu.au/display/q110
Dataset updated
Jun 7, 2010
Dataset provided by
Queensland University of Technology (QUT)
Authors
Rene Wooller
Description
Audio outputs of the first note sequence algorithm that was developed and applied to the LEMorpheus software infrastructure. The LEMorpheus software system takes source and target note sequences and applies a user-selected note sequence morphing to create the morphed material.
Dataset for "To denoise or to cluster? That is not the question. Optimizing...
data.mendeley.com
Updated Jan 6, 2021
Cite
Xavier Turon (2021). Dataset for "To denoise or to cluster? That is not the question. Optimizing pipelines for COI metabarcoding and metaphylogeography" [Dataset]. http://doi.org/10.17632/84zypvmn2b.2
Unique identifier
https://doi.org/10.17632/84zypvmn2b.2
Dataset updated
Jan 6, 2021
Authors
Xavier Turon
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains the relevant files for a study optimizing and combining denoising and clustering algorithms for COI metabarcoding. The abstract is:

The recent blooming of metabarcoding applications to biodiversity studies comes with some relevant methodological debates. One such issue concerns the treatment of reads by denoising or by clustering methods, which have been wrongly presented as alternatives. It has also been suggested that denoised sequence variants should replace clusters as the basic unit of metabarcoding analyses, missing the fact that sequence clusters are a proxy for species-level entities, the basic unit in biodiversity studies. We argue here that methods developed and tested for ribosomal markers have been uncritically applied to highly variable markers such as cytochrome oxidase I (COI) without conceptual or operational (e.g., parameter setting) adjustment. COI has a naturally high intraspecies variability that should be assessed and reported, as it is a source of highly valuable information. We contend that denoising and clustering are not alternatives. Rather, they are complementary and both should be used together in COI metabarcoding pipelines. Using a typical dataset from benthic marine communities, we compared two denoising procedures (based on the UNOISE3 and the DADA2 algorithms), set suitable parameters for denoising and clustering COI datasets, and compared the outcome of applying these processes in different orders. Our results indicate that denoising based on the UNOISE3 algorithm preserves a higher intra-cluster variability. We suggest and test ways to improve this algorithm taking into account the natural variability of each codon position in coding genes. The order of the steps has little influence on the final outcome. We recommend researchers to consider reporting their results in terms of both denoised sequences (a proxy for haplotypes) and clusters formed (a proxy for species), and to avoid collapsing the sequences of the latter into a single representative. 
This will allow studies at the cluster (ideally equating species-level diversity) and at the intra-cluster level, and will ease additivity and comparability between studies.
Combining de novo peptide sequencing algorithms, a synergistic approach to...
ebi.ac.uk
Updated Sep 25, 2019
Cite
Bernhard Blank-Landeshammer (2019). Combining de novo peptide sequencing algorithms, a synergistic approach to boost both identifications and confidence in bottom-up proteomics [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD005280
Dataset updated
Sep 25, 2019
Authors
Bernhard Blank-Landeshammer
Variables measured
Proteomics
Description
Complex MS-based proteomics datasets are usually analyzed by protein database searches. While this approach performs considerably well for sequenced organisms, direct inference of peptide sequences from tandem mass spectra, i.e. de novo peptide sequencing, is often the only way to obtain information when protein databases are absent. However, available algorithms suffer from drawbacks such as a lack of validation and often high rates of false positive hits (FP). Here we present a simple method of combining results from commonly available de novo peptide sequencing algorithms, which in conjunction with minor tweaks in data acquisition yields a lower empirical FDR compared to the analysis using single algorithms. Results were validated using state-of-the-art database search algorithms as well as specifically synthesized reference peptides. Thus, we could increase the number of PSMs meeting a stringent FDR of 5% more than threefold compared to the single best de novo sequencing algorithm alone, accounting for an average of 11,120 PSMs (combined) instead of 3,476 PSMs (alone) in triplicate 2 h LC-MS runs of tryptic HeLa digestion.
Next Generation Sequencing Data Analysis Market Analysis North America,...
technavio.com
pdf
Updated Jul 22, 2024
Technavio (2024). Next Generation Sequencing Data Analysis Market Analysis North America, Europe, Asia, Rest of World (ROW) - US, Germany, Canada, China, UK - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/next-generation-sequencing-data-analysis-market-analysis
Available download formats
pdf
Dataset updated
Jul 22, 2024
Dataset provided by
TechNavio
Authors
Technavio
Time period covered
2024 - 2028
Area covered
Canada, China, Germany, United Kingdom, United States
Description
Next Generation Sequencing Data Analysis Market Size 2024-2028

The global next generation sequencing data analysis market size is estimated to grow by USD 1.90 billion at a CAGR of 22.58% between 2023 and 2028. The market's growth hinges on several factors, including the escalating demand for personalized medicine, the increasing need for early diagnosis of genetic disorders, and the expanding applications in genomics research. Personalized medicine, tailored to individual genetic makeup, is gaining traction for its targeted and more effective treatment approach. The emphasis on early diagnosis of genetic disorders is driving the demand for advanced genetic testing technologies. Moreover, the broadening applications in genomics research, particularly in understanding genetic mechanisms and disease pathways, are fueling market expansion. These trends collectively highlight the growing significance of genetic testing and personalized medicine in healthcare, underscoring the market's growth trajectory.

What will be the Size of the Next Generation Sequencing Data Analysis Market During the Forecast Period?

Key Companies & Market Insights

Companies are implementing various strategies, such as strategic alliances, partnerships, mergers and acquisitions, geographical expansion, and product/service launches, to enhance their presence in the market. The report also includes detailed analyses of the competitive landscape of the market and information about key companies, including:

Agilent Technologies Inc., Alphabet Inc., BGI Genomics Co. Ltd., Bio Rad Laboratories Inc., Bionivid Technology Pvt. Ltd., Congenica Ltd., Corewell Health, DNAnexus Inc., DNASTAR Inc., Eurofins Scientific SE, F. Hoffmann La Roche Ltd., Fabric Genomics Inc., Golden Helix Inc., HiberCell Inc., Illumina Inc., Invitae Corp., Macrogen Inc., Oxford Nanopore Technologies plc, Pacific Biosciences of California Inc., Partek Inc., PierianDx Inc., QIAGEN NV, SciGenom Labs Pvt. Ltd., Takara Bio Inc., Thermo Fisher Scientific Inc., and Vela Diagnostics

Qualitative and quantitative analysis of companies has been conducted to help clients understand the wider business environment as well as the strengths and weaknesses of key market players. Data is qualitatively analyzed to categorize companies as pure play, category-focused, industry-focused, and diversified; it is quantitatively analyzed to categorize companies as dominant, leading, strong, tentative, and weak.

Market Segmentation

By End-user

The market share growth by the academic research segment will be significant during the forecast period. The market encompasses DNA sequencing technologies used in genomic science, academic research, and clinical diagnostics. Academic institutions utilize NGS for various applications, such as drug discovery, personalized medicine, and clinical diagnostics.

The academic research segment was valued at USD 221.1 million in 2018. Key drivers include decreasing sequencing costs, user-friendly software, and the demand for precision medicine. NGS enables the analysis of genomic patterns, epigenetics, and biological processes through sequence analysis tools and algorithms. Applications include oncology, genetic research, and tumor genotyping. NGS protocols aid in identifying somatic driver mutations, germline mutations, and resistance mutations. Cancer-related illnesses, financial irregularities, and healthcare professionals benefit from these tools, machine learning techniques, and cloud-based solutions. Additionally, NGS is applied in agriculture, forensics, and genomic studies. Key technologies include Whole-Genome Sequencing, array-based technologies, and clinical. Hence, these factors are expected to drive the market during the forecast period.

By Product

Services play an important role in the market, providing specialized expertise and support to users in analyzing and interpreting their NGS data. The market encompasses various services for Exome Sequencing, Targeted Resequencing, De Novo Sequencing, and Methyl Sequencing. Biotechnology and pharmaceutical companies, along with contract research organizations, utilize these services to analyze and interpret their NGS data. The process involves raw data preprocessing, alignment, variant calling, and annotation, employing advanced tools and algorithms. Service providers ensure accuracy and reliability through quality control measures and optimization of parameters. Technologies such as sequencing by synthesis (SBS) are an integral part of these services. Hence, these factors are expected to drive the growth of the services segment in the market during the forecast period.

Regional Analysis

North America is estimated to contribute 49% to the growth of the global
Examples of sequence alignment with contiguous, binary and ternary seeds
zenodo.org
txt, zip
Updated Feb 12, 2024
Valeriy Titarenko; Sofya Titarenko (2024). Examples of sequence alignment with contiguous, binary and ternary seeds [Dataset]. http://doi.org/10.5281/zenodo.10645042
Available download formats
zip, txt
Unique identifier
https://doi.org/10.5281/zenodo.10645042
Dataset updated
Feb 12, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Valeriy Titarenko; Sofya Titarenko
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classical sequence alignment algorithms use contiguous chunks of symbols to pre-align short sequences (reads) obtained for a studied organism to a long reference sequence. The use of spaced seeds (when we ignore possible differences between two sequences at some positions) allows researchers to improve the sensitivity of alignment algorithms.
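
As a rough illustration of the idea above, the following minimal Python sketch compares a read with a reference window only at the positions a seed marks as significant. The mask notation (`1` = must match, `0` = ignored), the function name `seed_match`, and the toy sequences are assumptions made for this example; they are not taken from the dataset or from perlotSeeds.

```python
def seed_match(read: str, ref_window: str, mask: str) -> bool:
    """Return True if read and ref_window agree wherever mask has a '1'.

    Positions where mask has a '0' are ignored, which is what lets a
    spaced seed tolerate differences at those positions.
    """
    if not (len(read) == len(ref_window) == len(mask)):
        raise ValueError("read, ref_window and mask must have equal length")
    return all(r == w for r, w, m in zip(read, ref_window, mask) if m == "1")


# Hypothetical toy data: the window differs from the read at index 2 only.
read = "ACGTACGT"
window = "ACTTACGT"

contiguous_seed = "1" * 8      # every position must match
spaced_seed = "11011011"       # indices 2 and 5 are ignored

print(seed_match(read, window, contiguous_seed))
print(seed_match(read, window, spaced_seed))
```

With the contiguous mask the single difference rejects the candidate, while the spaced mask skips that position and still reports a hit, which is the sensitivity gain the description refers to.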

In genetics, point mutations have different probabilities. Therefore, it may be reasonable to consider transitional (A <-> G, C <-> T) and transversional (all other) mutations separately.

In perlotSeeds, we consider the alignment of paired-end reads (Han Chinese South, sequence data, ERR016118) with respect to the Human Reference Genome (Human genome assembly GRCh38.p14).

We consider various contiguous seeds, e.g. C32 for a seed of length 32, and ternary seeds for the given read length (76), e.g. T1V2 is a seed that allows one transitional and two transversional mismatches. We then generate a library of records corresponding to a chosen seed. This library is used to find candidate alignments for all reads.
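
The mismatch budget implied by a name such as T1V2 can be sketched as follows. This is a simplified illustration that counts transitions (A <-> G, C <-> T) and transversions (all other substitutions) between a read and an equally long reference window; it does not reproduce how perlotSeeds builds signatures or its library. The function names and the toy sequences are hypothetical, and the sketch assumes the sequences contain only A, C, G and T.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def count_mismatches(read: str, ref_window: str) -> tuple[int, int]:
    """Return (transitions, transversions) between two equal-length sequences."""
    if len(read) != len(ref_window):
        raise ValueError("read and ref_window must have equal length")
    transitions = transversions = 0
    for r, w in zip(read, ref_window):
        if r == w:
            continue
        if (r, w) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1  # any other substitution
    return transitions, transversions


def within_budget(read: str, ref_window: str,
                  max_transitions: int, max_transversions: int) -> bool:
    """Check a candidate against a TxVy-style budget (e.g. T1V2 -> 1, 2)."""
    t, v = count_mismatches(read, ref_window)
    return t <= max_transitions and v <= max_transversions


# Hypothetical toy data: index 0 is A->G (transition), index 7 is T->A (transversion).
read = "ACGTACGT"
window = "GCGTACGA"

print(count_mismatches(read, window))
print(within_budget(read, window, max_transitions=1, max_transversions=2))
print(within_budget(read, window, max_transitions=0, max_transversions=2))
```

Treating the two mismatch classes separately lets a seed be stricter about the rarer class, which is the motivation given above for considering transitional and transversional mutations separately.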

We provide statistics (InputStat.zip) for each generated library, i.e. the number of records sharing the same signature (generated by the seed). There are also output statistics for all reads and chosen seeds, e.g. outputStatT1V3.zip, which record how many signatures are generated for each read, how many successful alignments can be made, and the best score. More detailed information on several groups of reads can be found in the ExampleOutput.zip file.
Supplementary Datasets for dadasnake workflow
zenodo.org
application/gzip, bin +1
Updated Nov 2, 2020
Anna Heintz-Buschart (2020). Supplementary Datasets for dadasnake workflow [Dataset]. http://doi.org/10.5281/zenodo.4181260
Available download formats
bin, tsv, application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.4181260
Dataset updated
Nov 2, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Anna Heintz-Buschart
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains configuration and results files for the proof-of-principle of the dadasnake pipeline. Includes dadasnake output and tables with the composition of ground-truth data or mock-communities.

dadasnake is a user-friendly, one-command Snakemake pipeline that wraps the pre-processing of sequencing reads and the delineation of exact sequence variants, using the favorably benchmarked and widely used DADA2 algorithm, with a taxonomic classification and the post-processing of the resultant tables, including hand-off in standard formats. The suitability of the provided default configurations is demonstrated using mock-community data from bacteria and archaea, as well as fungi. By use of Snakemake, dadasnake makes efficient use of high-performance computing infrastructures. Easy user configuration guarantees flexibility of all steps, including the processing of data from multiple sequencing platforms. dadasnake facilitates easy installation via conda environments. dadasnake is available at https://github.com/a-h-b/dadasnake.
Data from: De novo sequencing and variant calling with nanopores using...
data.niaid.nih.gov
datadryad.org
zip
Updated Sep 3, 2015
Tamas Szalay; Jene A. Golovchenko (2015). De novo sequencing and variant calling with nanopores using PoreSeq [Dataset]. http://doi.org/10.5061/dryad.84d4j
Available download formats
zip
Unique identifier
https://doi.org/10.5061/dryad.84d4j
Dataset updated
Sep 3, 2015
Dataset provided by
Harvard University
Authors
Tamas Szalay; Jene A. Golovchenko
License
https://spdx.org/licenses/CC0-1.0.html
Description
The accuracy of sequencing single DNA molecules with nanopores is continually improving, but de novo genome sequencing and assembly using only nanopore data remain challenging. Here we describe PoreSeq, an algorithm that identifies and corrects errors in nanopore sequencing data and improves the accuracy of de novo genome assembly with increasing coverage depth. The approach relies on modeling the possible sources of uncertainty that occur as DNA transits through the nanopore and finds the sequence that best explains multiple reads of the same region. PoreSeq increases nanopore sequencing read accuracy of M13 bacteriophage DNA from 85% to 99% at 100× coverage. We also use the algorithm to assemble Escherichia coli with 30× coverage and the λ genome at a range of coverages from 3× to 50×. Additionally, we classify sequence variants at an order of magnitude lower coverage than is possible with existing methods.
RNA modification detection using direct RNA sequencing and nanoDoc2
zenodo.org
application/gzip
Updated Jun 24, 2022
Ueda (2022). RNA modification detection using direct RNA sequencing and nanoDoc2 [Dataset]. http://doi.org/10.5281/zenodo.6583336
Available download formats
application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.6583336
Dataset updated
Jun 24, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Ueda
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The core of nanoDoc2 includes a machine-learning algorithm in which a 6-mer segmented raw current signal is compared by Deep-One-Class classification using a Wavenet-based neural network. As an output, an RNA modification is detected by a statistical score in each candidate position. Herein, we describe the detailed instructions on how to use nanoDoc2 for signal segmentation, train/test the neural network and finally predict RNA modifications present in nanopore direct RNA sequence data.

Amr Rashed (2022). DNA sequence alignment datasets based on NW algorithm [Dataset]. https://ieee-dataport.org/documents/dna-sequence-alignment-datasets-based-nw-algorithm

2 scholarly articles cite this dataset (View in Google Scholar)
