Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on the availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of the Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. We gathered 902 marine eukaryote genomes and transcriptomes from multiple sources and assessed these candidate entries for sequence quality and cross-contamination issues, selecting 800 validated entries for inclusion in the library. MarFERReT v1 contains reference sequences from 800 marine eukaryotic genomes and transcriptomes, covering 453 species- and strain-level taxa and totaling nearly 28 million protein sequences with associated NCBI and PR2 Taxonomy identifiers and Pfam functional annotations. An accompanying MarFERReT project repository hosts containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT: https://github.com/armbrustlab/marferret
The raw source data for the 902 candidate entries considered for MarFERReT v1.1.1, including the 800 accepted entries, are available for download from their respective online locations. The source URL for each entry is listed in MarFERReT.v1.1.1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository (https://github.com/armbrustlab/marferret).
This repository release contains MarFERReT database files from the v1.1.1 MarFERReT release, built using the following MarFERReT library build scripts: assemble_marferret.sh, pfam_annotate.sh, and build_diamond_db.sh. The following MarFERReT data products are available in this repository:
MarFERReT.v1.1.1.metadata.csv: This CSV file contains descriptors of each of the 902 database entries, including data source, taxonomy, and sequence descriptors. Data fields are as follows:
entry_id: Unique MarFERReT sequence entry identifier.
accepted: Acceptance into the final MarFERReT build (Y/N). The Y/N values can be adjusted to customize the final build output according to user-specific needs; see the sketch after this field list.
marferret_name: A human- and machine-friendly string derived from the NCBI Taxonomy organism name, maintaining strain-level designation wherever possible.
tax_id: The NCBI Taxonomy ID (taxID).
pr2_accession: Best-matching PR2 accession ID associated with the entry.
pr2_rank: The lowest shared rank between the entry and the pr2_accession.
pr2_taxonomy: PR2 Taxonomy classification scheme of the pr2_accession.
data_type: Type of sequence data; transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT).
data_source: Online location of sequence data; the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP), NCBI GenBank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or entries from the Roscoff Culture Collection through the METdb database repository (METdb).
source_link: URL where the original sequence data and/or metadata was collected.
pub_year: Year of data release or publication of linked reference.
ref_link: PubMed URL directing to the published reference for the entry, if available.
ref_doi: DOI of entry data from source, if available.
source_filename: Name of the original sequence file from the data source.
seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets.
n_seqs_raw: Number of sequences in the original sequence file.
source_name: Full organism name from the entry source.
original_taxID: Original NCBI taxID from the entry data source metadata, if available.
alias: Additional identifiers for the entry, if available.
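For illustration, a minimal Python sketch of working with this metadata file, assuming only the columns documented above (the custom output file name is a placeholder):

```python
import pandas as pd

meta = pd.read_csv("MarFERReT.v1.1.1.metadata.csv")

# Entries accepted into the final build are marked 'Y' in the 'accepted' column.
accepted = meta[meta["accepted"] == "Y"]
print(f"{len(accepted)} of {len(meta)} entries accepted")

# The Y/N values can be edited to customize a build, e.g. excluding all
# single-cell amplified genome (SAG) entries before re-running the build:
meta.loc[meta["data_type"] == "SAG", "accepted"] = "N"
meta.to_csv("MarFERReT.custom.metadata.csv", index=False)  # placeholder name
```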
MarFERReT.v1.1.1.curation.csv: This CSV file contains curation and quality-control information on the 902 candidate entries considered for incorporation into MarFERReT v1, including curated NCBI Taxonomy IDs and entry validation statistics. Data fields are as follows:
entry_id: Unique MarFERReT sequence entry identifier.
marferret_name: Organism name in human and machine friendly format, including additional NCBI taxonomy strain identifiers if available.
tax_id: Verified NCBI taxID used in MarFERReT.
taxID_status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged).
taxID_notes: Notes on the original_taxID.
n_seqs_raw: Number of sequences in the original sequence file.
n_pfams: Number of Pfam domains identified in protein sequences.
qc_flag: Early validation quality-control flags: LOW_SEQS, fewer than 1,200 raw sequences; LOW_PFAMS, fewer than 500 Pfam domain annotations.
flag_Lasek: Flag notes from Lasek-Nesselquist and Johnson (2019); contains the flag 'FLAG_LASEK' indicating ciliate samples reported as contaminated in this study.
VV_contam_pct: Estimated contamination reported for MMETSP entries in Van Vlierberghe et al. (2021).
flag_VanVlierberghe: Flag for a high level of estimated contamination (FLAG_VV), set when 'VV_contam_pct' exceeds 50%.
rp63_npfams: Number of ribosomal protein Pfam domains out of 63 total.
rp63_contam_pct: Percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity, as described in the Technical Validation section from analysis of 63 Pfam ribosomal protein domains.
flag_rp63: Flag for a high level of estimated contamination (FLAG_RP63), set when 'rp63_contam_pct' exceeds 50%.
flag_sum: Count of the number of flag columns set (qc_flag, flag_Lasek, flag_VanVlierberghe, and flag_rp63). All entries with one or more flags are nominally rejected ('accepted' = N); entries without any flags are validated and accepted ('accepted' = Y). A validation sketch follows this field list.
accepted: Acceptance into the final MarFERReT build (Y or N).
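As a minimal sketch, the acceptance logic described above can be cross-checked from the curation table; this assumes unset flags are stored as empty cells (read by pandas as NaN), which is an assumption about how the file encodes flags:

```python
import pandas as pd

cur = pd.read_csv("MarFERReT.v1.1.1.curation.csv")
flag_cols = ["qc_flag", "flag_Lasek", "flag_VanVlierberghe", "flag_rp63"]

# flag_sum counts how many of the four flag columns are set per entry;
# entries with zero flags should carry 'accepted' = 'Y', all others 'N'.
flag_sum = cur[flag_cols].notna().sum(axis=1)
expected = flag_sum.eq(0).map({True: "Y", False: "N"})
print((expected == cur["accepted"]).mean())  # fraction agreeing with the release
```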
MarFERReT.v1.1.1.proteins.faa.gz: This Gzip-compressed FASTA file contains the 27,951,013 final translated and clustered protein sequences for all 800 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where 'X' is a ten-digit integer value).
MarFERReT.v1.1.1.taxonomies.tab.gz: This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses, and intentionally contains some columns without any data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.1.1.proteins.faa.gz. Note that 'accession.version' and 'taxid' are populated columns while 'accession' and 'gi' have NA values; the latter columns are required for backward compatibility as input for the DIAMOND alignment software and LCA analysis. A build sketch follows the column list below.
The columns in this file contain the following information:
accession: (NA)
accession.version: The unique MarFERReT sequence identifier ('mftX').
taxid: The NCBI Taxonomy ID associated with this reference sequence.
gi: (NA).
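For orientation, a sketch of the database build step that build_diamond_db.sh automates, showing where this taxonomy map plugs in; the exact flags in the provided script may differ, and nodes.dmp/names.dmp (from the NCBI taxdump archive) are assumed local file names:

```python
import subprocess

subprocess.run(
    ["diamond", "makedb",
     "--in", "MarFERReT.v1.1.1.proteins.faa.gz",
     "--db", "MarFERReT.v1.1.1",
     # The taxonomies file mimics NCBI's prot.accession2taxid layout,
     # which is why the empty 'accession' and 'gi' columns are kept.
     "--taxonmap", "MarFERReT.v1.1.1.taxonomies.tab.gz",
     "--taxonnodes", "nodes.dmp",
     "--taxonnames", "names.dmp"],
    check=True,
)
```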
MarFERReT.v1.1.1.proteins_info.tab.gz: This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:
aa_id: The unique identifier for each MarFERReT protein sequence.
entry_id: The unique numeric identifier for each MarFERReT entry.
source_defline: The original, unformatted sequence identifier.
MarFERReT.v1.1.1.best_pfam_annotations.csv.gz: This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 800 validated MarFERReT entries, derived from the hmmsearch annotations against Pfam 34.0 functional domains. This file contains the following fields:
aa_id: The unique MarFERReT protein sequence ID ('mftX').
pfam_name: The shorthand Pfam protein family name.
pfam_id: The Pfam identifier.
pfam_eval: HMM profile match e-value score.
pfam_score: HMM profile match bit score.
MarFERReT.v1.1.1.dmnd: This binary file is the indexed database of the MarFERReT protein library with embedded NCBI taxonomic information, generated by the DIAMOND makedb tool using the build_diamond_db.sh script from the MarFERReT /scripts/ library. It can be used as the reference DIAMOND database for annotating environmental sequences from eukaryotic metatranscriptomes, as in the sketch below.
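A hedged usage example follows; the query and output file names are placeholders, and --outfmt 102 is DIAMOND's taxonomic classification mode, which reports a lowest-common-ancestor taxID per query:

```python
import subprocess

subprocess.run(
    ["diamond", "blastp",
     "--db", "MarFERReT.v1.1.1.dmnd",
     "--query", "environmental_transcripts.faa",  # placeholder input
     "--out", "marferret_lca.tsv",                # placeholder output
     "--outfmt", "102",    # query, LCA taxid, e-value of best hit
     "--evalue", "1e-5"],  # illustrative significance cutoff
    check=True,
)
```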
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a composite collection of bioactive peptide sequences and Complex Similarity Network (CSN) analysis outputs, designed to explore the functional relationships of 1,872 Secreted Cysteine-Rich peptides/proteins Without Annotation (SCRs-WA). The dataset integrates eight peptide classes, including antimicrobial peptides (AMPs), defensins, venoms/toxins, and non-AMP controls, to establish a reference chemical space for functional inference.
It includes both input sequence data (FASTA format) and CSN-derived output files, which facilitate the visualization and clustering of peptide sequences based on structural and functional similarities:
1- FileSM1: FileSM1_12449_All_8_datasets.fasta 📄 Content:
A FASTA file containing 12,449 peptide sequences across eight datasets: (i) Low-toxicity antimicrobial peptides (AMPs) (ii) Defensins (iii) Animal venoms and toxins (iv) Cytotoxic peptides (v) Haemolytic peptides (vi) Non-AMPs (negative controls) (vii) Cnidarian toxin candidates from S. savaglia (viii) Secreted Cysteine-Rich ORFs Without Annotation (mSCRs-WA)
🔍 Usage: - Serves as the primary input dataset for complex similarity network (CSN) analysis. - Enables homology searches, functional annotation, and comparative analyses.
📤 Output Files from CSN Analysis 2- 🗂 FileSM2: FileSM2_HSPN_Topology_GraphML.zip 📄 Content:
A compressed ZIP file containing GraphML representations of the Half-Space Proximal Network (HSPN): HSPN_clusters_projection.graphml → Clustered projection of peptide connectivity based on similarity metrics. HSPN_peptide_classes_projection.graphml → Projection of peptide classes (AMPs, toxins, defensins, etc.), highlighting their network positioning. 🖥 Visualization:
Can be opened in Gephi v0.10 or any GraphML-compatible tool. Nodes represent peptide sequences, edges indicate functional similarity, and clusters reflect shared bioactivity profiles.
🔍 Usage: - Facilitates visual exploration of sequence relationships. - Enables functional annotation transfer by identifying clusters with known bioactive peptides.
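As an alternative to Gephi, a minimal Python sketch for loading the GraphML files programmatically; the attribute names inside the files are not documented here, so the inspection step simply prints whatever is present:

```python
import networkx as nx

# Load the clustered HSPN projection after unzipping FileSM2.
G = nx.read_graphml("HSPN_clusters_projection.graphml")
print(G.number_of_nodes(), "nodes;", G.number_of_edges(), "edges")

# Inspect the attributes attached to an arbitrary node.
some_node = next(iter(G.nodes))
print(G.nodes[some_node])
```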
3- 🗂 FileSM3: FileSM3_Clusters_Composition_Analysis.xlsx 📄 Content:
A spreadsheet detailing cluster composition in the HSPN analysis, including: Cluster ID and size Distribution of peptides across eight datasets Functional annotation insights for each cluster
🔍 Usage: - Helps identify key functional groups within the CSN framework. - Provides quantitative insights into peptide distribution and classification.
4- 🗂 FileSM4: FileSM4_HSPN_Connections_Analysis.xlsx 📄 Content:
A spreadsheet detailing functional connections between peptides, including: Pairwise similarity scores Network centrality measures (e.g., harmonic centrality, degree centrality) Annotations of linked sequences
🔍 Usage: - Supports similarity-based functional inference. - Helps track peptide relationships and connectivity patterns within the network.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is the list of all nsLTP sequences used for our work "Comprehensive classification of the plant non-specific lipid transfer protein superfamily towards its Sequence – Structure – Function analysis". It might be useful for any studies in phylogeny (this is the largest dataset used in a phylogeny study to date) or sequence-structure-function relationships. The sequences are named as follows: a 3-digit number (from 001 to 800, from our internal classification), the nsLTP type (if available and as stated by the initial authors), a 5-letter organism code, and the database identifier (the identifier from the database the sequence was retrieved from). Sequences 526, 600 and 693 were removed during the study (redundancy).
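If the name components are underscore-separated (an assumption; the actual delimiter is not stated above), a hypothetical parser could look like this:

```python
import re

def parse_nsltp_name(name):
    # 3-digit number _ nsLTP type _ 5-letter organism code _ database identifier
    m = re.match(r"(?P<num>\d{3})_(?P<type>[^_]*)_(?P<org>[A-Za-z]{5})_(?P<db_id>.+)",
                 name)
    return m.groupdict() if m else None

print(parse_nsltp_name("001_Type1_ARATH_P82353"))  # illustrative input only
```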
http://opendatacommons.org/licenses/dbcl/1.0/
The data in this file provides essential information about the primary structure of human proteins. The primary structure refers to the specific sequence of amino acids that constitute a protein. These sequences are crucial as they determine the three-dimensional structure and ultimately the function of the protein.
Sequencing human proteins is a fundamental step in understanding their biological functions and the roles they play in various cellular processes. This dataset can be used for a range of applications, including protein structure prediction, functional annotation, and comparative genomics.
Annotations play a crucial role in understanding the roles and behavior of proteins in various biological systems. They provide insights into the potential functions and involvement of proteins in specific pathways or physiological processes. These annotations are often obtained through experimental studies, computational predictions, or curated databases.
The annotations in this file can aid researchers in interpreting the protein sequences and further exploring their functional properties. By linking sequence information to functional annotations, researchers can gain a deeper understanding of the roles these proteins play in human biology, development, and disease.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The dataset was recorded during milling of 16MnCr5. Due to artificially introduced, though realistic, anomalies in the workpiece, the dataset can be applied for anomaly detection. Furthermore, milling tools with two different diameters were used, which makes the dataset eligible for transfer learning.

Technical remarks: The dataset consists of seven folders. Each folder represents one milling run. In each milling run the depth of cut was set to 3 mm. A folder contains a maximum of three json files. The number of files depends on the time needed for each run, which is a function of milling tool diameter and feed rate. Files in each folder were numbered in sequence. For example, folder “run1” contains the files “run1_1” and “run1_2”, with the last number indicating the order in which the files were generated. The frequency of recording datapoints was set to 500 Hz. During each milling run the milling tool moved along the longitudinal side and was then moved back alongside the workpiece; this way machining always started on the same side of the workpiece. Table 1 provides an overview of the milling runs. Runs 1 to 4 were performed with an HSS tool with a diameter of 10 mm. The tool in use was an end mill (HSS-E-SPM HPC 10 mm) developed by Hoffmann Group. During the first three runs with this end mill no tool breakage occurred; however, in run 4 the tool broke. Runs 5 and 6 were performed with an end mill of the same tool series (HSS-E-SPM HPC 8 mm) that differs only in tool diameter. In contrast, run 7 was performed using a solid carbide tool (solid carbide roughing end mill HPC 8 mm). Cutting with SC tools provides much higher productivity, with the downside being a higher tool price. In our case the SC end mill performed cuts with a feed rate of 1150 mm/min compared to 191 mm/min achieved by an HSS end mill of the same diameter. Tool breakages were recorded on all runs with end mills of diameter 8 mm.

Table 1. Overview of the data folders
folder name | number of json files | tool diameter | tool breakage | tool type
run 1 | 2 | 10 mm | No | HSS
run 2 | 2 | 10 mm | No | HSS
run 3 | 2 | 10 mm | No | HSS
run 4 | 2 | 10 mm | Yes | HSS
run 5 | 2 | 8 mm | Yes | HSS
run 6 | 3 | 8 mm | Yes | HSS
run 7 | 1 | 8 mm | Yes | SC

Each json file consists of a header and a payload. The header lists all parameters that were recorded, such as position, motor torque and motor current of each of a maximum of five axes of a milling machine. However, the machine used in our experiments is a 3-axis machining center, which leaves the payload of the 2 possible additional axes empty. In the payload the sequential data for each parameter can be found. A list of recorded signals can be found in Table 2.

Table 2. Recorded signals during milling
signal index in payload | signal name | signal address | type
13-18 | VelocityFeedForward | VEL_FFW|1 | double
19-24 | Power | POWER|1 | string
25-30 | CountourDeviation | CONT_DEV|1 | double
38-43 | TorqueFeedForward | TORQUE_FFW|1 | double
44-49 | Encoder1Position | ENC1_POS|1 | double
56-61 | Load | LOAD|1 | double
68-73 | Torque | TORQUE|1 | double
68-91 | Current | CURRENT|1 | double

In the signal address, 1 represents the x-axis, 2 the y-axis, 3 the z-axis and 6 the spindle axis. Note that our milling center has 3 axes and therefore values for axes 4 and 5 are null.
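A minimal Python sketch for loading one of the json files, assuming the top-level keys are literally 'header' and 'payload' and that payload entries are keyed by signal address (both assumptions):

```python
import json

with open("run1/run1_1.json") as fh:  # file naming per the convention above
    record = json.load(fh)

header = record["header"]    # recorded parameters for up to five axes
payload = record["payload"]  # sequential samples recorded at 500 Hz

print(list(payload)[:5])     # inspect the first few signal keys
```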
https://creativecommons.org/publicdomain/zero/1.0/
The provided code processes a Tajweed dataset, which appears to be a collection of audio recordings categorized by different Tajweed rules (Ikhfa, Izhar, Idgham, Iqlab). Let's break down the dataset's structure and the code's functionality:
Dataset Structure:
Code Functionality:
Initialization and Imports: The code begins with necessary imports (pandas, pydub) and mounts Google Drive. Pydub is used for audio file format conversion.
Directory Listing: It initially checks if a specified directory exists (for example, Alaa_alhsri/Ikhfa) and lists its files, demonstrating basic file system access.
Metadata Creation: The core of the script is the generation of metadata, which provides essential information about each audio file. The tajweed_paths dictionary maps each Tajweed rule to a list of paths, associating each path with the reciter's name. The metadata fields are:
global_id: A unique identifier for each audio file.
original_filename: The original filename of the audio file.
new_filename: A standardized filename that incorporates the Tajweed rule (label), sheikh's ID, audio number, and a global ID.
label: The Tajweed rule.
sheikh_id: A numerical identifier for each sheikh.
sheikh_name: The name of the reciter.
audio_number: A sequential number for the audio files within a specific sheikh and Tajweed rule combination.
original_path: Full path to the original audio file.
new_path: Full path to the intended location for the renamed and potentially converted audio file.
File Renaming and Conversion: Each audio file is renamed to its new_filename, converted to .wav format, and stored in a new output_dataset directory, creating standardized files. The new filenames are based on rule, sheikh, and a counter.
Metadata Export: Finally, the compiled metadata is saved as a CSV file (metadata.csv) in the output directory. This CSV file is crucial for training any machine learning model using this data.
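A condensed, runnable sketch of the pipeline described above; the directory layout, reciter list, and filename pattern are illustrative assumptions:

```python
import os
import pandas as pd
from pydub import AudioSegment  # needs ffmpeg installed for non-wav inputs

# Hypothetical layout: each rule maps to (reciter name, directory) pairs.
tajweed_paths = {"Ikhfa": [("Alaa_alhsri", "Alaa_alhsri/Ikhfa")]}
os.makedirs("output_dataset", exist_ok=True)

rows, global_id = [], 0
for label, sources in tajweed_paths.items():
    for sheikh_id, (sheikh_name, folder) in enumerate(sources, start=1):
        for audio_number, fname in enumerate(sorted(os.listdir(folder)), start=1):
            global_id += 1
            new_filename = f"{label}_{sheikh_id}_{audio_number}_{global_id}.wav"
            new_path = os.path.join("output_dataset", new_filename)
            # Convert whatever format the source is in to standardized .wav
            AudioSegment.from_file(os.path.join(folder, fname)).export(
                new_path, format="wav")
            rows.append({"global_id": global_id, "original_filename": fname,
                         "new_filename": new_filename, "label": label,
                         "sheikh_id": sheikh_id, "sheikh_name": sheikh_name,
                         "audio_number": audio_number,
                         "original_path": os.path.join(folder, fname),
                         "new_path": new_path})

pd.DataFrame(rows).to_csv("output_dataset/metadata.csv", index=False)
```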
This repository contains the dataset used in the publication "Turnover of strain-level diversity modulates functional traits in the honeybee gut microbiome between nurses and foragers," which is currently under revision. A pre-print can be found here. The database is based on previously published work to create a genomic database of honeybee gut microbes by Kirsten Ellegaard (2021), found here. After unzipping, the folder deposited here should contain the following files and directories:
honeybee_genome.fasta: fasta file containing the host (Apis mellifera) genome sequence.
beebiome_db: fasta file of 198 concatenated genomes, one genome per entry (multi-line fasta), where the headers represent the genome identifier.
beebiome_red_db: fasta file of 39 species-representative genomes, one genome per entry (multi-line fasta), where the headers represent the genome identifier, to be used for the analysis of intra-specific variation.
fna_files: directory containing genome sequence files and concatenated files, where the concatenated files contain one fasta entry renamed to the genome identifier with all contigs concatenated into one entry.
ffn_files: directory containing one file per genome listing the nucleotide sequences of all predicted genes.
faa_files: directory containing one file per genome listing the amino acid sequences of all predicted genes.
bed_files: directory containing bed files indicating the location of each predicted gene based on its position in the concatenated genome file.
single_ortho: directory containing one file per phylotype listing all single-copy orthogroups (OGs) identified by OrthoFinder, where each line gives an OG id followed by the genes from each genome of that phylotype belonging to that OG; the corresponding sequences of these genes can be found in the ffn file of the respective genome.
red_bed_files: directory containing bed files for species-representative genomes that list only the positions of genes belonging to the core orthogroups of their phylotype.
Further information about how this genome database was used to analyze strain-level diversity can be found in the publication and accompanying code repository.
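For example, a single_ortho file can be read with a few lines of Python, assuming OrthoFinder's usual "OG0000001: geneA geneB ..." line format (the file name below is hypothetical):

```python
def read_orthogroups(path):
    """Map each orthogroup id to its list of member gene identifiers."""
    ortho = {}
    with open(path) as fh:
        for line in fh:
            og_id, _, genes = line.partition(":")
            ortho[og_id.strip()] = genes.split()
    return ortho

# og2genes = read_orthogroups("single_ortho/firm5.txt")  # hypothetical file
```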
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The MAISON-LLF dataset was collected from 10 older adult participants living alone in the community following lower limb fractures. Each participant contributed data for over 8 weeks, beginning from their first-week post-discharge. This resulted in a total of 560 days of continuous multimodal sensor data, complemented by biweekly clinical questionnaire data.
The MAISON-LLF dataset is organized into a directory tree, as shown below.
maison-llf/
├── sensor-data/
│ ├── p01/
.
.
.
│ ├── p10/
│ │ ├── acceleration-data.csv
│ │ ├── heartrate-data.csv
│ │ ├── motion-data.csv
│ │ ├── position-data.csv
│ │ ├── sleep-data.csv
│ │ ├── step-data.csv
├── features/
│ ├── p01/
.
.
.
│ ├── p10/
│ │ ├── acceleration-features.csv
│ │ ├── heartrate-features.csv
│ │ ├── motion-features.csv
│ │ ├── position-features.csv
│ │ ├── sleep-features.csv
│ │ ├── step-features.csv
│ │ ├── clinical.csv
├── dataset/
│ ├── all-features.csv
│ ├── all-features-imputed.csv
│ ├── dataset-daily.pt
│ ├── dataset-weekly.pt
│ ├── dataset-biweekly.pt
In ‘sensor-data’ folder, the dataset includes 60 CSV files containing data from six sensor types for 10 participants. Each file includes a ‘timestamp’ column indicating the date and time of the recorded sensor data, accurate to milliseconds (‘yyyy-MM-dd HH:mm:ss.SSS’), along with the corresponding sensor measurements. For instance, the ‘acceleration-data.csv’ files include four columns: timestamp, and x, y, and z coordinates, while the ‘heartrate-data.csv’ files contain two columns: timestamp and heart rate value.
The dataset also includes 70 CSV files containing daily features extracted from the sensor data, along with clinical questionnaire data and physical test results. Each feature CSV file includes a timestamp column representing the date (‘yyyy-MM-dd’) of the sensor data from which the daily features were extracted, alongside the corresponding sensor features. For example, the ‘acceleration-features.csv’ files contain eight columns: timestamp and the seven acceleration features and the ‘heartrate-features.csv’ files include five columns: timestamp and the four heart rate features. Additionally, the ‘clinical.csv’ files provide values for individual items of the SIS (‘sis-01’ to ‘sis-06’), OHS (‘ohs-01’ to ‘ohs-12’), and OKS (‘oks-01’ to ‘oks-12’) questionnaires, along with their final scores (‘sis’, ‘ohs’, and ‘oks’). These files also include results for the TUG and 30-second chair stand tests. Each participant has four sets of clinical data, with each set sharing the same ‘timestamp’ corresponding to the date (‘yyyy-MM-dd’) on which the clinical data were collected.
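A minimal pandas sketch for loading one participant's files; only the timestamp column name is documented above, so the coordinate column names x, y, and z are an assumption:

```python
import pandas as pd

# Raw 'acceleration-data.csv': timestamp (yyyy-MM-dd HH:mm:ss.SSS), x, y, z.
acc = pd.read_csv("maison-llf/sensor-data/p01/acceleration-data.csv",
                  parse_dates=["timestamp"])

# Daily features keyed by date (yyyy-MM-dd).
feats = pd.read_csv("maison-llf/features/p01/acceleration-features.csv",
                    parse_dates=["timestamp"])
print(acc.head(), feats.head(), sep="\n")
```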
To provide a comprehensive overview of the dataset, the ‘all-features.csv’ and ‘all-features-imputed.csv’ files in ‘dataset’ folder combine all daily features, clinical data, and demographic information into single CSV files, representing the data before and after missing value imputation (as explained in subsection 2.2.4). Additionally, the Python PyTorch files are structured datasets designed to facilitate supervised and unsupervised machine learning model development for estimating clinical outcomes.
‘dataset-daily.pt’ in ‘dataset’ folder contains a NumPy array with dimensions num_days × num_features, representing the daily features for all 10 participants. Alongside this array, it includes a num_days IDs array that maps each day to a participant (IDs 1 to 10). Additionally, the file contains three separate num_days arrays for SIS, OHS, and OKS scores, each assigned to the corresponding days in the daily features array.
‘dataset-weekly.pt’ in ‘dataset’ folder provides an array with dimensions num_weeks × 7 × num_features, which includes the weekly sequential features for all participants. This file also includes a num_weeks IDs array to identify the participant (1 to 10) associated with each week in the samples array. Similar to the daily dataset, it contains three separate num_weeks arrays for the SIS, OHS, and OKS scores, each assigned to the respective weeks in the weekly features array.
‘dataset-biweekly.pt’ in ‘dataset’ folder provides an array with dimensions num_biweeks × 14 × num_features, which includes the biweekly sequential features for all participants. This file also includes a num_biweeks IDs array to identify the participant (1 to 10) associated with each biweekly period in the samples array. Similar to the daily dataset, it contains three separate num_biweeks arrays for the SIS, OHS, and OKS scores, each assigned to the respective biweekly periods in the biweekly features array.
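The .pt files can be loaded with PyTorch; how the arrays are keyed inside each file is not documented above, so the access pattern in the comments is an assumption:

```python
import torch

daily = torch.load("maison-llf/dataset/dataset-daily.pt")

# If the file stores a dict of arrays (an assumption), access might look like:
# X   = daily["features"]   # num_days x num_features
# ids = daily["ids"]        # participant ID (1..10) per day
# sis, ohs, oks = daily["sis"], daily["ohs"], daily["oks"]
print(type(daily))
```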
Citation
Cite the related pre-print:
A. Abedi, C. H. Chu, and S. S. Khan, "Multimodal Sensor Dataset for Remote Monitoring of Older Adults Post Lower-Limb Fractures in the Community,"
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In vitro study of the effects on the human gut microbiota of lemon pectins with two different molecular weights and varying degrees of esterification. Data collected following incubations include: amplicon sequencing of the V1-V2 regions of the 16S rRNA gene (found in the NCBI Sequence Read Archive associated with BioProject PRJNA903836: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA903836), RT-qPCR of Bifidobacterium sp. 16S rRNA genes, and short-chain fatty acid concentrations. Resources in this dataset:
Resource Title: Short Chain Fatty Acid concentrations File Name: pectin_SCFA_data.csv
Resource Title: Pectin_total_bacterial_qPCR_data File Name: pectin_total_bacterial_qPCR_data.csv
Resource Title: Pectin_bifidobacterium_genus_qPCR_data File Name: pectin_bifidobacterium_genus_qPCR_data.csv
Resource Title: Sample metadata
File Name: pectin_sample_attributes.csv
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of two parts: XPCS and DLS experimental data.

The XPCS experimental data were collected on the SSRF 10U USAXS beamline in March 2023, with the following experimental parameters: X-ray energy 10 keV; sample-to-detector distance 27.6 m; detector EIGER X 4M, with a single pixel size of 75 microns. The experimental sample was a colloidal glycerol suspension: the water in the original solution (McLean, M814153) was replaced with glycerol using a rotary evaporator, resulting in a volume fraction of 1%. Using partially coherent X-rays, speckle patterns were collected at different exposure periods. The correlation function of the sample at different exposure periods and q values can be obtained through the autocorrelation function (a computational sketch follows this description). Data file name explanation: SiO2_500_N1000_100ms_GL_ICT1_4165 means colloid (SiO2)_particle size in nm_number of frames collected (N+number)_exposure period_medium (glycerol)_pinhole (100 microns)_data sequence number.

The DLS experimental data were collected in the SSRF ancillary laboratories in July 2023 using the DLS device, with the following experimental parameters: laser wavelength 633 nm; scattering angle 90°; experimental temperature 23.7 °C. The original solution was diluted with pure water to different concentrations and placed in a quartz cuvette for detection. The scattered signal is received by a PMT and correlated with the correlator (ALV-7004/USB-FAST) to obtain the results. We collected correlation signals at different concentrations and analyzed the impact of multiple scattering of colloids on particle size. Data file name explanation: SiO2_500_X10_10sX10_2 means colloid (SiO2)_particle size in nm_dilution ratio (concentration of 1%/10)_collection cycle X times_data sequence number.
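As a computational sketch of the autocorrelation step mentioned above, a textbook estimator of the normalized intensity correlation g2(tau) for a single pixel or q-bin time series (not the beamline's own reduction code):

```python
import numpy as np

def g2(intensity):
    """g2(tau) = <I(t) * I(t+tau)> / <I>^2 for an evenly sampled series."""
    I = np.asarray(intensity, dtype=float)
    mean_sq = I.mean() ** 2
    taus = np.arange(1, len(I) // 2)
    return taus, np.array([(I[:-t] * I[t:]).mean() / mean_sq for t in taus])
```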
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Tables (1 to 3). (XLS 32 kb)
Amplicon sequencing utilizing next-generation platforms has significantly transformed how research is conducted, specifically in microbial ecology. However, primer and sequencing platform biases can confound or change the way scientists interpret these data. The Pacific Biosciences RSII instrument may also preferentially load smaller fragments, which may be a function of PCR product exhaustion during sequencing. To further examine these biases, data are provided from 16S rRNA rumen community analyses. Specifically, data on the relative phylum-level abundances of the ruminal bacterial community are provided to determine between-sample variability. Direct sequencing of metagenomic DNA was conducted to circumvent primer-associated biases in 16S rRNA reads, and rarefaction curves were generated to demonstrate adequate coverage of each amplicon. PCR products were also subjected to reduced amplification and pooling to reduce the likelihood of PCR product exhaustion during sequencing on the Pacific Biosciences platform. Taxonomic profiles of the relative phylum-level and genus-level abundance of rumen microbiota as a function of PCR pooling for sequencing on the Pacific Biosciences RSII platform are provided. Data are within this article, and raw ruminal MiSeq sequence data are available from the NCBI Sequence Read Archive (SRA Accession SRP047292). Additional descriptive information is associated with NCBI BioProject PRJNA261425: http://www.ncbi.nlm.nih.gov/bioproject/PRJNA261425/ Resources in this dataset: Resource Title: NCBI Sequence Read Archive (SRA Accession SRP047292). File Name: Web Page, url: https://www.ncbi.nlm.nih.gov/sra/SRX704260 1 ILLUMINA (Illumina MiSeq) run: 978,195 spots, 532.9M bases, 311.6Mb downloads.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset we collected for the 'Using Sequence-to-Sequence Learning for Repairing C Vulnerabilities' paper. See the description in the paper for how the dataset was collected. Please cite 'Using Sequence-to-Sequence Learning for Repairing C Vulnerabilities' if you use the dataset.
src-all.txt and tgt-all.txt contain the tokenized function pairs and are ready to use as training data. Each line in both txt files corresponds to a function before and after a commit that was classified as a bug-fix commit.
The two tar files contain the raw data used to generate both txt files; each contains the commits collected during the respective year.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The POLARIS dataset is built from a decade of polarimetric observations (2014–2024) conducted with the SPHERE instrument on the Very Large Telescope (VLT). Specifically, it includes all public polarized light observations obtained using the IRDIS instrument, retrieved from the ESO Science Archive. These raw observations were uniformly preprocessed using a modified version of the IRDAP pipeline to generate high-quality Polarimetric Differential Imaging (PDI) products.
The dataset consists of three main components:
96 labeled PDI-postprocessed polarimetric images (1024 × 1024 pixels), each annotated as either a target (with circumstellar disk structures) or a reference (with no detectable disk structures). This subset is approximately 3.18 GB in size.
813 unlabeled PDI-postprocessed polarimetric images, each derived from sequences of preprocessed exposures in total intensity light (2014-2023). This component occupies approximately GB. The PDI-postprocessed polarimetric images for 2024 will be added soon in a new version, bringing the total to 921 unlabeled polarized samples.
206 RDI preprocessed exposure sequences used for downstream imputation, each corresponding to a labeled reference and composed of the original preprocessed exposures in total intensity light. The data is organized by year, with each archive file named according to its corresponding year. Each sequence contains 4n images (where n is the number of exposure cycles), with a resolution of 1024 × 1024 pixels per frame. This component totals approximately 38 GB (2014-2024).
All files are provided in standard .fits format, following astronomical data conventions. The labeled PDI images support supervised learning tasks such as classification or domain adaptation, while the exposure sequences and unlabeled samples enable studies in imputation, denoising, self-supervised learning, or contrastive representation learning. The dataset will continue to expand as additional SPHERE observations are released to the public.
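The .fits files can be inspected with astropy; the file name below is a placeholder:

```python
from astropy.io import fits

with fits.open("polaris_pdi_example.fits") as hdul:
    hdul.info()               # list the HDUs in the file
    image = hdul[0].data      # e.g. a 1024 x 1024 polarimetric frame
    header = hdul[0].header   # standard FITS metadata
```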
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract:
In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
The major contributions have materialized in the form of novel algorithms.
Typically, researchers have taken on the challenge of exploring one specific type of observability data source, such as application logs, metrics, or distributed traces, to create new algorithms.
Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance.
Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.
General Information:
This repository contains simple scripts for data statistics and a link to the multi-source distributed system dataset.
You may find details of this dataset from the original paper:
Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".
If you use the data, implementation, or any details of the paper, please cite!
BIBTEX:
@inproceedings{nedelkoski2020multi, title={Multi-source Distributed System Data for AI-Powered Analytics}, author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej}, booktitle={European Conference on Service-Oriented and Cloud Computing}, pages={161--176}, year={2020}, organization={Springer} }
The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced by running a complex distributed system (OpenStack). In addition, we also provide the workload and fault scripts together with the Rally report, which can serve as ground truth. We provide two datasets, which differ in how the workload is executed. The sequential_data is generated by executing a workload of sequential user requests. The concurrent_data is generated by executing a workload of concurrent user requests.
The raw logs in both datasets contain the same files. If the user wants the logs filtered by time with respect to the two datasets, they should refer to the timestamps in the metrics (these provide the time window). In addition, we suggest using the provided aggregated time-ranged logs for both datasets in CSV format.
Important: The logs and the metrics are synchronized with respect to time, and both are recorded in CEST (Central European Summer Time). The traces are in UTC (Coordinated Universal Time, 2 hours behind CEST). They should be synchronized if the user develops multimodal methods; a minimal sketch follows. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.
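A minimal sketch of the suggested synchronization, shifting trace timestamps from UTC to CEST (UTC+2) so all three modalities share one clock; the file and column names are assumptions:

```python
import pandas as pd

traces = pd.read_csv("traces.csv", parse_dates=["timestamp"])
traces["timestamp"] = traces["timestamp"] + pd.Timedelta(hours=2)  # UTC -> CEST
```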
Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Task: 24 subjects watched a complete 25-minute black-and-white television episode (the Twilight Zone, https://en.wikipedia.org/wiki/The_Lateness_of_the_Hour). They had no instructions other than to attend to the movie. An eyetracking calibration period preceded the movie; the movie begins at TR=15.
Image acquisition: MRI data were collected on a 3T full-body scanner (Siemens Skyra) with a 16-channel head coil. Functional images were acquired using a T2*-weighted echo planar imaging (EPI) pulse sequence (TR 1500 ms, TE 28 ms, flip angle 64°, whole-brain coverage with 27 slices of 4 mm thickness, in-plane resolution 3 × 3 mm², FOV 192 × 192 mm²), ascending interleaved. Anatomical images were acquired using a T1-weighted MPRAGE pulse sequence (0.89 mm³ resolution).
Image pre-processing: The first two volumes of each functional run were discarded for T1 equilibration. Preprocessing was performed in FSL, including slice time correction, motion correction, linear detrending, high-pass filtering (140 s cutoff), and coregistration and affine transformation of the functional volumes to a template brain (MNI). Functional images were resampled to 3 mm isotropic voxels.
File descriptions: sub# refers to the set of individual subject data. There are 24 subjects and thus 24 files.
Dataset for the study: Schöpper, L. M., & Frings, C. (2022). Same, but different: Binding effects in auditory, but not visual detection performance. Attention, Perception, & Psychophysics, 1-14. https://doi.org/10.3758/s13414-021-02436-5

For further information please refer to the aforementioned paper. The aggregated data files can be analyzed using the respective SPSS syntax available under "Code for: Same, but different: Binding effects in auditory, but not visual detection performance" to perform the analyses reported in the paper.

Responding to a stimulus leads to the integration of response and stimulus features into an event file. Upon repetition of any of its features, the previous event file is retrieved, thereby affecting ongoing performance. Such integration-retrieval explanations exist for a number of sequential tasks (which measure these processes as 'binding effects') and are thought to underlie all actions. However, based on the attentional orienting literature, Schöpper, Hilchey, et al. (2020) showed that binding effects are absent when participants detect visual targets in a sequence: in visual detection performance, there is simply a benefit for target location changes (inhibition of return). In contrast, Mondor and Leboe (2008) had participants detect auditory targets in a sequence and found a benefit for frequency repetition, presumably reflecting a binding effect in auditory detection performance. In the current study, we conducted two experiments that differed only in the modality of the target: participants signaled the detection of a sound (N = 40) or of a visual target (N = 40). Whereas visual detection performance showed a pattern incongruent with binding assumptions, auditory detection performance revealed a non-spatial feature repetition benefit, suggesting that frequency was bound to the response. Cumulative reaction time distributions indicated that the absence of a binding effect in visual detection performance was not caused by overall faster responding. The current results show a clear limitation of binding accounts in action control: binding effects are not only limited by task demands, but can depend entirely on target modality.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data file 1: PROSITE_positives_PS000125.fasta. Sequence file in FASTA format of all positive examples for the ser/thr phosphatase model.
Data file 2: PROSITE_negatives_PS000125.fasta. Sequence file in FASTA format of all randomly selected negative examples for the ser/thr phosphatase model.
Data file 3: PROSITE_positives_PS00028.fasta. Sequence file in FASTA format of all positive examples for the zinc finger model.
Data file 4: PROSITE_negatives_PS00028.fasta. Sequence file in FASTA format of all randomly selected negative examples for the zinc finger model.
Data file 5: PROSITE_PS00125.txt. PROSITE record used for the ser/thr phosphatase model.
Data file 6: PROSITE_PS00028.txt. PROSITE record used for the zinc finger model.
Data file 7: MDR_TCDB_positives.fasta. Sequence file of MDR transporters used for training; FASTA format file of positive examples used in this study, derived from the TCDB.
Data file 8: MDR_TCDB_negatives.fasta. Sequence file of non-MDR transporters used for training; FASTA format file of negative examples used in this study, derived from the TCDB.
Data file 9: PILGram_PATTERNS_PS00125.txt. Regular expression generated by PILGram for the ser/thr phosphatase model.
Data file 10: PS00125_alignments.out. Sequence alignments of PILGram model matches to the positive examples in the ser/thr phosphatase model.
Data file 11: PILGram_PATTERNS_PS00028.txt. Regular expressions generated by PILGram for the zinc finger model.
Data file 12: PS00028_alignments.out. Sequence alignments of PILGram model matches to the positive examples in the zinc finger model, plus a summary score line that represents the overlap of the 10 different models for each sequence.
Data file 13: PILGram_PATTERNS_MDRpred.txt. The 36 regular expressions and associated physiochemical properties (where applicable) generated by PILGram for the MDR model.
Data file 14: MDRpred_alignments.out. Alignments of 36 PILGram model matches on the MDR positive example sequences.
Data file 15: Pfam_transporters.txt. A list of Pfam families that were used to identify transporters in the Hot Lake metagenome.
Data file 16: HotLake_MDRpred_predictions.fasta. A FASTA format file of 63 protein sequences from the Hot Lake metagenome that are matched by 30 or more MDRpred individual models (high-confidence predictions), match Pfam families for transporters (Pfam e-value less than 1e-20), but are not identified by Pfam as multidrug resistance transporters.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A comprehensive dataset characterizing healthy research volunteers in terms of clinical assessments, mood-related psychometrics, neuropsychological tests of cognitive function, structural and functional magnetic resonance imaging (MRI), diffusion tensor imaging (DTI), and a comprehensive magnetoencephalography (MEG) battery.
In addition, blood samples are currently banked for future genetic analysis. All data collected in this protocol are broadly shared in the OpenNeuro repository, in the Brain Imaging Data Structure (BIDS) format. In addition, task paradigms and basic pre-processing scripts are shared on GitHub. This dataset is unprecedented in its depth of characterization of a healthy population and will allow a wide array of investigations into normal cognition and mood regulation.
This dataset is licensed under the Creative Commons Zero (CC0) v1.0 License.
This release includes data collected between 2020-06-03 (cut-off date for v1.0.0) and 2024-04-01. Notable changes in this release:
visit and age_at_visit columns added to phenotype files to distinguish between visits and intervals between them. See the CHANGES file for the complete version-wise changelog.
To be eligible for the study, participants need to be medically healthy adults over 18 years of age with the ability to read, speak and understand English. All participants provided electronic informed consent for online pre-screening, and written informed consent for all other procedures. Participants with a history of mental illness or suicidal or self-injury thoughts or behavior are excluded. Additional exclusion criteria include current illicit drug use, abnormal medical exam, and less than an 8th grade education or IQ below 70. Current NIMH employees, or first degree relatives of NIMH employees are prohibited from participating. Study participants are recruited through direct mailings, bulletin boards and listservs, outreach exhibits, print advertisements, and electronic media.
All potential volunteers visit the study website, check a box indicating consent, and fill out preliminary screening questionnaires. The questionnaires include basic demographics, the World Health Organization Disability Assessment Schedule 2.0 (WHODAS 2.0), the DSM-5 Self-Rated Level 1 Cross-Cutting Symptom Measure, the DSM-5 Level 2 Cross-Cutting Symptom Measure - Substance Use, the Alcohol Use Disorders Identification Test (AUDIT), the Edinburgh Handedness Inventory, and a brief clinical history checklist. The WHODAS 2.0 is a 15 item questionnaire that assesses overall general health and disability, with 14 items distributed over 6 domains: cognition, mobility, self-care, “getting along”, life activities, and participation. The DSM-5 Level 1 cross-cutting measure uses 23 items to assess symptoms across diagnoses, although an item regarding self-injurious behavior was removed from the online self-report version. The DSM-5 Level 2 cross-cutting measure is adapted from the NIDA ASSIST measure, and contains 15 items to assess use of both illicit drugs and prescription drugs without a doctor’s prescription. The AUDIT is a 10 item screening assessment used to detect harmful levels of alcohol consumption, and the Edinburgh Handedness Inventory is a systematic assessment of handedness. These online results do not contain any personally identifiable information (PII). At the conclusion of the questionnaires, participants are prompted to send an email to the study team. These results are reviewed by the study team, who determines if the participant is appropriate for an in-person interview.
Participants who meet all inclusion criteria are scheduled for an in-person screening visit to determine if there are any further exclusions to participation. At this visit, participants receive a History and Physical exam, Structured Clinical Interview for DSM-5 Disorders (SCID-5), the Beck Depression Inventory-II (BDI-II), Beck Anxiety Inventory (BAI), and the Kaufman Brief Intelligence Test, Second Edition (KBIT-2). The purpose of these cognitive and psychometric tests is two-fold. First, these measures are designed to provide a sensitive test of psychopathology. Second, they provide a comprehensive picture of cognitive functioning, including mood regulation. The SCID-5 is a structured interview, administered by a clinician, that establishes the absence of any DSM-5 axis I disorder. The KBIT-2 is a brief (20 minute) assessment of intellectual functioning administered by a trained examiner. There are three subtests, including verbal knowledge, riddles, and matrices.
Biological and physiological measures are acquired, including blood pressure, pulse, weight, height, and BMI. Blood and urine samples are taken and a complete blood count, acute care panel, hepatic panel, thyroid stimulating hormone, viral markers (HCV, HBV, HIV), c-reactive protein, creatine kinase, urine drug screen and urine pregnancy tests are performed. In addition, three additional tubes of blood samples are collected and banked for future analysis, including genetic testing.
Participants were given the option to enroll in optional magnetic resonance imaging (MRI) and magnetoencephalography (MEG) studies.
On the same visit as the MRI scan, participants are administered a subset of tasks from the NIH Toolbox Cognition Battery. The four tasks assess attention and executive functioning (Flanker Inhibitory Control and Attention Task), executive functioning (Dimensional Change Card Sort Task), episodic memory (Picture Sequence Memory Task), and working memory (List Sorting Working Memory Task). The MRI protocol used was initially based on the ADNI-3 basic protocol, but was later modified to include portions of the ABCD protocol in the following manner:
The optional MEG studies were added to the protocol approximately one year after the study was initiated, thus there are relatively fewer MEG recordings in comparison to the MRI dataset. MEG studies are performed on a 275 channel CTF MEG system. The position of the head was localized at the beginning and end of the recording using three fiducial coils. These coils were placed 1.5 cm above the nasion, and at each ear, 1.5 cm from the tragus on a line between the tragus and the outer canthus of the eye. For some participants, photographs were taken of the three coils and used to mark the points on the T1 weighted structural MRI scan for co-registration. For the remainder of the participants, a BrainSight neuro-navigation unit was used to coregister the MRI, anatomical fiducials, and localizer coils directly prior to MEG data acquisition.
NOTE: In the release 2.0 of the dataset, two measures Brief Trauma Questionnaire (BTQ) and Big Five personality survey were added to the online screening questionnaires. Also, for the in-person screening visit, the Beck Anxiety Inventory (BAI) and Beck Depression Inventory-II (BDI-II) were replaced with the General Anxiety Disorder-7 (GAD7) and Patient Health Questionnaire 9 (PHQ9) surveys, respectively. The Perceived Health rating survey was discontinued.
Survey or Test | BIDS TSV Name |
---|---|
Alcohol Use Disorders Identification Test (AUDIT) | audit.tsv |
Brief Trauma Questionnaire (BTQ) | btq.tsv |
Big-Five Personality | big_five_personality.tsv |
Demographics | demographics.tsv |
Drug Use Questionnaire |