mSCAPE Analysis Specification

Fields

Field	Data type	Description	Restrictions
`published_date`	`date`	The date the object was published in Onyx.	• Output format: `iso-8601`
`site`	`choice`	The site or sequencing centre providing the data.	• Choices: `barts`, `bham`, `cuh`, `gosh`, `gstt`, `nuth`, `public`, `ripl`, `ucl`, `uclh`, `uhs`, `ukhsa`, `ukhsabris`, `ukhsamanc`, `wtsi`
`climb_id`	`text`	Unique identifier for a project record in Onyx.
`biosample_id`	`text`	The sequencing provider's identifier for a sample.
`biosample_source_id`	`text`	Unique identifier for an individual to permit multiple samples from the same individual to be linked.
`run_id`	`text`	Unique identifier assigned to the run by the sequencing instrument.
`platform`	`choice`	The platform used to sequence the data.	• Choices: `illumina`, `illumina.se`, `ont`
`input_type`	`choice`	The type of input sequenced.	• Choices: `community_standard`, `negative_control`, `positive_control`, `specimen`, `validation_material`
`specimen_type_details`	`choice`	Named control or standard for specimens.	• Choices: `asymptomatic`, `respiratory_infection`
`control_type_details`	`choice`	Named control or standard for positive and negative controls.	• Choices: `NIBSC_11/242`, `NIBSC_20/170`, `bacillus_ms2phage`, `resp_matrix_mc110`, `water_extraction_control`, `zepto_rp2.1`, `zymo-mc_D6300`
`sample_source`	`choice`	The source from which the sample was collected.	• Choices: `blood`, `environment`, `faecal`, `lower_respiratory`, `nose_and_throat`, `other`, `plasma`, `pleural_fluid`, `stool`, `tissue`, `upper_respiratory`, `urine`
`sample_type`	`choice`	The type of sampling method used.	• Choices: `aspirate`, `bal`, `biopsy`, `other`, `sputum`, `swab`
`spike_in`	`choice`	The type of spike-in used in the run.	• Choices: `ERCC-RNA_4456740`, `bacillus_ms2phage`, `ms2-phage`, `none`, `phix`, `tobacco_mosaic_virus`, `zymo_D6320`, `zymo_D6321`
`spike_in_result`	`choice`	Result assigned by scylla for the provided spike-in.	• Choices: `fail`, `partial`, `pass`
`collection_date`	`date`	The date the sample was collected.	• Output format: `YYYY-MM-DD`
`received_date`	`date`	The date the sample was received by the sequencing centre (if collection_date unavailable).	• Output format: `YYYY-MM-DD`
`is_approximate_date`	`bool`	The date is approximate, e.g. the sample is from a public repository and it is unclear whether the date corresponds to collection or publishing.
`batch_id`	`text`	Used to identify samples prepared in the same laboratory batch (e.g. extraction, library and/or sequencing).
`study_id`	`text`	Used to identify the study, or to indicate that the sample is an NHS residual sample.
`study_centre_id`	`text`	Used to identify the sequencing centre.
`sequence_purpose`	`choice`	Used to differentiate between clinical or research studies.	• Choices: `clinical`, `research`
`governance_status`	`choice`	Whether the patient consented to their sample being used for research purposes.	• Choices: `consented_for_research`, `no_consent_for_research`, `open`
`iso_country`	`choice`	Country that the sample was collected in, using ISO-3166-1 alpha-2 codes (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), unless within the United Kingdom, in which case use ISO-3166-2:GB (https://en.wikipedia.org/wiki/ISO_3166-2:GB).	• Choices: `AD`, `AE`, `AF`, `AG`, `AI`, `AL`, `AM`, `AO`, `AQ`, `AR`, `AS`, `AT`, `AU`, `AW`, `AX`, `AZ`, `BA`, `BB`, `BD`, `BE`, ...
`iso_region`	`choice`	Region that the sample was collected in, using the second level subdivision codes of ISO-3166-2:GB (https://www.iso.org/obp/ui/#iso:code:3166:GB).	• Choices: `GB-ABC`, `GB-ABD`, `GB-ABE`, `GB-AGB`, `GB-AGY`, `GB-AND`, `GB-ANN`, `GB-ANS`, `GB-BAS`, `GB-BBD`, `GB-BCP`, `GB-BDF`, `GB-BDG`, `GB-BEN`, `GB-BEX`, `GB-BFS`, `GB-BGE`, `GB-BGW`, `GB-BIR`, `GB-BKM`, ...
`extraction_enrichment_protocol`	`text`	Details of nucleic acid extraction and optional enrichment steps.
`library_protocol`	`text`	Details of sequencing library construction.
`sequencing_protocol`	`text`	Details of sequencing.
`protocol_arm`	`choice`	Indicates the arm for protocols that have separate arms for bacterial and viral nucleic acids.	• Choices: `bacterial`, `viral`
`bioinformatics_protocol`	`text`	Details of the initial bioinformatics protocol, for example versions of basecalling software and models used, and any read quality filtering/trimming employed.
`dehumanisation_protocol`	`text`	Details of bioinformatics method used for human read removal.
`is_public_dataset`	`bool`	The sample is from a public dataset. Please only set this after it has been made public.
`public_database_name`	`choice`	The public repository where the data is.	• Choices: `ENA`, `SRA`
`public_database_accession`	`text`	The accession for the data in the public database.
`ingest_report`	`text`	HTML report summarising the read profile and taxa identified.
`taxon_reports`	`text`	Folder of all classification output files.
`human_filtered_reads_1`	`text`	Compressed FASTQ of input reads that have been filtered for human reads.
`human_filtered_reads_2`	`text`	Compressed FASTQ of input reads that have been filtered for human reads.
`unclassified_reads_1`	`text`	Compressed FASTQ of input reads which could not be classified.
`unclassified_reads_2`	`text`	Compressed FASTQ of input reads which could not be classified.
`viral_reads_1`	`text`	Compressed FASTQ of input reads which were classified as viral.
`viral_reads_2`	`text`	Compressed FASTQ of input reads which were classified as viral.
`viral_and_unclassified_reads_1`	`text`	Compressed FASTQ of input reads which were classified as viral or were unclassified.
`viral_and_unclassified_reads_2`	`text`	Compressed FASTQ of input reads which were classified as viral or were unclassified.
`total_bases`	`integer`	Total number of bases in the input FASTQ file(s), before any filtering.
`classifier`	`choice`	The classifier used.	• Choices: `Kraken2`
`classifier_version`	`text`	Version of the classifier used.
`classifier_db`	`choice`	Database used for read classification.	• Choices: `PlusPF`
`classifier_db_date`	`date`	Date classifier database was produced.	• Output format: `YYYY-MM-DD`
`ncbi_taxonomy_date`	`date`	Date that the NCBI taxonomy dump was produced.	• Output format: `YYYY-MM-DD`
`scylla_version`	`text`	Version of the scylla pipeline used.
`chimera_bam`	`text`	BAM file of the human-filtered read fraction aligned to the Zeus database.
`is_chimera_published`	`bool`	Whether chimera has been run on this record or not.
`alignment_db_version`	`text`	Version of the Zeus database used.
`sylph_db_version`	`text`	Sylph database version utilised to produce Sylph classifications.
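
The description of `received_date` implies a fallback: use `collection_date` when it is available, otherwise `received_date`. A minimal sketch of that rule follows, assuming a record is available as a plain dictionary keyed by the field names above with dates as `YYYY-MM-DD` strings. The helper name `effective_sample_date` is hypothetical and not part of Onyx.

```python
from datetime import date
from typing import Optional


def effective_sample_date(record: dict) -> Optional[date]:
    """Return collection_date if set, otherwise received_date, otherwise None.

    Assumes dates are YYYY-MM-DD strings, as stated in the Restrictions column.
    """
    for field in ("collection_date", "received_date"):
        value = record.get(field)
        if value:
            return date.fromisoformat(value)
    return None


# Hypothetical record: only received_date is populated.
example = {"collection_date": None, "received_date": "2024-03-15"}
print(effective_sample_date(example))
```

When `is_approximate_date` is true, the returned date should be treated with the same caution as the stored one.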

Relations

`taxa_files`

Table of all species-level taxa extracted by the Scylla ingest pipeline.

Field	Data type	Description	Restrictions
`taxa_files.taxon_id`	`integer`	The NCBI taxonomy id associated with the taxon.
`taxa_files.human_readable`	`text`	A human-readable name for the taxon.
`taxa_files.n_reads`	`integer`	The number of reads extracted for the taxon.
`taxa_files.total_bases`	`integer`	Total number of bases extracted for the taxon.
`taxa_files.avg_quality`	`decimal`	The mean quality of reads extracted for the taxon.
`taxa_files.mean_len`	`decimal`	The mean length of reads extracted for the taxon.
`taxa_files.rank`	`choice`	The rank of the taxon.	• Choices: `C`, `D`, `F`, `G`, `K`, `O`, `P`, `R`, `S`, `U`
`taxa_files.fastq_1`	`text`	Compressed FASTQ of extracted reads for the taxon.
`taxa_files.fastq_2`	`text`	Compressed FASTQ of extracted reads for the taxon.

`classifier_calls`

Table summarising the NCBI taxonomy ids, counts and ranks of all taxa found by the classifier during the Scylla ingest pipeline.

Field	Data type	Description	Restrictions
`classifier_calls.taxon_id`	`integer`	The NCBI taxonomy id associated with the taxon.
`classifier_calls.human_readable`	`text`	A human-readable name for the taxon.
`classifier_calls.percentage`	`decimal`	The percentage of the (dehumanised) sample that the taxon represents.
`classifier_calls.count_descendants`	`integer`	The number of reads mapping to this taxon and all descendant taxa.
`classifier_calls.count_direct`	`integer`	The number of reads mapping directly to the taxon.
`classifier_calls.rank`	`choice`	The rank of the taxon.	• Choices: `C`, `D`, `F`, `G`, `K`, `O`, `P`, `R`, `S`, `U`
`classifier_calls.raw_rank`	`text`	The rank of the taxon including an intermediate grading.
`classifier_calls.is_spike_in`	`bool`	The taxon is a spike-in.
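
The relationship between `rank` and `raw_rank` can be sketched as follows. This assumes that `raw_rank` follows the Kraken2 report convention, where a standard rank letter may be followed by a number giving the distance below that rank (for example `S1` or `G2`). The specification above does not confirm that format, and the helper name `split_raw_rank` is hypothetical.

```python
import re
from typing import Tuple

# Standard rank codes listed in the `rank` choices, with their Kraken2 meanings.
STANDARD_RANKS = {
    "U": "unclassified", "R": "root", "D": "domain", "K": "kingdom",
    "P": "phylum", "C": "class", "O": "order", "F": "family",
    "G": "genus", "S": "species",
}

# Assumed format: one rank letter, optionally followed by digits.
_RAW_RANK = re.compile(r"^([A-Z])(\d*)$")


def split_raw_rank(raw_rank: str) -> Tuple[str, int]:
    """Split e.g. 'S1' into ('S', 1); a bare letter has grading 0."""
    match = _RAW_RANK.match(raw_rank)
    if match is None or match.group(1) not in STANDARD_RANKS:
        raise ValueError(f"unrecognised raw_rank: {raw_rank!r}")
    letter, digits = match.groups()
    return letter, int(digits) if digits else 0


print(split_raw_rank("S1"))
print(split_raw_rank("G"))
```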

`spike_in_info`

Table containing results found for the provided spike-in.

Field	Data type	Description	Restrictions
`spike_in_info.taxon_id`	`integer`	The NCBI taxonomy id associated with the taxon.
`spike_in_info.human_readable`	`text`	A human-readable name for the taxon.
`spike_in_info.reference_header`	`text`	Reference header for the individual sequence within the provided spike-in.
`spike_in_info.mapped_count`	`integer`	Number of reads which aligned to a reference sequence for the provided spike-in.

`alignment_results`

Table containing alignment results from the viral alignment stage of the CLIMB-TRE/chimera pipeline.

Field	Data type	Description	Restrictions
`alignment_results.taxon_id`	`integer`	The NCBI taxonomy id associated with the taxon.
`alignment_results.human_readable`	`text`	Human-readable scientific name for the taxon.
`alignment_results.unique_accession`	`text`	Unique reference identifier in the alignment database (everything prior to the first whitespace in the FASTA header).
`alignment_results.accession_description`	`text`	The comment for the reference sequence within the alignment database.
`alignment_results.sequence_length`	`integer`	Length of the reference sequence in the alignment database.
`alignment_results.evenness_value`	`integer`	A percentage indicating how evenly read depths are distributed throughout the reference, with 0 being completely uneven, and 100 being perfectly even. Taken from https://academic.oup.com/nar/article/38/10/e116/2902812, under the “Calculation of evenness score” section, and calculated here: https://github.com/CLIMB-TRE/chimera/blob/dca3cacb949dabc902d0a12e5d11d36c6ac555fd/bin/generate_alignment_report.py#L102.
`alignment_results.mean_depth`	`integer`	Mean of all depth values across the alignment reference.
`alignment_results.coverage_1x`	`integer`	Percentage of the reference sequence covered with a depth of at least 1x.
`alignment_results.coverage_10x`	`integer`	Percentage of the reference covered with a depth of at least 10x.
`alignment_results.mapped_reads`	`integer`	Total number of reads mapped to the alignment reference.
`alignment_results.uniquely_mapped_reads`	`integer`	Total number of reads which uniquely map to each reference sequence, calculated for each reference sequence here: https://github.com/CLIMB-TRE/chimera/blob/bca6fe6fe293f8e56d9e70627a846dda6f3d886a/bin/generate_alignment_report.py#L59-L82.
`alignment_results.mapped_bases`	`integer`	Approximation for the total number of bases mapped to the alignment reference, calculated from the length of the reference sequence multiplied by the mean depth of alignments to that reference.
`alignment_results.mean_read_identity`	`decimal`	Mean of read identities across all alignments. Can be considered an approximation for identity of the source genome with the reference sequence. Calculated for each read here: https://github.com/CLIMB-TRE/chimera/blob/dca3cacb949dabc902d0a12e5d11d36c6ac555fd/bin/generate_alignment_report.py#L58
`alignment_results.read_duplication_rate`	`decimal`	Proportion of reads that start and end at the same alignment reference positions as at least one other read within the alignment. Calculated here: https://github.com/CLIMB-TRE/chimera/blob/dca3cacb949dabc902d0a12e5d11d36c6ac555fd/bin/generate_alignment_report.py#L76-L83
`alignment_results.forward_proportion`	`decimal`	Proportion of reads which aligned to the forward strand. Between 0 and 1, where 0 means all reads aligned to the reverse strand and 1 means all reads aligned to the forward strand. True hits should be close to 0.5 at any reasonable mean depth.
`alignment_results.mean_read_length`	`decimal`	Mean read length (inferred from CIGAR string) of all reads aligned to the reference sequence in question.
`alignment_results.mean_alignment_length`	`decimal`	Mean length of all alignments to the reference - different to mean read length aligned to the reference, since it only considers the aligned section of the reads.
`alignment_results.mean_alignment_proportion`	`decimal`	Mean of the proportion of each read which makes up the alignment to the reference (functionally mean_alignment_length / mean_read_length)
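
The two derived quantities described in this table, `mapped_bases` and `mean_alignment_proportion`, can be illustrated with a short sketch. It only restates the arithmetic given in the descriptions and is not taken from the chimera pipeline. The function names are hypothetical, and rounding the product to an integer is an assumption.

```python
def approx_mapped_bases(sequence_length: int, mean_depth: float) -> int:
    """Approximate mapped bases as reference length multiplied by mean depth."""
    return round(sequence_length * mean_depth)


def approx_alignment_proportion(mean_alignment_length: float,
                                mean_read_length: float) -> float:
    """Approximate mean_alignment_proportion as alignment length / read length."""
    if mean_read_length <= 0:
        raise ValueError("mean_read_length must be positive")
    return mean_alignment_length / mean_read_length


# Hypothetical values for a single reference sequence.
print(approx_mapped_bases(sequence_length=29903, mean_depth=12))
print(approx_alignment_proportion(900.0, 1000.0))
```

The second function is described above only as functionally equivalent to the stored field, so small differences from the stored per-read mean are to be expected.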

`sylph_results`

Table containing results from the Sylph classification stage of the CLIMB-TRE/chimera pipeline.

Field	Data type	Description	Restrictions
`sylph_results.taxon_id`	`integer`	The NCBI taxonomy id associated with the taxon.
`sylph_results.human_readable`	`text`	Human-readable scientific name for the taxon.
`sylph_results.gtdb_taxon_string`	`text`	Description of the taxonomic placement of the source contig within the Sylph database using GTDB's taxon string format.
`sylph_results.gtdb_assembly_id`	`text`	Assembly ID (often a GenBank accession) for the contig within the sylph database, taken from GTDB.
`sylph_results.gtdb_contig_header`	`text`	The original FASTA record header as it appears in GTDB. Identical to 'Contig_name' field in sylph profile output.
`sylph_results.taxonomic_abundance`	`decimal`	Normalized taxonomic abundance as a percentage. Identical to 'Taxonomic_abundance' in sylph profile output.
`sylph_results.sequence_abundance`	`decimal`	Normalized sequence abundance as a percentage. Identical to 'Sequence_abundance' in sylph profile output.
`sylph_results.adjusted_ani`	`decimal`	If coverage adjustment is possible (cov is < 3x cov): returns coverage-adjusted ANI (Average Nucleotide Identity). If coverage is too low/high: returns naive_ani. Identical to 'Adjusted_ANI' in sylph profile output.
`sylph_results.ani_confidence_interval`	`text`	[5%,95%] confidence intervals. If coverage adjustment is possible: float-float e.g. 98.52-99.55. If coverage is too low/high: NA-NA is given. Identical to 'ANI_5-95_percentile' field in sylph profile output.
`sylph_results.effective_coverage`	`decimal`	Estimated 'λeff' value. The true value is not calculated; it is estimated from k-mers. More information is available in the sylph paper: https://www.nature.com/articles/s41587-024-02412-y. If coverage adjustment is possible, the lambda estimate is given. Identical to 'Eff_cov' field in sylph profile output.
`sylph_results.effective_coverage_confidence_interval`	`text`	[5%, 95%] confidence intervals for lambda. Same format rules as 'ani_confidence_interval'. Identical to 'Lambda_5-95_percentile' field in sylph profile output.
`sylph_results.median_kmer_cov`	`integer`	Median k-mer multiplicity for k-mers with >= 1 multiplicity. Identical to 'Median_cov' field in sylph profile output.
`sylph_results.mean_kmer_cov`	`decimal`	Mean k-mer multiplicity for k-mers with >= 1 multiplicity. Identical to 'Mean_cov_geq1' field in sylph profile output.
`sylph_results.containment_index`	`text`	int/int showing the containment index (number of k-mers found in sample divided by total k-mers), e.g. 959/1053. Identical to 'Containment_ind' field in sylph profile output.
`sylph_results.naive_ani`	`decimal`	Containment ANI without coverage adjustment. Identical to 'Naive_ANI' field in sylph profile output.
`sylph_results.kmers_reassigned`	`integer`	The number of k-mers reassigned away from the genome. Identical to 'Kmers_reassigned' field in sylph profile output.
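
Several `sylph_results` fields are stored as text in a compact format: the confidence intervals (`float-float` or `NA-NA`) and `containment_index` (`int/int`). A minimal parsing sketch follows, using only the formats and example values given in the descriptions above. The function names are hypothetical.

```python
from typing import Optional, Tuple


def parse_interval(value: str) -> Optional[Tuple[float, float]]:
    """Parse '98.52-99.55' into (98.52, 99.55); return None for 'NA-NA'."""
    if value == "NA-NA":
        return None
    low, high = value.split("-")
    return float(low), float(high)


def containment_fraction(value: str) -> float:
    """Convert '959/1053' into the fraction of k-mers found in the sample."""
    found, total = value.split("/")
    return int(found) / int(total)


print(parse_interval("98.52-99.55"))
print(parse_interval("NA-NA"))
print(containment_fraction("959/1053"))
```

The same `parse_interval` helper applies to both `ani_confidence_interval` and `effective_coverage_confidence_interval`, since the latter is described as following the same format rules.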