synthSCAPE Analysis Specification

Analysis fields

Field Data type Description Restrictions
published_date date The date the object was published in Onyx. • Output format: iso-8601
site choice The site or sequencing centre providing the data. • Choices: bham, synthscape, ukhsa
climb_id text Unique identifier for a project record in Onyx.
biosample_id text The sequencing provider's identifier for a sample.
biosample_source_id text Unique identifier for an individual to permit multiple samples from the same individual to be linked.
run_id text Unique identifier assigned to the run by the sequencing instrument.
platform choice The platform used to sequence the data. • Choices: illumina, illumina.se, ont
input_type choice The type of input sequenced. • Choices: community_standard, negative_control, positive_control, specimen, validation_material
specimen_type_details choice Named control or standard for specimens. • Choices: asymptomatic, respiratory_infection
control_type_details choice Named control or standard for positive and negative controls. • Choices: NIBSC_11/242, NIBSC_20/170, bacillus_ms2phage, resp_matrix_mc110, water_extraction_control, zepto_rp2.1, zymo-mc_D6300
sample_source choice The source from which the sample was collected. • Choices: blood, environment, faecal, lower_respiratory, nose_and_throat, other, plasma, pleural_fluid, stool, tissue, upper_respiratory, urine
sample_type choice The type of sampling method used. • Choices: aspirate, bal, biopsy, other, sputum, swab
spike_in choice The type of spike-in used in the run. • Choices: ERCC-RNA_4456740, bacillus_ms2phage, ms2-phage, none, phix, tobacco_mosaic_virus, zymo_D6320, zymo_D6321
spike_in_result choice Result assigned by scylla for the provided spike-in. • Choices: fail, partial, pass
collection_date date The date the sample was collected. • Output format: YYYY-MM-DD
received_date date The date the sample was received by the sequencing centre (if collection_date unavailable). • Output format: YYYY-MM-DD
is_approximate_date bool The date is approximate, e.g. the sample is from a public repository and it is unclear whether the date corresponds to collection or publishing.
batch_id text Used to identify samples prepared in the same laboratory batch (e.g. extraction, library and/or sequencing).
study_id text Used to identify the study, or to indicate that the sample is an NHS residual sample.
study_centre_id text Used to identify sequencing centre.
sequence_purpose choice Used to differentiate between clinical or research studies. • Choices: clinical, research
governance_status choice Whether the patient consented to their sample being used for research purposes. • Choices: consented_for_research, no_consent_for_research, open
iso_country choice Country that the sample was collected in, using ISO-3166-1 alpha-2 codes (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), unless within United Kingdom. If so, use ISO-3166-2:GB (https://en.wikipedia.org/wiki/ISO_3166-2:GB). • Choices: AD, AE, AF, AG, AI, AL, AM, AO, AQ, AR, AS, AT, AU, AW, AX, AZ, BA, BB, BD, BE, ...
iso_region choice Region that the sample was collected in, using the second level subdivision codes of ISO-3166-2:GB (https://www.iso.org/obp/ui/#iso:code:3166:GB). • Choices: GB-ABC, GB-ABD, GB-ABE, GB-AGB, GB-AGY, GB-AND, GB-ANN, GB-ANS, GB-BAS, GB-BBD, GB-BCP, GB-BDF, GB-BDG, GB-BEN, GB-BEX, GB-BFS, GB-BGE, GB-BGW, GB-BIR, GB-BKM, ...
extraction_enrichment_protocol text Details of nucleic acid extraction and optional enrichment steps.
library_protocol text Details of sequencing library construction.
sequencing_protocol text Details of sequencing.
protocol_arm choice Used to indicate arm for protocols which have separate arms for bacterial and viral nucleic acids. • Choices: bacterial, viral
bioinformatics_protocol text Details of the initial bioinformatics protocol, for example the versions of basecalling software and models used, and any read quality filtering/trimming employed.
dehumanisation_protocol text Details of bioinformatics method used for human read removal.
is_public_dataset bool The sample is from a public dataset. Please only set this after it has been made public.
public_database_name choice The public repository in which the data is held. • Choices: ENA, SRA
public_database_accession text The accession for the data in the public database.
ingest_report text HTML report summarising the read profile and taxa identified.
taxon_reports text Folder of all classification output files.
human_filtered_reads_1 text Compressed FASTQ of input reads that have been filtered for human reads.
human_filtered_reads_2 text Compressed FASTQ of input reads that have been filtered for human reads.
unclassified_reads_1 text Compressed FASTQ of input reads which could not be classified.
unclassified_reads_2 text Compressed FASTQ of input reads which could not be classified.
viral_reads_1 text Compressed FASTQ of input reads which were classified as viral.
viral_reads_2 text Compressed FASTQ of input reads which were classified as viral.
viral_and_unclassified_reads_1 text Compressed FASTQ of input reads which were classified as viral or were unclassified.
viral_and_unclassified_reads_2 text Compressed FASTQ of input reads which were classified as viral or were unclassified.
total_bases integer Total number of bases in the input FASTQ file(s), before any filtering.
classifier choice The classifier used. • Choices: Kraken2
classifier_version text Version of the classifier used.
classifier_db choice Database used for read classification. • Choices: PlusPF
classifier_db_date date Date classifier database was produced. • Output format: YYYY-MM-DD
ncbi_taxonomy_date date Date that the NCBI taxonomy dump was produced. • Output format: YYYY-MM-DD
scylla_version text Version of the scylla pipeline used.
chimera_bam text BAM file of the human-filtered read fraction aligned to the Zeus database.
is_chimera_published bool Whether chimera has been run on this record or not.
alignment_db_version text Version of the Zeus database used.
sylph_db_version text Sylph database version utilised to produce Sylph classifications.
source_climb_id text CLIMB ID of the record used as a base dataset.
spiked_ids array JSON list of taxon ids included in the spike-in. • Array type: integer
applications array JSON list of applications. • Array type: text
methods structure JSON dictionary containing methods.
taxa_files relation Table of all species-level taxa extracted.
taxa_files.taxon_id integer The NCBI taxonomy id associated with the taxon.
taxa_files.human_readable text A human-readable name for the taxon.
taxa_files.n_reads integer The number of reads extracted for the taxon.
taxa_files.total_bases integer Total number of bases extracted for the taxon.
taxa_files.avg_quality decimal The mean quality of reads extracted for the taxon.
taxa_files.mean_len decimal The mean length of reads extracted for the taxon.
taxa_files.rank choice The rank of the taxon. • Choices: C, D, F, G, K, O, P, R, S, U
taxa_files.fastq_1 text Compressed FASTQ of extracted reads for the taxon.
taxa_files.fastq_2 text Compressed FASTQ of extracted reads for the taxon.
classifier_calls relation Table summarising the NCBI taxonomy ids, counts and ranks of all taxa found by the classifier.
classifier_calls.taxon_id integer The NCBI taxonomy id associated with the taxon.
classifier_calls.human_readable text A human-readable name for the taxon.
classifier_calls.percentage decimal The percentage of the (dehumanised) sample that the taxon represents.
classifier_calls.count_descendants integer The number of reads mapping to this taxon and all descendant taxa.
classifier_calls.count_direct integer The number of reads mapping directly to the taxon.
classifier_calls.rank choice The rank of the taxon. • Choices: C, D, F, G, K, O, P, R, S, U
classifier_calls.raw_rank text The rank of the taxon including an intermediate grading.
classifier_calls.is_spike_in bool The taxon is a spike-in.
spike_in_info relation Table containing taxonomic results found for the provided spike-in.
spike_in_info.taxon_id integer The NCBI taxonomy id associated with the taxon.
spike_in_info.human_readable text A human-readable name for the taxon.
spike_in_info.reference_header text Reference header for the individual sequence within the provided spike-in.
spike_in_info.mapped_count integer Number of reads which aligned to a reference sequence for the provided spike-in.
alignment_results relation Table containing alignment results.
alignment_results.taxon_id integer The NCBI taxonomy id associated with the taxon.
alignment_results.human_readable text Human-readable scientific name for the taxon.
alignment_results.unique_accession text Unique reference identifier in the alignment database (everything prior to the first whitespace in the FASTA header).
alignment_results.accession_description text The comment for the reference sequence within the alignment database.
alignment_results.sequence_length integer Length of the reference sequence in the alignment database.
alignment_results.evenness_value integer A percentage indicating how evenly read depths are distributed throughout the reference, with 0 being completely uneven, and 100 being perfectly even. Taken from https://academic.oup.com/nar/article/38/10/e116/2902812, under the “Calculation of evenness score” section, and calculated here: https://github.com/CLIMB-TRE/chimera/blob/dca3cacb949dabc902d0a12e5d11d36c6ac555fd/bin/generate_alignment_report.py#L102.
alignment_results.mean_depth integer Mean of all depth values across the alignment reference.
alignment_results.coverage_1x integer Percentage of the reference sequence covered with a depth of at least 1x.
alignment_results.coverage_10x integer Percentage of the reference covered with a depth of at least 10x.
alignment_results.mapped_reads integer Total number of reads mapped to the alignment reference.
alignment_results.uniquely_mapped_reads integer Total number of reads which uniquely map to a reference and position within that reference (MAPQ >= 60).
alignment_results.mapped_bases integer Approximation for the total number of bases mapped to the alignment reference, calculated from the length of the reference sequence multiplied by the mean depth of alignments to that reference.
alignment_results.mean_read_identity decimal Mean of read identities across all alignments. Can be considered an approximation for identity of the source genome with the reference sequence. Calculated for each read here: https://github.com/CLIMB-TRE/chimera/blob/dca3cacb949dabc902d0a12e5d11d36c6ac555fd/bin/generate_alignment_report.py#L58
alignment_results.read_duplication_rate decimal Proportion of reads that start and end at the same alignment reference positions as at least one other read within the alignment. Calculated here: https://github.com/CLIMB-TRE/chimera/blob/dca3cacb949dabc902d0a12e5d11d36c6ac555fd/bin/generate_alignment_report.py#L76-L83
alignment_results.forward_proportion decimal Proportion of reads that aligned to the forward strand, between 0 and 1: 0 means all reads aligned to the reverse strand and 1 means all reads aligned to the forward strand. True hits should be close to 0.5 at any reasonable mean depth.
alignment_results.mean_alignment_length decimal Mean length of all alignments to the reference. This differs from the mean length of reads aligned to the reference, since it only considers the aligned section of each read.
sylph_results relation Table containing sylph results.
sylph_results.taxon_id integer The NCBI taxonomy id associated with the taxon.
sylph_results.human_readable text Human-readable scientific name for the taxon.
sylph_results.gtdb_taxon_string text Description of the taxonomic placement of the source contig within the Sylph database using GTDB's taxon string format.
sylph_results.gtdb_assembly_id text Assembly ID (often genbank accession) for the contig within the sylph database, taken from GTDB.
sylph_results.gtdb_contig_header text The original FASTA record header as it appears in GTDB. Identical to 'Contig_name' field in sylph profile output.
sylph_results.taxonomic_abundance decimal Normalized taxonomic abundance as a percentage. Identical to 'Taxonomic_abundance' in sylph profile output.
sylph_results.sequence_abundance decimal Normalized sequence abundance as a percentage. Identical to 'Sequence_abundance' in sylph profile output.
sylph_results.adjusted_ani decimal If coverage adjustment is possible (cov is < 3x cov): returns coverage-adjusted ANI (Average Nucleotide Identity). If coverage is too low/high: returns naive_ani. Identical to 'Adjusted_ANI' in sylph profile output.
sylph_results.ani_confidence_interval text [5%,95%] confidence intervals. If coverage adjustment is possible: float-float e.g. 98.52-99.55. If coverage is too low/high: NA-NA is given. Identical to 'ANI_5-95_percentile' field in sylph profile output.
sylph_results.effective_coverage decimal Estimated 'λeff' value. The true value is not calculated; it is estimated from k-mers. More information is available in the sylph paper: https://www.nature.com/articles/s41587-024-02412-y. If coverage adjustment is possible, the lambda estimate is given. Identical to 'Eff_cov' field in sylph profile output.
sylph_results.effective_coverage_confidence_interval text [5%, 95%] confidence intervals for lambda. Same format rules as 'ani_confidence_interval'. Identical to 'Lambda_5-95_percentile' field in sylph profile output.
sylph_results.median_kmer_cov integer Median k-mer multiplicity for k-mers with >= 1 multiplicity. Identical to 'Median_cov' field in sylph profile output.
sylph_results.mean_kmer_cov decimal Mean k-mer multiplicity for k-mers with >= 1 multiplicity. Identical to 'Mean_cov_geq1' field in sylph profile output.
sylph_results.containment_index text int/int showing the containment index (number of k-mers found in sample divided by total k-mers), e.g. 959/1053. Identical to 'Containment_ind' field in sylph profile output.
sylph_results.naive_ani decimal Containment ANI without coverage adjustment. Identical to 'Naive_ANI' field in sylph profile output.
sylph_results.kmers_reassigned integer The number of k-mers reassigned away from the genome. Identical to 'Kmers_reassigned' field in sylph profile output.
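
Example: parsing text-encoded sylph_results fields

Several sylph_results fields are stored as text but carry numeric content: containment_index is given as int/int (e.g. 959/1053), and the two confidence interval fields are given as float-float (e.g. 98.52-99.55) or NA-NA. The Python sketch below shows one way a consumer might convert these strings into numbers. The function names are hypothetical and are not part of Onyx, scylla or sylph. It assumes interval bounds are plain positive decimals (no negative values or scientific notation), which matches the examples in the table but is not guaranteed by it.

```python
from typing import Optional, Tuple


def parse_containment_index(value: str) -> Tuple[int, int, Optional[float]]:
    """Split an 'int/int' containment index such as '959/1053'.

    Returns (kmers_found, kmers_total, fraction). The fraction is None
    when the total is zero, to avoid a division error.
    """
    found_str, total_str = value.strip().split("/")
    found, total = int(found_str), int(total_str)
    fraction = found / total if total else None
    return found, total, fraction


def parse_percentile_interval(value: str) -> Optional[Tuple[float, float]]:
    """Parse a 'low-high' interval such as '98.52-99.55'.

    Returns None for 'NA-NA', which the specification uses when coverage
    is too low or too high for adjustment. Assumes both bounds are plain
    positive decimals, so the only '-' is the separator.
    """
    value = value.strip()
    if value == "NA-NA":
        return None
    low_str, high_str = value.split("-")
    return float(low_str), float(high_str)
```

For instance, parse_percentile_interval("NA-NA") yields None, so callers can tell "no interval available" apart from a numeric interval without comparing strings.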