synthSCAPE Analysis Specification

Analysis fields

Field Data type Description Restrictions
climb_id text Unique identifier for a project record in Onyx.
published_date date The date the project record was published in Onyx. • Output format: iso-8601
site choice The site or sequencing centre providing the data. • Choices: bham, gstt, public, synthscape, ukhsa
biosample_id text The sequencing provider's identifier for a sample.
biosample_source_id text Unique identifier for an individual to permit multiple samples from the same individual to be linked.
run_id text Unique identifier assigned to the run by the sequencing instrument.
platform choice The platform used to sequence the data. • Choices: illumina, illumina.se, ont
input_type choice The type of input sequenced. • Choices: community_standard, negative_control, positive_control, specimen, validation_material
specimen_type_details choice Named control or standard for specimens. • Choices: asymptomatic, respiratory_infection
control_type_details choice Named control or standard for positive and negative controls. • Choices: NIBSC_11/242, NIBSC_20/170, water_extraction_control, zepto_rp2.1, zymo-mc_D6300
sample_source choice The source from which the sample was collected. • Choices: blood, environment, faecal, lower_respiratory, nose_and_throat, other, plasma, pleural_fluid, stool, tissue, upper_respiratory, urine
sample_type choice The type of sampling method used. • Choices: aspirate, bal, biopsy, other, sputum, swab
spike_in choice The type of spike-in used in the run. • Choices: ERCC-RNA_4456740, ms2-phage, none, phix, tobacco_mosaic_virus, zymo_D6320, zymo_D6321
spike_in_result choice Result assigned by scylla for the provided spike-in. • Choices: fail, partial, pass
collection_date date The date the sample was collected. • Output format: YYYY-MM-DD
received_date date The date the sample was received by the sequencing centre (if collection_date unavailable). • Output format: YYYY-MM-DD
is_approximate_date bool The date is approximate, e.g. the sample is from a public repository and it is unclear whether the date corresponds to collection or publication.
batch_id text Used to identify samples prepared in the same laboratory batch (e.g. extraction, library and/or sequencing).
study_id text Identifies the study, or indicates that the sample is an NHS residual sample.
study_centre_id text Identifies the sequencing centre.
sequence_purpose choice Used to differentiate between clinical or research studies. • Choices: clinical, research
governance_status choice Whether the patient consented to their sample being used for research purposes. • Choices: consented_for_research, no_consent_for_research, open
iso_country choice Country that the sample was collected in, using ISO-3166-1 alpha-2 codes (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), unless collected within the United Kingdom, in which case use ISO-3166-2:GB (https://en.wikipedia.org/wiki/ISO_3166-2:GB). • Choices: AD, AE, AF, AG, AI, AL, AM, AO, AQ, AR, AS, AT, AU, AW, AX, AZ, BA, BB, BD, BE, ...
iso_region choice Region that the sample was collected in, using the second level subdivision codes of ISO-3166-2:GB (https://www.iso.org/obp/ui/#iso:code:3166:GB). • Choices: GB-ABC, GB-ABD, GB-ABE, GB-AGB, GB-AGY, GB-AND, GB-ANN, GB-ANS, GB-BAS, GB-BBD, GB-BCP, GB-BDF, GB-BDG, GB-BEN, GB-BEX, GB-BFS, GB-BGE, GB-BGW, GB-BIR, GB-BKM, ...
extraction_enrichment_protocol text Details of nucleic acid extraction and optional enrichment steps.
library_protocol text Details of sequencing library construction.
sequencing_protocol text Details of sequencing.
bioinformatics_protocol text Details of the initial bioinformatics protocol, for example versions of basecalling software and models used, and any read quality filtering/trimming employed.
dehumanisation_protocol text Details of bioinformatics method used for human read removal.
is_public_dataset bool The sample is from a public dataset. Please only set this after it has been made public.
public_database_name choice The public repository where the data is held. • Choices: ENA, SRA
public_database_accession text The accession for the data in the public database.
ingest_report text HTML report summarising the read profile and taxa identified.
taxon_reports text Folder of all classification output files.
human_filtered_reads_1 text Compressed FASTQ of input reads that have been filtered for human reads.
human_filtered_reads_2 text Compressed FASTQ of input reads that have been filtered for human reads.
unclassified_reads_1 text Compressed FASTQ of input reads which could not be classified.
unclassified_reads_2 text Compressed FASTQ of input reads which could not be classified.
viral_reads_1 text Compressed FASTQ of input reads which were classified as viral.
viral_reads_2 text Compressed FASTQ of input reads which were classified as viral.
viral_and_unclassified_reads_1 text Compressed FASTQ of input reads which were classified as viral or were unclassified.
viral_and_unclassified_reads_2 text Compressed FASTQ of input reads which were classified as viral or were unclassified.
classifier choice The classifier used. • Choices: Kraken2
classifier_version text Version of the classifier used.
classifier_db choice Database used for read classification. • Choices: PlusPF
classifier_db_date date Date classifier database was produced. • Output format: YYYY-MM-DD
ncbi_taxonomy_date date Date that the NCBI taxonomy dump was produced. • Output format: YYYY-MM-DD
scylla_version text Version of the scylla pipeline used.
source_climb_id text CLIMB ID of the record used as a base dataset.
spiked_ids array JSON list of taxon ids included in the spike-in. • Array type: integer
applications array JSON list of applications. • Array type: text
methods structure JSON dictionary containing methods.
taxa_files relation Table of all species-level taxa extracted.
taxa_files.taxon_id integer The NCBI taxonomy id associated with the taxon.
taxa_files.human_readable text A human-readable name for the taxon.
taxa_files.n_reads integer The number of reads extracted for the taxon.
taxa_files.avg_quality decimal The mean quality of reads extracted for the taxon.
taxa_files.mean_len decimal The mean length of reads extracted for the taxon.
taxa_files.rank choice The rank of the taxon. • Choices: C, D, F, G, K, O, P, R, S, U
taxa_files.fastq_1 text Compressed FASTQ of extracted reads for the taxon.
taxa_files.fastq_2 text Compressed FASTQ of extracted reads for the taxon.
classifier_calls relation Table summarising the NCBI taxonomy ids, counts and ranks of all taxa found by the classifier.
classifier_calls.taxon_id integer The NCBI taxonomy id associated with the taxon.
classifier_calls.human_readable text A human-readable name for the taxon.
classifier_calls.percentage decimal The percentage of the (dehumanised) sample that the taxon represents.
classifier_calls.count_descendants integer The number of reads mapping to this taxon and all descendant taxa.
classifier_calls.count_direct integer The number of reads mapping directly to the taxon.
classifier_calls.rank choice The rank of the taxon. • Choices: C, D, F, G, K, O, P, R, S, U
classifier_calls.raw_rank text The rank of the taxon including an intermediate grading.
classifier_calls.is_spike_in bool The taxon is a spike-in.
spike_in_info relation Table containing taxonomic results found for the provided spike-in.
spike_in_info.taxon_id integer The NCBI taxonomy id associated with the taxon.
spike_in_info.human_readable text A human-readable name for the taxon.
spike_in_info.reference_header text Reference header for the individual sequence within the provided spike-in.
spike_in_info.mapped_count integer Number of reads which aligned to a reference sequence for the provided spike-in.
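
The restrictions in the table above (choice lists, YYYY-MM-DD dates, typed arrays) can be checked locally before a record is submitted. The following is a minimal sketch in plain Python, not part of Onyx or scylla: the function names validate_record and is_iso_date are hypothetical, the record is assumed to be an ordinary dictionary keyed by field name, and only a few of the choice and date fields are included for illustration.

```python
from datetime import date

# Choice sets copied from the field table (a subset, for illustration only).
CHOICES = {
    "site": {"bham", "gstt", "public", "synthscape", "ukhsa"},
    "platform": {"illumina", "illumina.se", "ont"},
    "spike_in_result": {"fail", "partial", "pass"},
    "sequence_purpose": {"clinical", "research"},
}

# Date fields whose documented output format is YYYY-MM-DD (again a subset).
DATE_FIELDS = ("collection_date", "received_date", "classifier_db_date")


def is_iso_date(value) -> bool:
    """True only for strings that are valid calendar dates in YYYY-MM-DD form."""
    try:
        # Round-tripping rejects other ISO spellings such as "20240115".
        return date.fromisoformat(value).isoformat() == value
    except (TypeError, ValueError):
        return False


def validate_record(record: dict) -> list[str]:
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    for field, allowed in CHOICES.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: {value!r} is not an allowed choice")
    for field in DATE_FIELDS:
        value = record.get(field)
        if value is not None and not is_iso_date(value):
            problems.append(f"{field}: {value!r} is not a YYYY-MM-DD date")
    spiked = record.get("spiked_ids")
    if spiked is not None:
        # spiked_ids is documented as an array of integers; bool is excluded
        # explicitly because it is a subclass of int in Python.
        ok = isinstance(spiked, list) and all(
            isinstance(t, int) and not isinstance(t, bool) for t in spiked
        )
        if not ok:
            problems.append("spiked_ids: expected a list of integers")
    return problems


if __name__ == "__main__":
    example = {
        "site": "bham",
        "platform": "nanopore",           # not in the platform choices
        "collection_date": "2024-02-30",  # not a real calendar date
        "spiked_ids": [12242, "x"],       # contains a non-integer
    }
    for problem in validate_record(example):
        print(problem)
```

Missing fields are skipped rather than reported, since the table does not state which fields are required; a stricter check would need that information from the project's own documentation.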