synthSCAPE Analysis Specification

Analysis fields

Field	Data type	Description	Restrictions
`published_date`	`date`	The date the object was published in Onyx.	• Output format: `iso-8601`
`site`	`choice`	The site or sequencing centre providing the data.	• Choices: `bham`, `synthscape`, `ukhsa`
`climb_id`	`text`	Unique identifier for a project record in Onyx.
`biosample_id`	`text`	The sequencing provider's identifier for a sample.
`biosample_source_id`	`text`	Unique identifier for an individual to permit multiple samples from the same individual to be linked.
`run_id`	`text`	Unique identifier assigned to the run by the sequencing instrument.
`platform`	`choice`	The platform used to sequence the data.	• Choices: `illumina`, `illumina.se`, `ont`
`input_type`	`choice`	The type of input sequenced.	• Choices: `community_standard`, `negative_control`, `positive_control`, `specimen`, `validation_material`
`specimen_type_details`	`choice`	Further details of the specimen type.	• Choices: `asymptomatic`, `respiratory_infection`
`control_type_details`	`choice`	Named control or standard for positive and negative controls.	• Choices: `NIBSC_11/242`, `NIBSC_20/170`, `resp_matrix_mc110`, `water_extraction_control`, `zepto_rp2.1`, `zymo-mc_D6300`
`sample_source`	`choice`	The source from which the sample was collected.	• Choices: `blood`, `environment`, `faecal`, `lower_respiratory`, `nose_and_throat`, `other`, `plasma`, `pleural_fluid`, `stool`, `tissue`, `upper_respiratory`, `urine`
`sample_type`	`choice`	The type of sampling method used.	• Choices: `aspirate`, `bal`, `biopsy`, `other`, `sputum`, `swab`
`spike_in`	`choice`	The type of spike-in used in the run.	• Choices: `ERCC-RNA_4456740`, `bacillus_ms2phage`, `ms2-phage`, `none`, `phix`, `tobacco_mosaic_virus`, `zymo_D6320`, `zymo_D6321`
`spike_in_result`	`choice`	Result assigned by scylla for the provided spike-in.	• Choices: `fail`, `partial`, `pass`
`collection_date`	`date`	The date the sample was collected.	• Output format: `YYYY-MM-DD`
`received_date`	`date`	The date the sample was received by the sequencing centre (used if `collection_date` is unavailable).	• Output format: `YYYY-MM-DD`
`is_approximate_date`	`bool`	The date is approximate, e.g. the sample is from a public repository and it is unclear whether the date corresponds to collection or publication.
`batch_id`	`text`	Used to identify samples prepared in the same laboratory batch (e.g. extraction, library and/or sequencing).
`study_id`	`text`	Identifies the study, or indicates that the sample is an NHS residual sample.
`study_centre_id`	`text`	Identifies the sequencing centre.
`sequence_purpose`	`choice`	Differentiates between clinical and research studies.	• Choices: `clinical`, `research`
`governance_status`	`choice`	Whether the patient consented to their sample being used for research purposes.	• Choices: `consented_for_research`, `no_consent_for_research`, `open`
`iso_country`	`choice`	Country in which the sample was collected, using ISO-3166-1 alpha-2 codes (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), unless within the United Kingdom, in which case use ISO-3166-2:GB (https://en.wikipedia.org/wiki/ISO_3166-2:GB).	• Choices: `AD`, `AE`, `AF`, `AG`, `AI`, `AL`, `AM`, `AO`, `AQ`, `AR`, `AS`, `AT`, `AU`, `AW`, `AX`, `AZ`, `BA`, `BB`, `BD`, `BE`, ...
`iso_region`	`choice`	Region that the sample was collected in, using the second level subdivision codes of ISO-3166-2:GB (https://www.iso.org/obp/ui/#iso:code:3166:GB).	• Choices: `GB-ABC`, `GB-ABD`, `GB-ABE`, `GB-AGB`, `GB-AGY`, `GB-AND`, `GB-ANN`, `GB-ANS`, `GB-BAS`, `GB-BBD`, `GB-BCP`, `GB-BDF`, `GB-BDG`, `GB-BEN`, `GB-BEX`, `GB-BFS`, `GB-BGE`, `GB-BGW`, `GB-BIR`, `GB-BKM`, ...
`extraction_enrichment_protocol`	`text`	Details of nucleic acid extraction and optional enrichment steps.
`library_protocol`	`text`	Details of sequencing library construction.
`sequencing_protocol`	`text`	Details of sequencing.
`bioinformatics_protocol`	`text`	Details of the initial bioinformatics protocol, for example versions of basecalling software and models used, and any read quality filtering/trimming employed.
`dehumanisation_protocol`	`text`	Details of bioinformatics method used for human read removal.
`is_public_dataset`	`bool`	The sample is from a public dataset. Please only set this after it has been made public.
`public_database_name`	`choice`	The public repository where the data is held.	• Choices: `ENA`, `SRA`
`public_database_accession`	`text`	The accession for the data in the public database.
`ingest_report`	`text`	HTML report summarising the read profile and taxa identified.
`taxon_reports`	`text`	Folder of all classification output files.
`human_filtered_reads_1`	`text`	Compressed FASTQ of input reads that have been filtered for human reads.
`human_filtered_reads_2`	`text`	Compressed FASTQ of input reads that have been filtered for human reads.
`unclassified_reads_1`	`text`	Compressed FASTQ of input reads which could not be classified.
`unclassified_reads_2`	`text`	Compressed FASTQ of input reads which could not be classified.
`viral_reads_1`	`text`	Compressed FASTQ of input reads which were classified as viral.
`viral_reads_2`	`text`	Compressed FASTQ of input reads which were classified as viral.
`viral_and_unclassified_reads_1`	`text`	Compressed FASTQ of input reads which were classified as viral or were unclassified.
`viral_and_unclassified_reads_2`	`text`	Compressed FASTQ of input reads which were classified as viral or were unclassified.
`total_bases`	`integer`	Total number of bases in the input FASTQ file(s), before any filtering.
`classifier`	`choice`	The classifier used.	• Choices: `Kraken2`
`classifier_version`	`text`	Version of the classifier used.
`classifier_db`	`choice`	Database used for read classification.	• Choices: `PlusPF`
`classifier_db_date`	`date`	Date classifier database was produced.	• Output format: `YYYY-MM-DD`
`ncbi_taxonomy_date`	`date`	Date that the NCBI taxonomy dump was produced.	• Output format: `YYYY-MM-DD`
`scylla_version`	`text`	Version of the scylla pipeline used.
`source_climb_id`	`text`	CLIMB ID of the record used as a base dataset.
`spiked_ids`	`array`	JSON list of taxon ids included in the spike-in.	• Array type: `integer`
`applications`	`array`	JSON list of applications.	• Array type: `text`
`methods`	`structure`	JSON dictionary containing methods.
`taxa_files`	`relation`	Table of all species-level taxa extracted.
`taxa_files.taxon_id`	`integer`	The NCBI taxonomy id associated with the taxon.
`taxa_files.human_readable`	`text`	A human-readable name for the taxon.
`taxa_files.n_reads`	`integer`	The number of reads extracted for the taxon.
`taxa_files.total_bases`	`integer`	Total number of bases extracted for the taxon.
`taxa_files.avg_quality`	`decimal`	The mean quality of reads extracted for the taxon.
`taxa_files.mean_len`	`decimal`	The mean length of reads extracted for the taxon.
`taxa_files.rank`	`choice`	The rank of the taxon.	• Choices: `C`, `D`, `F`, `G`, `K`, `O`, `P`, `R`, `S`, `U`
`taxa_files.fastq_1`	`text`	Compressed FASTQ of extracted reads for the taxon.
`taxa_files.fastq_2`	`text`	Compressed FASTQ of extracted reads for the taxon.
`classifier_calls`	`relation`	Table summarising the NCBI taxonomy ids, counts and ranks of all taxa found by the classifier.
`classifier_calls.taxon_id`	`integer`	The NCBI taxonomy id associated with the taxon.
`classifier_calls.human_readable`	`text`	A human-readable name for the taxon.
`classifier_calls.percentage`	`decimal`	The percentage of the (dehumanised) sample that the taxon represents.
`classifier_calls.count_descendants`	`integer`	The number of reads mapping to this taxon and all descendant taxa.
`classifier_calls.count_direct`	`integer`	The number of reads mapping directly to the taxon.
`classifier_calls.rank`	`choice`	The rank of the taxon.	• Choices: `C`, `D`, `F`, `G`, `K`, `O`, `P`, `R`, `S`, `U`
`classifier_calls.raw_rank`	`text`	The rank of the taxon, including any intermediate grading.
`classifier_calls.is_spike_in`	`bool`	The taxon is a spike-in.
`spike_in_info`	`relation`	Table containing taxonomic results found for the provided spike-in.
`spike_in_info.taxon_id`	`integer`	The NCBI taxonomy id associated with the taxon.
`spike_in_info.human_readable`	`text`	A human-readable name for the taxon.
`spike_in_info.reference_header`	`text`	Reference header for the individual sequence within the provided spike-in.
`spike_in_info.mapped_count`	`integer`	Number of reads which aligned to a reference sequence for the provided spike-in.
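
To illustrate how the restrictions in the table might be applied, the following is a minimal sketch of a client-side check on a record represented as a plain Python dictionary. It is not part of Onyx or scylla: the names `CHOICES`, `DATE_FIELDS`, `RANKS`, `is_iso_day` and `validate_record` are hypothetical, only a handful of fields are covered, and the example values are made up. It assumes dates arrive as `YYYY-MM-DD` strings and that relation fields such as `classifier_calls` arrive as lists of dictionaries.

```python
import re
from datetime import date

# A subset of the choice restrictions, copied from the table above.
CHOICES = {
    "site": {"bham", "synthscape", "ukhsa"},
    "platform": {"illumina", "illumina.se", "ont"},
    "spike_in_result": {"fail", "partial", "pass"},
    "classifier": {"Kraken2"},
    "classifier_db": {"PlusPF"},
}

# Date fields whose output format is listed as YYYY-MM-DD.
DATE_FIELDS = (
    "collection_date",
    "received_date",
    "classifier_db_date",
    "ncbi_taxonomy_date",
)

# Rank codes shared by taxa_files.rank and classifier_calls.rank.
RANKS = {"C", "D", "F", "G", "K", "O", "P", "R", "S", "U"}

# Relations whose rows carry both a taxon_id and a rank.
RELATIONS_WITH_RANK = ("taxa_files", "classifier_calls")


def is_iso_day(value) -> bool:
    """True if value is a string naming a real calendar day as YYYY-MM-DD."""
    if not isinstance(value, str) or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def validate_record(record: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Choice fields: a value, when present, must be one of the listed choices.
    for field, allowed in CHOICES.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: {value!r} is not an allowed choice")
    # Date fields: a value, when present, must be a valid YYYY-MM-DD day.
    for field in DATE_FIELDS:
        value = record.get(field)
        if value is not None and not is_iso_day(value):
            problems.append(f"{field}: {value!r} is not a YYYY-MM-DD date")
    # Relation rows: taxon_id must be an integer and rank a known code.
    for relation in RELATIONS_WITH_RANK:
        for i, row in enumerate(record.get(relation, [])):
            taxon_id = row.get("taxon_id")
            if isinstance(taxon_id, bool) or not isinstance(taxon_id, int):
                problems.append(f"{relation}[{i}].taxon_id: expected an integer")
            if row.get("rank") not in RANKS:
                problems.append(
                    f"{relation}[{i}].rank: {row.get('rank')!r} is not an allowed choice"
                )
    return problems


# Hypothetical record: the date and the second classifier_calls row are invalid.
example = {
    "site": "bham",
    "platform": "ont",
    "collection_date": "2024-02-30",
    "classifier": "Kraken2",
    "classifier_calls": [
        {"taxon_id": 562, "rank": "S"},
        {"taxon_id": "562", "rank": "X"},
    ],
}

for problem in validate_record(example):
    print(problem)
```

Collecting problems in a list rather than raising on the first one lets a submitter see every issue in a record at once. Fields absent from the record are skipped here because the table does not state which fields are required.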