published_date |
date |
The date the object was published in Onyx. |
• Output format: iso-8601 |
site |
choice |
The site or sequencing centre providing the data. |
• Choices: bham , synthscape , ukhsa |
climb_id |
text |
Unique identifier for a project record in Onyx. |
|
biosample_id |
text |
The sequencing provider's identifier for a sample. |
|
biosample_source_id |
text |
Unique identifier for an individual to permit multiple samples from the same individual to be linked. |
|
run_id |
text |
Unique identifier assigned to the run by the sequencing instrument. |
|
platform |
choice |
The platform used to sequence the data. |
• Choices: illumina , illumina.se , ont |
input_type |
choice |
The type of input sequenced. |
• Choices: community_standard , negative_control , positive_control , specimen , validation_material |
specimen_type_details |
choice |
Named control or standard for specimens. |
• Choices: asymptomatic , respiratory_infection |
control_type_details |
choice |
Named control or standard for positive and negative controls. |
• Choices: NIBSC_11/242 , NIBSC_20/170 , bacillus_ms2phage , resp_matrix_mc110 , water_extraction_control , zepto_rp2.1 , zymo-mc_D6300 |
sample_source |
choice |
The source from which the sample was collected. |
• Choices: blood , environment , faecal , lower_respiratory , nose_and_throat , other , plasma , pleural_fluid , stool , tissue , upper_respiratory , urine |
sample_type |
choice |
The type of sampling method used. |
• Choices: aspirate , bal , biopsy , other , sputum , swab |
spike_in |
choice |
The type of spike-in used in the run. |
• Choices: ERCC-RNA_4456740 , bacillus_ms2phage , ms2-phage , none , phix , tobacco_mosaic_virus , zymo_D6320 , zymo_D6321 |
spike_in_result |
choice |
Result assigned by scylla for the provided spike-in. |
• Choices: fail , partial , pass |
collection_date |
date |
The date the sample was collected. |
• Output format: YYYY-MM-DD |
received_date |
date |
The date the sample was received by the sequencing centre (provided if collection_date is unavailable). |
• Output format: YYYY-MM-DD |
is_approximate_date |
bool |
The date is approximate, e.g. the sample is from a public repository and it is unclear whether the date corresponds to collection or publication. |
|
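The interplay of collection_date, received_date and is_approximate_date can be sketched as below. This is a minimal illustration, assuming a record is available as a plain dictionary keyed by the field names above with dates as YYYY-MM-DD strings; the helper name `sample_date` is hypothetical and not part of Onyx.

```python
from datetime import date
from typing import Optional, Tuple


def sample_date(record: dict) -> Tuple[Optional[date], bool]:
    """Return (best available date, is_approximate) for a record.

    Hypothetical helper: prefers collection_date and falls back to
    received_date, both expected as YYYY-MM-DD strings.
    """
    raw = record.get("collection_date") or record.get("received_date")
    parsed = date.fromisoformat(raw) if raw else None
    return parsed, bool(record.get("is_approximate_date", False))


# Illustrative record with no collection date available.
example = {"collection_date": None, "received_date": "2024-03-18"}
print(sample_date(example))
```

Returning the approximate flag alongside the date keeps downstream code from treating an uncertain date as exact.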
batch_id |
text |
Used to identify samples prepared in the same laboratory batch (e.g. extraction, library and/or sequencing). |
|
study_id |
text |
Used to identify the study, or to indicate that the sample is an NHS residual sample. |
|
study_centre_id |
text |
Used to identify sequencing centre. |
|
sequence_purpose |
choice |
Used to differentiate between clinical or research studies. |
• Choices: clinical , research |
governance_status |
choice |
Whether the patient consented to their sample being used for research purposes. |
• Choices: consented_for_research , no_consent_for_research , open |
iso_country |
choice |
Country that the sample was collected in, using ISO-3166-1 alpha-2 codes (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), unless within United Kingdom. If so, use ISO-3166-2:GB (https://en.wikipedia.org/wiki/ISO_3166-2:GB). |
• Choices: AD , AE , AF , AG , AI , AL , AM , AO , AQ , AR , AS , AT , AU , AW , AX , AZ , BA , BB , BD , BE , ... |
iso_region |
choice |
Region that the sample was collected in, using the second level subdivision codes of ISO-3166-2:GB (https://www.iso.org/obp/ui/#iso:code:3166:GB). |
• Choices: GB-ABC , GB-ABD , GB-ABE , GB-AGB , GB-AGY , GB-AND , GB-ANN , GB-ANS , GB-BAS , GB-BBD , GB-BCP , GB-BDF , GB-BDG , GB-BEN , GB-BEX , GB-BFS , GB-BGE , GB-BGW , GB-BIR , GB-BKM , ... |
extraction_enrichment_protocol |
text |
Details of nucleic acid extraction and optional enrichment steps. |
|
library_protocol |
text |
Details of sequencing library construction. |
|
sequencing_protocol |
text |
Details of sequencing. |
|
protocol_arm |
choice |
Used to indicate the arm for protocols that have separate arms for bacterial and viral nucleic acids. |
• Choices: bacterial , viral |
bioinformatics_protocol |
text |
Details of the initial bioinformatics protocol, for example versions of basecalling software and models used, and any read quality filtering/trimming employed. |
|
dehumanisation_protocol |
text |
Details of bioinformatics method used for human read removal. |
|
is_public_dataset |
bool |
The sample is from a public dataset. Please only set this after it has been made public. |
|
public_database_name |
choice |
The public repository where the data is hosted. |
• Choices: ENA , SRA |
public_database_accession |
text |
The accession for the data in the public database. |
|
ingest_report |
text |
HTML report summarising the read profile and taxa identified. |
|
taxon_reports |
text |
Folder of all classification output files. |
|
human_filtered_reads_1 |
text |
Compressed FASTQ of input reads that have been filtered for human reads. |
|
human_filtered_reads_2 |
text |
Compressed FASTQ of input reads that have been filtered for human reads. |
|
unclassified_reads_1 |
text |
Compressed FASTQ of input reads which could not be classified. |
|
unclassified_reads_2 |
text |
Compressed FASTQ of input reads which could not be classified. |
|
viral_reads_1 |
text |
Compressed FASTQ of input reads which were classified as viral. |
|
viral_reads_2 |
text |
Compressed FASTQ of input reads which were classified as viral. |
|
viral_and_unclassified_reads_1 |
text |
Compressed FASTQ of input reads which were classified as viral or were unclassified. |
|
viral_and_unclassified_reads_2 |
text |
Compressed FASTQ of input reads which were classified as viral or were unclassified. |
|
total_bases |
integer |
Total number of bases in the input FASTQ file(s), before any filtering. |
|
classifier |
choice |
The classifier used. |
• Choices: Kraken2 |
classifier_version |
text |
Version of the classifier used. |
|
classifier_db |
choice |
Database used for read classification. |
• Choices: PlusPF |
classifier_db_date |
date |
Date classifier database was produced. |
• Output format: YYYY-MM-DD |
ncbi_taxonomy_date |
date |
Date that the NCBI taxonomy dump was produced. |
• Output format: YYYY-MM-DD |
scylla_version |
text |
Version of the scylla pipeline used. |
|
chimera_bam |
text |
BAM file of the human filtered read fraction aligned to the Zeus database. |
|
is_chimera_published |
bool |
Whether chimera has been run on this record or not. |
|
alignment_db_version |
text |
Version of the Zeus database used. |
|
sylph_db_version |
text |
Sylph database version utilised to produce Sylph classifications. |
|
source_climb_id |
text |
CLIMB ID of the record used as a base dataset. |
|
spiked_ids |
array |
JSON list of taxon ids included in the spike-in. |
• Array type: integer |
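As a rough sketch of handling an integer array field such as spiked_ids, the snippet below assumes the value arrives as a JSON string (a client may already return a list, in which case no parsing is needed). The function name `parse_spiked_ids` and the example ids are illustrative placeholders only.

```python
import json
from typing import List


def parse_spiked_ids(raw: str) -> List[int]:
    """Parse a JSON list of integer taxon ids, rejecting anything else."""
    ids = json.loads(raw)
    if not isinstance(ids, list):
        raise ValueError("spiked_ids must be a JSON list")
    for item in ids:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(item, int) or isinstance(item, bool):
            raise ValueError(f"non-integer taxon id: {item!r}")
    return ids


# Placeholder ids for illustration only.
print(parse_spiked_ids("[12242, 10847]"))
```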
applications |
array |
JSON list of applications. |
• Array type: text |
methods |
structure |
JSON dictionary containing methods. |
|
taxa_files |
relation |
Table of all species level taxa extracted. |
|
taxa_files.taxon_id |
integer |
The NCBI taxonomy id associated with the taxon. |
|
taxa_files.human_readable |
text |
A human readable name for the taxon. |
|
taxa_files.n_reads |
integer |
The number of reads extracted for the taxon. |
|
taxa_files.total_bases |
integer |
Total number of bases extracted for the taxon. |
|
taxa_files.avg_quality |
decimal |
The mean quality of reads extracted for the taxon. |
|
taxa_files.mean_len |
decimal |
The mean length of reads extracted for the taxon. |
|
taxa_files.rank |
choice |
The rank of the taxon. |
• Choices: C , D , F , G , K , O , P , R , S , U |
taxa_files.fastq_1 |
text |
Compressed FASTQ of extracted reads for the taxon. |
|
taxa_files.fastq_2 |
text |
Compressed FASTQ of extracted reads for the taxon. |
|
classifier_calls |
relation |
Table summarising the NCBI taxonomy ids, counts and ranks of all taxa found by the classifier. |
|
classifier_calls.taxon_id |
integer |
The NCBI taxonomy id associated with the taxon. |
|
classifier_calls.human_readable |
text |
A human readable name for the taxon. |
|
classifier_calls.percentage |
decimal |
The percentage of the (dehumanised) sample that the taxon represents. |
|
classifier_calls.count_descendants |
integer |
The number of reads mapping to this taxon and all descendant taxa. |
|
classifier_calls.count_direct |
integer |
The number of reads mapping directly to the taxon. |
|
classifier_calls.rank |
choice |
The rank of the taxon. |
• Choices: C , D , F , G , K , O , P , R , S , U |
classifier_calls.raw_rank |
text |
The rank of the taxon, including an intermediate grading. |
|
classifier_calls.is_spike_in |
bool |
The taxon is a spike-in. |
|
spike_in_info |
relation |
Table containing taxonomic results found for the provided spike-in. |
|
spike_in_info.taxon_id |
integer |
The NCBI taxonomy id associated with the taxon. |
|
spike_in_info.human_readable |
text |
A human readable name for the taxon. |
|
spike_in_info.reference_header |
text |
Reference header for the individual sequence within the provided spike-in. |
|
spike_in_info.mapped_count |
integer |
Number of reads which aligned to a reference sequence for the provided spike-in. |
|
alignment_results |
relation |
Table containing alignment results. |
|
alignment_results.taxon_id |
integer |
The NCBI taxonomy id associated with the taxon. |
|
alignment_results.human_readable |
text |
Human readable scientific name for the taxon. |
|
alignment_results.unique_accession |
text |
Unique reference identifier in the alignment database (everything prior to the first whitespace in the FASTA header). |
|
alignment_results.accession_description |
text |
The comment for the reference sequence within the alignment database. |
|
alignment_results.sequence_length |
integer |
Length of the reference sequence in the alignment database. |
|
alignment_results.evenness_value |
integer |
A percentage indicating how evenly read depths are distributed throughout the reference, with 0 being completely uneven, and 100 being perfectly even. Taken from https://academic.oup.com/nar/article/38/10/e116/2902812, under the “Calculation of evenness score” section, and calculated here: https://github.com/CLIMB-TRE/chimera/blob/dca3cacb949dabc902d0a12e5d11d36c6ac555fd/bin/generate_alignment_report.py#L102. |
|
alignment_results.mean_depth |
integer |
Mean of all depth values across the alignment reference. |
|
alignment_results.coverage_1x |
integer |
Percentage of the reference sequence covered with a depth of at least 1x. |
|
alignment_results.coverage_10x |
integer |
Percentage of the reference covered with a depth of at least 10x. |
|
alignment_results.mapped_reads |
integer |
Total number of reads mapped to the alignment reference. |
|
alignment_results.uniquely_mapped_reads |
integer |
Total number of reads which uniquely map to a reference and position within that reference (MAPQ >= 60). |
|
alignment_results.mapped_bases |
integer |
Approximation for the total number of bases mapped to the alignment reference, calculated from the length of the reference sequence multiplied by the mean depth of alignments to that reference. |
|
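The mapped_bases approximation described above is a single multiplication. A minimal sketch follows; the function name `approx_mapped_bases` is hypothetical, and rounding to the nearest integer is an assumption made here to match the field's integer type.

```python
def approx_mapped_bases(sequence_length: int, mean_depth: float) -> int:
    """Approximate mapped bases as reference length multiplied by mean depth.

    Rounding to the nearest integer is an assumption, not documented behaviour.
    """
    return round(sequence_length * mean_depth)


# A 29,903 bp reference at a mean depth of 12x.
print(approx_mapped_bases(29903, 12))  # 358836
```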
alignment_results.mean_read_identity |
decimal |
Mean of read identities across all alignments. Can be considered an approximation for identity of the source genome with the reference sequence. Calculated for each read here: https://github.com/CLIMB-TRE/chimera/blob/dca3cacb949dabc902d0a12e5d11d36c6ac555fd/bin/generate_alignment_report.py#L58 |
|
alignment_results.read_duplication_rate |
decimal |
Proportion of reads that start and end at the same alignment reference positions as at least one other read within the alignment. Calculated here: https://github.com/CLIMB-TRE/chimera/blob/dca3cacb949dabc902d0a12e5d11d36c6ac555fd/bin/generate_alignment_report.py#L76-L83 |
|
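To make the read_duplication_rate definition concrete, here is a small sketch that follows the wording of the description literally. It is not the chimera implementation (the linked source is authoritative); it assumes each alignment to a single reference is reduced to a hypothetical `(start, end)` tuple.

```python
from collections import Counter
from typing import Iterable, Tuple


def read_duplication_rate(alignments: Iterable[Tuple[int, int]]) -> float:
    """Proportion of reads sharing (start, end) with at least one other read."""
    spans = list(alignments)
    if not spans:
        return 0.0
    counts = Counter(spans)
    # Every read in a group of two or more identical spans counts as duplicated.
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(spans)


# Two of the four reads share the span (0, 100).
print(read_duplication_rate([(0, 100), (0, 100), (5, 120), (7, 90)]))  # 0.5
```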
alignment_results.forward_proportion |
decimal |
Proportion of reads which aligned to the forward strand. Between 0 and 1, with 0 indicating all reads aligned to the reverse strand and 1 indicating all reads aligned to the forward strand. True hits should be close to 0.5 for any reasonable mean depth. |
|
alignment_results.mean_alignment_length |
decimal |
Mean length of all alignments to the reference - different to mean read length aligned to the reference, since it only considers the aligned section of the reads. |
|
sylph_results |
relation |
Table containing sylph results. |
|
sylph_results.taxon_id |
integer |
The NCBI taxonomy id associated with the taxon. |
|
sylph_results.human_readable |
text |
Human readable scientific name for the taxon. |
|
sylph_results.gtdb_taxon_string |
text |
Description of the taxonomic placement of the source contig within the Sylph database using GTDB's taxon string format. |
|
sylph_results.gtdb_assembly_id |
text |
Assembly ID (often a GenBank accession) for the contig within the sylph database, taken from GTDB. |
|
sylph_results.gtdb_contig_header |
text |
The original FASTA record header as it appears in GTDB. Identical to 'Contig_name' field in sylph profile output. |
|
sylph_results.taxonomic_abundance |
decimal |
Normalized taxonomic abundance as a percentage. Identical to 'Taxonomic_abundance' in sylph profile output. |
|
sylph_results.sequence_abundance |
decimal |
Normalized sequence abundance as a percentage. Identical to 'Sequence_abundance' in sylph profile output. |
|
sylph_results.adjusted_ani |
decimal |
If coverage adjustment is possible (cov is < 3x cov): returns coverage-adjusted ANI (Average Nucleotide Identity). If coverage is too low/high: returns naive_ani. Identical to 'Adjusted_ANI' in sylph profile output. |
|
sylph_results.ani_confidence_interval |
text |
[5%,95%] confidence intervals. If coverage adjustment is possible: float-float e.g. 98.52-99.55. If coverage is too low/high: NA-NA is given. Identical to 'ANI_5-95_percentile' field in sylph profile output. |
|
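Because the interval is stored as text, it needs parsing before numeric use. The sketch below assumes values are non-negative plain decimals in the `low-high` form shown above, or the literal `NA-NA`; the function name `parse_interval` is hypothetical.

```python
from typing import Optional, Tuple


def parse_interval(value: str) -> Optional[Tuple[float, float]]:
    """Parse 'low-high' into a pair of floats; return None for 'NA-NA'."""
    low, sep, high = value.partition("-")
    if not sep or low == "NA" or high == "NA":
        return None
    return float(low), float(high)


print(parse_interval("98.52-99.55"))  # (98.52, 99.55)
print(parse_interval("NA-NA"))        # None
```

Returning None for the unavailable case lets callers tell "no interval" apart from a real numeric range.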
sylph_results.effective_coverage |
decimal |
Estimated 'λeff' (effective coverage) value. The true value is not calculated; this estimate is based on k-mers. More information is available in the sylph paper: https://www.nature.com/articles/s41587-024-02412-y. If coverage adjustment is possible, the lambda estimate is given. Identical to 'Eff_cov' field in sylph profile output. |
|
sylph_results.effective_coverage_confidence_interval |
text |
[5%, 95%] confidence intervals for lambda. Same format rules as 'ani_confidence_interval'. Identical to 'Lambda_5-95_percentile' field in sylph profile output. |
|
sylph_results.median_kmer_cov |
integer |
Median k-mer multiplicity for k-mers with >= 1 multiplicity. Identical to 'Median_cov' field in sylph profile output. |
|
sylph_results.mean_kmer_cov |
decimal |
Mean k-mer multiplicity for k-mers with >= 1 multiplicity. Identical to 'Mean_cov_geq1' field in sylph profile output. |
|
sylph_results.containment_index |
text |
int/int showing the containment index (number of k-mers found in sample divided by total k-mers), e.g. 959/1053. Identical to 'Containment_ind' field in sylph profile output. |
|
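The containment index is stored as an `int/int` string, so converting it to a ratio takes a small parsing step. A minimal sketch, assuming the value always has the `found/total` form shown above with a non-zero total; `parse_containment_index` is a hypothetical name.

```python
from fractions import Fraction


def parse_containment_index(value: str) -> Fraction:
    """Convert 'found/total' k-mer counts into an exact ratio."""
    found, total = value.split("/")
    # Fraction raises ZeroDivisionError if total is 0.
    return Fraction(int(found), int(total))


ratio = parse_containment_index("959/1053")
print(ratio, round(float(ratio), 4))
```

Using `Fraction` keeps the original counts recoverable; call `float()` on the result where a decimal is needed.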
sylph_results.naive_ani |
decimal |
Containment ANI without coverage adjustment. Identical to 'Naive_ANI' field in sylph profile output. |
|
sylph_results.kmers_reassigned |
integer |
The number of k-mers reassigned away from the genome. Identical to 'Kmers_reassigned' field in sylph profile output. |
|