mSCAPE Uploader Specification¶
Files to be provided¶
For paired-end Illumina sequencing data, suppliers must provide:
- A FASTQ 1 file containing the forward sequencing reads.
- A FASTQ 2 file containing the reverse sequencing reads.
- A CSV file containing the metadata associated with sequencing the sample.
For single-end Illumina sequencing data, suppliers must provide:
- A FASTQ file containing the sequencing reads.
- A CSV file containing the metadata associated with sequencing the sample.
For ONT sequencing data, suppliers must provide:
- A FASTQ file containing the sequencing reads.
- A CSV file containing the metadata associated with sequencing the sample.
Sequencing data must be dehumanised prior to submission. The ingest pipeline will reject sequencing data where the number of assigned human reads exceeds the human read rejection threshold.
File naming convention¶
The base filenames should be of the form
where:
- [run_index] is an identifier that is unique within a sequencing run, e.g. a sequencing barcode identifier, or a 96-well plate co-ordinate.
- [run_id] is the name of the sequencing run as given by the supplier's sequencing instrument (not an internal identifier assigned by the supplier).
- [extension] is the file extension indicating the file type.
File name extensions¶
For paired-end Illumina sequencing data, the extensions ([extension]) should be:
- 1.fastq.gz for the forward FASTQ file.
- 2.fastq.gz for the reverse FASTQ file.
- csv for the CSV metadata file.
For single-end Illumina sequencing data, the extensions ([extension]) should be:
- fastq.gz for the FASTQ file.
- csv for the CSV metadata file.
For ONT sequencing data, the extensions ([extension]) should be:
- fastq.gz for the FASTQ file.
- csv for the CSV metadata file.
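As a rough illustration of the extension rules above, the following sketch reports which files are still missing from a submission. The platform labels (illumina_paired, illumina_single, ont) and the function name are hypothetical and are not part of this specification.

```python
# Extensions listed in this specification, keyed by a hypothetical platform label.
# The labels themselves are an assumption made for this example.
EXPECTED_EXTENSIONS = {
    "illumina_paired": {"1.fastq.gz", "2.fastq.gz", "csv"},
    "illumina_single": {"fastq.gz", "csv"},
    "ont": {"fastq.gz", "csv"},
}


def missing_extensions(platform, provided):
    """Return the sorted list of expected extensions not found in `provided`."""
    try:
        expected = EXPECTED_EXTENSIONS[platform]
    except KeyError:
        raise ValueError(f"unknown platform: {platform!r}") from None
    return sorted(expected - set(provided))
```

For example, `missing_extensions("illumina_paired", ["1.fastq.gz", "csv"])` reports that the reverse FASTQ file is still needed.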
Valid characters¶
The [run_index], [run_id] and [extension] must contain only:
- Letters (A-Z, a-z).
- Numbers (0-9).
- Hyphens (-).
- Underscores (_).
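The character rule above can be sketched as a regular expression check. This is a minimal example rather than the ingest pipeline's own validation. It is applied to [run_index] and [run_id] only, on the assumption that [extension] is better checked against the list of permitted extensions, since those contain periods. The function name is hypothetical.

```python
import re

# Letters, numbers, hyphens and underscores only, as listed above.
# Requiring at least one character is an assumption of this sketch.
VALID_COMPONENT = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_component(value):
    """Return True if `value` is non-empty and uses only the permitted characters."""
    return VALID_COMPONENT.fullmatch(value) is not None
```

A value such as `barcode_12-A` passes, while values containing spaces or periods do not.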
Buckets¶
Bucket names follow the general convention:
Metadata specification¶
Required fields¶
Field | Data type | Description | Restrictions |
---|---|---|---|
biosample_id | text | The sequencing provider's identifier for a sample. | • Max length: 50 |
run_index | text | The sequencing provider's identifier for the position of a sample on a run. | • Max length: 50 |
run_id | text | Unique identifier assigned to the run by the sequencing instrument. | • Max length: 100 |
input_type | choice | The type of input sequenced. | • Choices: community_standard, negative_control, positive_control, specimen, validation_material |
sample_source | choice | The source from which the sample was collected. | • Choices: blood, environment, faecal, lower_respiratory, nose_and_throat, other, plasma, pleural_fluid, stool, tissue, upper_respiratory, urine |
sample_type | choice | The type of sampling method used. | • Choices: aspirate, bal, biopsy, other, sputum, swab |
spike_in | choice | The type of spike-in used in the run. | • Choices: ERCC-RNA_4456740, ms2-phage, none, phix, tobacco_mosaic_virus, zymo_D6320, zymo_D6321 |
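Before uploading, a supplier might check each metadata row against the required fields above. The sketch below assumes each CSV row has been read into a dictionary of strings (for instance with `csv.DictReader`). The function and constant names are hypothetical, and the ingest pipeline may apply further checks.

```python
# Maximum lengths for the required text fields, taken from the table above.
REQUIRED_TEXT = {"biosample_id": 50, "run_index": 50, "run_id": 100}

# Permitted values for the required choice fields, taken from the table above.
REQUIRED_CHOICES = {
    "input_type": {
        "community_standard", "negative_control", "positive_control",
        "specimen", "validation_material",
    },
    "sample_source": {
        "blood", "environment", "faecal", "lower_respiratory",
        "nose_and_throat", "other", "plasma", "pleural_fluid", "stool",
        "tissue", "upper_respiratory", "urine",
    },
    "sample_type": {"aspirate", "bal", "biopsy", "other", "sputum", "swab"},
    "spike_in": {
        "ERCC-RNA_4456740", "ms2-phage", "none", "phix",
        "tobacco_mosaic_virus", "zymo_D6320", "zymo_D6321",
    },
}


def check_required(row):
    """Return a list of problems found in the required fields of one row."""
    errors = []
    for field, max_len in REQUIRED_TEXT.items():
        value = row.get(field)
        if not value:
            errors.append(f"{field}: missing")
        elif len(value) > max_len:
            errors.append(f"{field}: longer than {max_len} characters")
    for field, choices in REQUIRED_CHOICES.items():
        value = row.get(field)
        if not value:
            errors.append(f"{field}: missing")
        elif value not in choices:
            errors.append(f"{field}: {value!r} is not a valid choice")
    return errors
```

An empty list means the row passed these particular checks. It does not guarantee that the row will be accepted on ingest.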
At least one of the following fields is required:
Field | Data type | Description | Restrictions |
---|---|---|---|
collection_date | date | The date the sample was collected. | • Input formats: YYYY-MM, YYYY-MM-DD • Output format: YYYY-MM-DD |
received_date | date | The date the sample was received by the sequencing centre (if collection_date unavailable). | • Input formats: YYYY-MM, YYYY-MM-DD • Output format: YYYY-MM-DD |
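The date rules above could be checked along the following lines. This is a sketch with hypothetical function names. In particular, mapping a YYYY-MM input to the first day of that month is an assumption about how the YYYY-MM-DD output is produced, not something this specification states.

```python
import re
from datetime import datetime

# Accept only zero-padded YYYY-MM or YYYY-MM-DD strings.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}(-[0-9]{2})?")


def normalise_date(value):
    """Validate a date string and return it in YYYY-MM-DD form.

    Assumption: a YYYY-MM value is mapped to the first day of the month.
    """
    if DATE_PATTERN.fullmatch(value) is None:
        raise ValueError(f"expected YYYY-MM or YYYY-MM-DD, got {value!r}")
    fmt = "%Y-%m-%d" if len(value) == 10 else "%Y-%m"
    # strptime raises ValueError for impossible dates such as 2024-02-30.
    return datetime.strptime(value, fmt).date().isoformat()


def has_required_date(row):
    """Return True if collection_date or received_date is present in the row."""
    return bool(row.get("collection_date") or row.get("received_date"))
```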
Optional fields¶
Field | Data type | Description | Restrictions |
---|---|---|---|
biosample_source_id | text | Unique identifier for an individual to permit multiple samples from the same individual to be linked. | • Max length: 50 |
specimen_type_details | choice | Named control or standard for specimens. | • Required when input_type is: specimen • Choices: asymptomatic, respiratory_infection |
control_type_details | choice | Named control or standard for positive and negative controls. | • Required when input_type is: positive_control • Required when input_type is: negative_control • Choices: NIBSC_11/242, NIBSC_20/170, water_extraction_control, zepto_rp2.1, zymo-mc_D6300 |
is_approximate_date | bool | The date is approximate, e.g. the sample is from a public repository and it is unclear whether the date corresponds to collection or publishing. | • Default: False |
batch_id | text | Used to identify samples prepared in the same laboratory batch (e.g. extraction, library and/or sequencing). | • Max length: 100 |
study_id | text | Used to identify the study, or whether the sample is an NHS residual sample. | • Max length: 100 |
study_centre_id | text | Used to identify the sequencing centre. | • Max length: 100 |
sequence_purpose | choice | Used to differentiate between clinical and research studies. | • Choices: clinical, research |
governance_status | choice | Whether the patient consented to their sample being used for research purposes. | • Default: no_consent_for_research • Choices: consented_for_research, no_consent_for_research, open |
iso_country | choice | Country that the sample was collected in, using ISO-3166-1 alpha-2 codes (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), unless within the United Kingdom, in which case use ISO-3166-2:GB (https://en.wikipedia.org/wiki/ISO_3166-2:GB). | • Choices: AD, AE, AF, AG, AI, AL, AM, AO, AQ, AR, AS, AT, AU, AW, AX, AZ, BA, BB, BD, BE, ... |
iso_region | choice | Region that the sample was collected in, using the second level subdivision codes of ISO-3166-2:GB (https://www.iso.org/obp/ui/#iso:code:3166:GB). | • Requires: iso_country • Choices: GB-ABC, GB-ABD, GB-ABE, GB-AGB, GB-AGY, GB-AND, GB-ANN, GB-ANS, GB-BAS, GB-BBD, GB-BCP, GB-BDF, GB-BDG, GB-BEN, GB-BEX, GB-BFS, GB-BGE, GB-BGW, GB-BIR, GB-BKM, ... |
extraction_enrichment_protocol | text | Details of nucleic acid extraction and optional enrichment steps. | |
library_protocol | text | Details of sequencing library construction. | |
sequencing_protocol | text | Details of sequencing. | |
bioinformatics_protocol | text | Details of the initial bioinformatics protocol, for example versions of basecalling software and models used, and any read quality filtering/trimming employed. | |
dehumanisation_protocol | text | Details of the bioinformatics method used for human read removal. | |
is_public_dataset | bool | The sample is from a public dataset. Please only set this after it has been made public. | • Default: False |
public_database_name | choice | The public repository where the data is held. | • Choices: ENA, SRA |
public_database_accession | text | The accession for the data in the public database. | |
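Some optional fields become mandatory depending on other values ("Required when" and "Requires" in the Restrictions column). The sketch below checks only those conditional rules for a single row held as a dictionary. The function name is hypothetical, and the checks are an illustration rather than the pipeline's own logic.

```python
def check_conditionals(row):
    """Return a list of conditional-requirement problems for one metadata row."""
    errors = []
    input_type = row.get("input_type")

    # specimen_type_details is required when input_type is specimen.
    if input_type == "specimen" and not row.get("specimen_type_details"):
        errors.append("specimen_type_details is required when input_type is specimen")

    # control_type_details is required for positive and negative controls.
    if input_type in {"positive_control", "negative_control"} and not row.get(
        "control_type_details"
    ):
        errors.append(f"control_type_details is required when input_type is {input_type}")

    # iso_region can only be given alongside iso_country.
    if row.get("iso_region") and not row.get("iso_country"):
        errors.append("iso_region requires iso_country")

    return errors
```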