mSCAPE Uploader Specification

Files to be provided

For paired-end Illumina sequencing data, suppliers must provide:

  • A FASTQ 1 file containing the forward sequencing reads.
  • A FASTQ 2 file containing the reverse sequencing reads.
  • A CSV file containing the metadata associated with sequencing the sample.

For single-end Illumina sequencing data, suppliers must provide:

  • A FASTQ file containing the sequencing reads.
  • A CSV file containing the metadata associated with sequencing the sample.

For ONT sequencing data, suppliers must provide:

  • A FASTQ file containing the sequencing reads.
  • A CSV file containing the metadata associated with sequencing the sample.

Sequencing data must be dehumanised prior to submission. The ingest pipeline will reject sequencing data where the number of assigned human reads exceeds the human read rejection threshold.

File naming convention

The base filenames should be of the form

mscape.[run_index].[run_id].[extension]

where:

  • [run_index] is an identifier that is unique within a sequencing run, e.g. a sequencing barcode identifier, or a 96-well plate co-ordinate.
  • [run_id] is the name of the sequencing run as given by the supplier's sequencing instrument (not an internal identifier assigned by the supplier).
  • [extension] is the file extension indicating the file type.

File name extensions

For paired-end Illumina sequencing data, the extensions ([extension]) should be:

  • 1.fastq.gz for the forward FASTQ file.
  • 2.fastq.gz for the reverse FASTQ file.
  • csv for the CSV metadata file.

For single-end Illumina sequencing data, the extensions ([extension]) should be:

  • fastq.gz for the FASTQ file.
  • csv for the CSV metadata file.

For ONT sequencing data, the extensions ([extension]) should be:

  • fastq.gz for the FASTQ file.
  • csv for the CSV metadata file.
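
As an illustration of the naming convention and extensions above, the following minimal sketch lists the base filenames expected for one sample. The function name and the platform labels ("illumina_paired", "illumina_single", "ont") are assumptions made for this example and are not defined by the specification.

```python
# Extensions per data type, taken from the lists above.
# The dictionary keys are hypothetical labels chosen for this sketch.
EXTENSIONS = {
    "illumina_paired": ["1.fastq.gz", "2.fastq.gz", "csv"],
    "illumina_single": ["fastq.gz", "csv"],
    "ont": ["fastq.gz", "csv"],
}


def expected_filenames(run_index: str, run_id: str, platform: str) -> list[str]:
    """Return the base filenames for one sample, in the form
    mscape.[run_index].[run_id].[extension]."""
    if platform not in EXTENSIONS:
        raise ValueError(f"unknown platform label: {platform!r}")
    return [f"mscape.{run_index}.{run_id}.{ext}" for ext in EXTENSIONS[platform]]


if __name__ == "__main__":
    for name in expected_filenames("barcode01", "run-1", "illumina_paired"):
        print(name)
```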

Valid characters

The [run_index], [run_id] and [extension] must contain only:

  • Letters (A-Z, a-z).
  • Numbers (0-9).
  • Hyphens (-).
  • Underscores (_).
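
A supplier may wish to check filenames before uploading. The sketch below is one possible check, not the ingest pipeline's own validation. It assumes that the dots in a filename act only as separators, so [run_index] and [run_id] are restricted to the characters listed above and the extension is matched against the known values. The function name parse_filename is hypothetical.

```python
import re

# One or more of: letters, numbers, hyphens, underscores.
_FIELD = r"[A-Za-z0-9_-]+"

# Assumption: dots separate the parts, and the extension is one of the
# values listed under "File name extensions".
_PATTERN = re.compile(
    rf"mscape\.(?P<run_index>{_FIELD})\.(?P<run_id>{_FIELD})\."
    r"(?P<extension>1\.fastq\.gz|2\.fastq\.gz|fastq\.gz|csv)"
)


def parse_filename(name: str):
    """Return the parts of a base filename as a dict, or None if the
    name does not follow the convention."""
    match = _PATTERN.fullmatch(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_filename("mscape.barcode01.run-1.1.fastq.gz"))
    print(parse_filename("mscape.barcode 01.run-1.csv"))
```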

Buckets

Bucket names follow the general convention:

mscape-[sequencing_org]-[platform]-[test_flag]
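
The convention can be expressed as a small helper. This is a sketch only: the specification does not list the allowed values for [sequencing_org], [platform] or [test_flag], so the values in the usage example are invented placeholders.

```python
def bucket_name(sequencing_org: str, platform: str, test_flag: str) -> str:
    """Join the three parts into mscape-[sequencing_org]-[platform]-[test_flag]."""
    parts = [sequencing_org, platform, test_flag]
    if any(not part for part in parts):
        raise ValueError("sequencing_org, platform and test_flag are all required")
    return "mscape-" + "-".join(parts)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(bucket_name("exampleorg", "illumina", "prod"))
```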

Metadata specification

Required fields

Field | Data type | Description | Restrictions
biosample_id | text | The sequencing provider's identifier for a sample. | • Max length: 50
run_index | text | The sequencing provider's identifier for the position of a sample on a run. | • Max length: 50
run_id | text | Unique identifier assigned to the run by the sequencing instrument. | • Max length: 100
input_type | choice | The type of input sequenced. | • Choices: community_standard, negative_control, positive_control, specimen, validation_material
sample_source | choice | The source from which the sample was collected. | • Choices: blood, environment, faecal, lower_respiratory, nose_and_throat, other, plasma, pleural_fluid, stool, tissue, upper_respiratory, urine
sample_type | choice | The type of sampling method used. | • Choices: aspirate, bal, biopsy, other, sputum, swab
spike_in | choice | The type of spike-in used in the run. | • Choices: ERCC-RNA_4456740, ms2-phage, none, phix, tobacco_mosaic_virus, zymo_D6320, zymo_D6321
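
The required fields can be checked before submission with a short script. The sketch below encodes the maximum lengths and choices from the table above. The function name check_required and the representation of a metadata row as a dict of strings (for example, one row read with csv.DictReader) are assumptions of this example.

```python
# Maximum lengths for the required text fields.
REQUIRED_TEXT = {"biosample_id": 50, "run_index": 50, "run_id": 100}

# Allowed values for the required choice fields.
REQUIRED_CHOICES = {
    "input_type": {
        "community_standard", "negative_control", "positive_control",
        "specimen", "validation_material",
    },
    "sample_source": {
        "blood", "environment", "faecal", "lower_respiratory",
        "nose_and_throat", "other", "plasma", "pleural_fluid", "stool",
        "tissue", "upper_respiratory", "urine",
    },
    "sample_type": {"aspirate", "bal", "biopsy", "other", "sputum", "swab"},
    "spike_in": {
        "ERCC-RNA_4456740", "ms2-phage", "none", "phix",
        "tobacco_mosaic_virus", "zymo_D6320", "zymo_D6321",
    },
}


def check_required(record: dict) -> list[str]:
    """Return a list of problems found in the required fields of one row."""
    errors = []
    for field, max_len in REQUIRED_TEXT.items():
        value = record.get(field, "")
        if not value:
            errors.append(f"{field}: missing")
        elif len(value) > max_len:
            errors.append(f"{field}: longer than {max_len} characters")
    for field, choices in REQUIRED_CHOICES.items():
        value = record.get(field, "")
        if not value:
            errors.append(f"{field}: missing")
        elif value not in choices:
            errors.append(f"{field}: {value!r} is not a valid choice")
    return errors
```

An empty list means that no problems were found in the required fields. Date fields and conditionally required fields are not covered by this check.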

At least one of the following fields is required:

Field | Data type | Description | Restrictions
collection_date | date | The date the sample was collected. | • Input formats: YYYY-MM, YYYY-MM-DD • Output format: YYYY-MM-DD
received_date | date | The date the sample was received by the sequencing centre (if collection_date unavailable). | • Input formats: YYYY-MM, YYYY-MM-DD • Output format: YYYY-MM-DD
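
The date rules can be sketched as follows. The specification gives the accepted input formats and the output format but does not state how a YYYY-MM value is converted to YYYY-MM-DD. This example assumes the first day of the month, which may differ from the pipeline's behaviour. The function names are hypothetical.

```python
import re
from datetime import datetime


def normalise_date(value: str) -> str:
    """Accept YYYY-MM or YYYY-MM-DD and return YYYY-MM-DD.

    Assumption: a YYYY-MM input is mapped to the first day of that month.
    """
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        parsed = datetime.strptime(value, "%Y-%m-%d")
    elif re.fullmatch(r"[0-9]{4}-[0-9]{2}", value):
        parsed = datetime.strptime(value, "%Y-%m")
    else:
        raise ValueError(f"expected YYYY-MM or YYYY-MM-DD, got {value!r}")
    return parsed.date().isoformat()


def check_dates(record: dict) -> dict:
    """Require at least one of collection_date and received_date, and
    return the supplied dates in normalised form."""
    supplied = {
        field: record[field]
        for field in ("collection_date", "received_date")
        if record.get(field)
    }
    if not supplied:
        raise ValueError("at least one of collection_date or received_date is required")
    return {field: normalise_date(value) for field, value in supplied.items()}
```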

Optional fields

Field | Data type | Description | Restrictions
biosample_source_id | text | Unique identifier for an individual to permit multiple samples from the same individual to be linked. | • Max length: 50
specimen_type_details | choice | Named control or standard for specimens. | • Required when input_type is: specimen • Choices: asymptomatic, respiratory_infection
control_type_details | choice | Named control or standard for positive and negative controls. | • Required when input_type is: positive_control • Required when input_type is: negative_control • Choices: NIBSC_11/242, NIBSC_20/170, water_extraction_control, zepto_rp2.1, zymo-mc_D6300
is_approximate_date | bool | The date is approximate, e.g. the sample is from a public repository and it is unclear whether the date corresponds to collection or publishing. | • Default: False
batch_id | text | Used to identify samples prepared in the same laboratory batch (e.g. extraction, library and/or sequencing). | • Max length: 100
study_id | text | Used to identify the study, or whether the sample is an NHS residual sample. | • Max length: 100
study_centre_id | text | Used to identify the sequencing centre. | • Max length: 100
sequence_purpose | choice | Used to differentiate between clinical and research studies. | • Choices: clinical, research
governance_status | choice | Whether the patient consented to their sample being used for research purposes. | • Default: no_consent_for_research • Choices: consented_for_research, no_consent_for_research, open
iso_country | choice | Country that the sample was collected in, using ISO-3166-1 alpha-2 codes (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), unless within the United Kingdom. If so, use ISO-3166-2:GB (https://en.wikipedia.org/wiki/ISO_3166-2:GB). | • Choices: AD, AE, AF, AG, AI, AL, AM, AO, AQ, AR, AS, AT, AU, AW, AX, AZ, BA, BB, BD, BE, ...
iso_region | choice | Region that the sample was collected in, using the second level subdivision codes of ISO-3166-2:GB (https://www.iso.org/obp/ui/#iso:code:3166:GB). | • Requires: iso_country • Choices: GB-ABC, GB-ABD, GB-ABE, GB-AGB, GB-AGY, GB-AND, GB-ANN, GB-ANS, GB-BAS, GB-BBD, GB-BCP, GB-BDF, GB-BDG, GB-BEN, GB-BEX, GB-BFS, GB-BGE, GB-BGW, GB-BIR, GB-BKM, ...
extraction_enrichment_protocol | text | Details of nucleic acid extraction and optional enrichment steps. |
library_protocol | text | Details of sequencing library construction. |
sequencing_protocol | text | Details of sequencing. |
bioinformatics_protocol | text | Details of the initial bioinformatics protocol, for example versions of basecalling software and models used, and any read quality filtering/trimming employed. |
dehumanisation_protocol | text | Details of the bioinformatics method used for human read removal. |
is_public_dataset | bool | The sample is from a public dataset. Please only set this after it has been made public. | • Default: False
public_database_name | choice | The public repository where the data is held. | • Choices: ENA, SRA
public_database_accession | text | The accession for the data in the public database. |
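
Several optional fields carry conditions ("Required when", "Requires") or defaults. The sketch below shows one way to check the conditions and fill in the defaults for a single metadata row. It is an illustration under the assumption that a row is held as a dict in which a missing or empty value means "not supplied". The function names are hypothetical, and the ingest pipeline may apply these rules differently.

```python
# Defaults stated in the optional fields table.
DEFAULTS = {
    "is_approximate_date": False,
    "governance_status": "no_consent_for_research",
    "is_public_dataset": False,
}


def check_conditional(record: dict) -> list[str]:
    """Return a list of conditional-requirement problems for one row."""
    errors = []
    input_type = record.get("input_type")
    if input_type == "specimen" and not record.get("specimen_type_details"):
        errors.append("specimen_type_details is required when input_type is specimen")
    if input_type in ("positive_control", "negative_control") and not record.get(
        "control_type_details"
    ):
        errors.append(f"control_type_details is required when input_type is {input_type}")
    if record.get("iso_region") and not record.get("iso_country"):
        errors.append("iso_region requires iso_country")
    return errors


def apply_defaults(record: dict) -> dict:
    """Return a copy of the row with defaults filled in for fields that
    were not supplied (missing or empty string)."""
    supplied = {key: value for key, value in record.items() if value != ""}
    return {**DEFAULTS, **supplied}
```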