PATH-SAFE Uploader Specification

Files to be provided

A FASTQ 1 file containing the forward sequencing reads.
A FASTQ 2 file containing the reverse sequencing reads.
A CSV file containing the metadata associated with sequencing the sample.

File naming convention

The base filenames should be of the form

pathsafe.[run_index].[run_id].[extension]

where:

[run_index] is an identifier that is unique within a sequencing run, e.g. a sequencing barcode identifier, or a 96-well plate co-ordinate.
[run_id] is the name of the sequencing run as given by the supplier's sequencing instrument (not an internal identifier assigned by the supplier).
[extension] is the file extension indicating the file type.

File name extensions

The extensions ([extension]) should be:

1.fastq.gz for the forward FASTQ file.
2.fastq.gz for the reverse FASTQ file.
csv for the CSV metadata file.
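
As an illustration of the naming convention and extensions above, the following minimal sketch builds the three filenames expected for one sample. The function name `expected_filenames` and the dictionary keys are hypothetical and are not part of the specification.

```python
def expected_filenames(run_index, run_id):
    """Return the three filenames expected for one sample.

    The keys of the returned dict are illustrative labels only.
    """
    extensions = {
        "forward_fastq": "1.fastq.gz",
        "reverse_fastq": "2.fastq.gz",
        "metadata_csv": "csv",
    }
    return {
        role: f"pathsafe.{run_index}.{run_id}.{extension}"
        for role, extension in extensions.items()
    }


if __name__ == "__main__":
    # Example values are made up for illustration.
    for name in expected_filenames("A01", "RUN_42").values():
        print(name)
```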

Valid characters

The [run_index], [run_id] and [extension] must contain only:

Letters (A-Z, a-z).
Numbers (0-9).
Hyphens (-).
Underscores (_).
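
The naming and character rules above could be checked with a regular expression along the following lines. This is a sketch under stated assumptions, not the uploader's actual validation: because the listed extensions themselves contain periods, it assumes the character rule applies to [run_index] and [run_id], and it matches [extension] against the three listed values. The name `parse_filename` is hypothetical.

```python
import re

# Letters, numbers, hyphens and underscores only (assumed to apply to
# run_index and run_id; the extension is matched against the listed values).
_FIELD = r"[A-Za-z0-9_-]+"
_FILENAME = re.compile(
    rf"pathsafe\.(?P<run_index>{_FIELD})\.(?P<run_id>{_FIELD})"
    r"\.(?P<extension>1\.fastq\.gz|2\.fastq\.gz|csv)"
)


def parse_filename(name):
    """Return the filename's parts as a dict, or None if it does not conform."""
    match = _FILENAME.fullmatch(name)
    return match.groupdict() if match else None
```

A caller might treat a `None` result as a rejected upload and use the returned `run_index` and `run_id` to pair the two FASTQ files with their CSV file.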

Metadata specification

CSV Template

A CSV template for uploaders can be downloaded here: pathsafe-template.csv

Required fields

Field	Data type	Description	Restrictions
`biosample_id`	`text`	The sequencing provider's identifier for a sample.	• Max length: `50`
`run_index`	`text`	The sequencing provider's identifier for the position of a sample on a run.	• Max length: `50`
`run_id`	`text`	The unique identifier assigned to the run by the sequencing instrument.	• Max length: `100`
`submitted_species`	`choice`	The NCBI taxonomy id provided for the sample.	• Choices: `1639`, `28901`, `562`
`year`	`integer`	Year the sample was collected, if available; otherwise the year the sample was received.	• Min value: `2000`
`data_steward`	`choice`	Laboratory, organisation or agency that holds the data for the sample.	• Choices: `APHA`, `FSA`, `FSS`, `OTHER`, `PHS`, `PHW`, `SEPA`, `SSSCDRL`, `UKHSA`
`source_type`	`choice`	Source of the sample.	• Choices: `animal`, `animal_associated_environment`, `environment`, `food`, `food_associated_environment`, `human`, `human_associated_environment`, `missing`, `not_applicable`, `not_collected`, `not_provided`, `other`, `other_environment`, `restricted_access`
`country`	`choice`	The country in which the sample was collected, using ISO-3166-1 alpha-2 codes (https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) or, for samples collected within the United Kingdom, ISO-3166-2:GB codes (https://en.wikipedia.org/wiki/ISO_3166-2:GB).	• Choices: `GB`, `GB-ENG`, `GB-NIR`, `GB-SCT`, `GB-WLS`
`sample_purpose`	`choice`	The purpose of the sample collection.	• Choices: `active_surveillance`, `not_applicable`, `not_collected`, `not_provided`, `other`, `outbreak_initiated_surveillance`, `outbreak_investigation`, `population_based_surveillance`, `research`, `restricted_access`, `routine_diagnostics`, `routine_surveillance`
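
The restrictions in the table above can be sketched as a per-row check. This is a minimal illustration, assuming each metadata row is available as a dict of strings keyed by field name; the names `check_required`, `MAX_LENGTH`, `CHOICES` and `MIN_YEAR` are hypothetical, and the values are copied from the table as listed.

```python
# Constraints copied from the required-fields table; names are hypothetical.
MAX_LENGTH = {"biosample_id": 50, "run_index": 50, "run_id": 100}
CHOICES = {
    "submitted_species": {"1639", "28901", "562"},
    "data_steward": {"APHA", "FSA", "FSS", "OTHER", "PHS", "PHW", "SEPA",
                     "SSSCDRL", "UKHSA"},
    "source_type": {
        "animal", "animal_associated_environment", "environment", "food",
        "food_associated_environment", "human",
        "human_associated_environment", "missing", "not_applicable",
        "not_collected", "not_provided", "other", "other_environment",
        "restricted_access",
    },
    "country": {"GB", "GB-ENG", "GB-NIR", "GB-SCT", "GB-WLS"},
    "sample_purpose": {
        "active_surveillance", "not_applicable", "not_collected",
        "not_provided", "other", "outbreak_initiated_surveillance",
        "outbreak_investigation", "population_based_surveillance",
        "research", "restricted_access", "routine_diagnostics",
        "routine_surveillance",
    },
}
MIN_YEAR = 2000


def check_required(row):
    """Return a list of error messages for one metadata row."""
    errors = []
    # Every required field must be present and non-empty.
    for field in [*MAX_LENGTH, *CHOICES, "year"]:
        if not (row.get(field) or "").strip():
            errors.append(f"{field}: missing required value")
    # Text fields must respect their maximum length.
    for field, limit in MAX_LENGTH.items():
        if len((row.get(field) or "").strip()) > limit:
            errors.append(f"{field}: longer than {limit} characters")
    # Choice fields must use one of the listed values.
    for field, allowed in CHOICES.items():
        value = (row.get(field) or "").strip()
        if value and value not in allowed:
            errors.append(f"{field}: {value!r} is not an allowed choice")
    # The year must be an integer no earlier than the minimum.
    year = (row.get("year") or "").strip()
    if year:
        try:
            if int(year) < MIN_YEAR:
                errors.append(f"year: must be {MIN_YEAR} or later")
        except ValueError:
            errors.append("year: not an integer")
    return errors
```

Rows produced by Python's `csv.DictReader` have this dict-of-strings shape, so the check could be applied to each row of a metadata file before upload.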

Optional fields

Field	Data type	Description	Restrictions
`biosample_source_id`	`text`	Unique identifier for an individual to permit multiple samples from the same individual to be linked.	• Max length: `50`
`sample_accession`	`text`	Sample accession number if the sequence is publicly available in SRA.
`enterobase_barcode`	`text`	Sample barcode if the sequence is publicly available in EnteroBase.
`collection_date`	`date`	Date of sample collection.	• Input formats: `YYYY-MM` • Output format: `YYYY-MM`
`receipt_date`	`date`	Date of receipt of the sample.	• Input formats: `YYYY-MM` • Output format: `YYYY-MM`
`month`	`integer`	Month the sample was collected, if available; otherwise the month the sample was received.	• Min value: `1` • Max value: `12`
`sequence_org`	`choice`	Laboratory, organisation or agency the sample has been sequenced by.	• Choices: `APHA`, `FSA`, `FSS`, `OTHER`, `PHS`, `PHW`, `SEPA`, `SSSCDRL`, `UKHSA`
`sequence_org_other`	`text`	Additional laboratory, organisation or agency the sample has been sequenced by.	• Requires: `sequence_org` • Required when `sequence_org` is: `OTHER`
`data_steward_other`	`text`	Additional laboratory, organisation or agency that holds the data for the sample.	• Requires: `data_steward` • Required when `data_steward` is: `OTHER`
`county`	`choice`	County that the sample was collected in, using the second level subdivision codes of ISO-3166-2:GB (https://www.iso.org/obp/ui/#iso:code:3166:GB).	• Requires: `country` • Choices: `GB-ABC`, `GB-ABD`, `GB-ABE`, `GB-AGB`, `GB-AGY`, `GB-AND`, `GB-ANN`, `GB-ANS`, `GB-BAS`, `GB-BBD`, `GB-BCP`, `GB-BDF`, `GB-BDG`, `GB-BEN`, `GB-BEX`, `GB-BFS`, `GB-BGE`, `GB-BGW`, `GB-BIR`, `GB-BKM`, ...
`sample_purpose_other`	`text`	Additional purpose of the sample collection.	• Requires: `sample_purpose` • Required when `sample_purpose` is: `other`
`sequencing_kit`	`text`	The sequencing kit used.
`library_kit`	`text`	The library kit used to prep the sample.
`is_multiplexed`	`bool`	Whether the sample was multiplexed.
`type_of_sample`	`choice`	Type of sample used to produce the sequence.	• Default: `genomic` • Choices: `genomic`
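
The conditional restrictions in the table above can be sketched as follows. The interpretation is an assumption rather than something the specification states: "Requires: X" is read as "if this field is filled in, X must be filled in too", and "Required when X is: V" as "this field must be filled in whenever X equals V". The names `check_optional`, `CONDITIONAL` and `YEAR_MONTH` are hypothetical.

```python
import re

# field -> (parent field, parent value that makes the field mandatory)
CONDITIONAL = {
    "sequence_org_other": ("sequence_org", "OTHER"),
    "data_steward_other": ("data_steward", "OTHER"),
    "sample_purpose_other": ("sample_purpose", "other"),
}
# YYYY-MM with a month from 01 to 12.
YEAR_MONTH = re.compile(r"[0-9]{4}-(0[1-9]|1[0-2])")


def check_optional(row):
    """Return a list of error messages for the optional fields of one row."""
    def get(field):
        return (row.get(field) or "").strip()

    errors = []
    for field, (parent, trigger) in CONDITIONAL.items():
        if get(field) and not get(parent):
            errors.append(f"{field} requires {parent}")
        if get(parent) == trigger and not get(field):
            errors.append(f"{field} is required when {parent} is {trigger}")
    if get("county") and not get("country"):
        errors.append("county requires country")
    for field in ("collection_date", "receipt_date"):
        if get(field) and not YEAR_MONTH.fullmatch(get(field)):
            errors.append(f"{field} must use the format YYYY-MM")
    if get("month"):
        try:
            month_ok = 1 <= int(get("month")) <= 12
        except ValueError:
            month_ok = False
        if not month_ok:
            errors.append("month must be an integer from 1 to 12")
    return errors
```

The sketch does not check the `county` choices, since the list of codes is truncated in the table.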