PATH-SAFE Uploader Specification
Files to be provided
- A FASTQ 1 file containing the forward sequencing reads.
- A FASTQ 2 file containing the reverse sequencing reads.
- A CSV file containing the metadata associated with sequencing the sample.
File naming convention
The base filenames should be of the form
pathsafe.[run_index].[run_id].[extension]
where:
- [run_index] is an identifier that is unique within a sequencing run, e.g. a sequencing barcode identifier, or a 96-well plate co-ordinate.
- [run_id] is the name of the sequencing run as given by the supplier's sequencing instrument (not an internal identifier assigned by the supplier).
- [extension] is the file extension indicating the file type.
File name extensions
The extensions ([extension]) should be:
- 1.fastq.gz for the forward FASTQ file.
- 2.fastq.gz for the reverse FASTQ file.
- csv for the CSV metadata file.
Valid characters
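As an illustration of the naming convention, the following minimal sketch builds the three expected filenames for one sample. The function name `expected_filenames` and the example values are hypothetical and not part of the specification.

```python
def expected_filenames(run_index: str, run_id: str) -> dict[str, str]:
    """Return the three upload filenames for one sample.

    Hypothetical helper: it only applies the
    pathsafe.[run_index].[run_id].[extension] pattern described above.
    """
    base = f"pathsafe.{run_index}.{run_id}"
    return {
        "fastq_forward": f"{base}.1.fastq.gz",
        "fastq_reverse": f"{base}.2.fastq.gz",
        "metadata": f"{base}.csv",
    }


# Example values are made up for illustration.
names = expected_filenames("A01", "RUN_1")
for kind, filename in names.items():
    print(kind, filename)
```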
The [run_index], [run_id] and [extension] must contain only:
- Letters (A-Z, a-z).
- Numbers (0-9).
- Hyphens (-).
- Underscores (_).
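The naming and character rules can be checked together with a regular expression. The sketch below is one possible approach, not the uploader's actual validation. It assumes that dots act only as separators between components, and it matches the three listed extensions literally because they themselves contain dots. The name `parse_filename` is hypothetical.

```python
import re

# Letters, numbers, hyphens and underscores only, as listed above.
_TOKEN = r"[A-Za-z0-9_-]+"

# Assumption: the three permitted extensions are matched literally,
# since "1.fastq.gz" and "2.fastq.gz" contain dots.
_FILENAME_RE = re.compile(
    rf"pathsafe\.(?P<run_index>{_TOKEN})\.(?P<run_id>{_TOKEN})\."
    r"(?P<extension>1\.fastq\.gz|2\.fastq\.gz|csv)"
)


def parse_filename(name: str):
    """Return (run_index, run_id, extension), or None if the name is invalid."""
    match = _FILENAME_RE.fullmatch(name)
    if match is None:
        return None
    return match.group("run_index"), match.group("run_id"), match.group("extension")


print(parse_filename("pathsafe.A01.RUN_1.1.fastq.gz"))
print(parse_filename("pathsafe.A 01.RUN_1.csv"))  # space is not a valid character
```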
Metadata specification
CSV Template
A CSV template for uploaders can be downloaded here: pathsafe-template.csv
Required fields
| Field | Data type | Description | Restrictions |
|---|---|---|---|
| biosample_id | text | The sequencing provider's identifier for a sample. | • Max length: 50 |
| run_index | text | The sequencing provider's identifier for the position of a sample on a run. | • Max length: 50 |
| run_id | text | The unique identifier assigned to the run by the sequencing instrument. | • Max length: 100 |
| submitted_species | choice | The NCBI taxonomy ID provided for the sample. | • Choices: 1639, 28901, 562 |
| year | integer | Year the sample was collected, if available; otherwise the year the sample was received. | • Min value: 2000 |
| data_steward | choice | Laboratory, organisation or agency that holds the data for the sample. | • Choices: APHA, FSA, FSS, OTHER, PHS, PHW, SEPA, SSSCDRL, UKHSA |
| source_type | choice | Source of the sample. | • Choices: animal, animal_associated_environment, environment, food, food_associated_environment, human, human_associated_environment, missing, not_applicable, not_collected, not_provided, other, other_environment, restricted_access |
| country | choice | The country in which the sample was collected, as an ISO-3166-1 alpha-2 code (https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes). For samples collected within the United Kingdom, use an ISO-3166-2:GB code instead (https://en.wikipedia.org/wiki/ISO_3166-2:GB). | • Choices: GB, GB-ENG, GB-NIR, GB-SCT, GB-WLS |
| sample_purpose | choice | The purpose of the sample collection. | • Choices: active_surveillance, not_applicable, not_collected, not_provided, other, outbreak_initiated_surveillance, outbreak_investigation, population_based_surveillance, research, restricted_access, routine_diagnostics, routine_surveillance |
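The required-field rules can be checked before upload. The sketch below is a minimal, hypothetical pre-check (the name `check_required` is not part of the specification). It assumes each CSV row has been read into a dictionary of strings, for example with `csv.DictReader`, and it uses only the restrictions listed in the table above.

```python
# Restrictions copied from the required-fields table.
MAX_LENGTH = {"biosample_id": 50, "run_index": 50, "run_id": 100}
CHOICES = {
    "submitted_species": {"1639", "28901", "562"},
    "data_steward": {"APHA", "FSA", "FSS", "OTHER", "PHS", "PHW",
                     "SEPA", "SSSCDRL", "UKHSA"},
    "source_type": {"animal", "animal_associated_environment", "environment",
                    "food", "food_associated_environment", "human",
                    "human_associated_environment", "missing",
                    "not_applicable", "not_collected", "not_provided",
                    "other", "other_environment", "restricted_access"},
    "country": {"GB", "GB-ENG", "GB-NIR", "GB-SCT", "GB-WLS"},
    "sample_purpose": {"active_surveillance", "not_applicable",
                       "not_collected", "not_provided", "other",
                       "outbreak_initiated_surveillance",
                       "outbreak_investigation",
                       "population_based_surveillance", "research",
                       "restricted_access", "routine_diagnostics",
                       "routine_surveillance"},
}


def check_required(row: dict) -> list[str]:
    """Return a list of problems found in the required fields of one row."""
    errors = []

    def get(field):
        # Treat absent keys, None and whitespace-only values as empty.
        return (row.get(field) or "").strip()

    for field, limit in MAX_LENGTH.items():
        if not get(field):
            errors.append(f"{field}: missing")
        elif len(get(field)) > limit:
            errors.append(f"{field}: longer than {limit} characters")

    for field, allowed in CHOICES.items():
        if not get(field):
            errors.append(f"{field}: missing")
        elif get(field) not in allowed:
            errors.append(f"{field}: {get(field)!r} is not an allowed choice")

    if not get("year"):
        errors.append("year: missing")
    else:
        try:
            valid_year = int(get("year")) >= 2000
        except ValueError:
            valid_year = False
        if not valid_year:
            errors.append("year: must be an integer of at least 2000")

    return errors


# Example row with made-up values.
example = {
    "biosample_id": "S1", "run_index": "A01", "run_id": "RUN_1",
    "submitted_species": "562", "year": "2023", "data_steward": "FSA",
    "source_type": "food", "country": "GB-ENG", "sample_purpose": "research",
}
print(check_required(example))
```

Returning a list of messages rather than raising on the first problem lets an uploader report every issue in a row at once.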
Optional fields
| Field | Data type | Description | Restrictions |
|---|---|---|---|
| biosample_source_id | text | Unique identifier for an individual, allowing multiple samples from the same individual to be linked. | • Max length: 50 |
| sample_accession | text | Sample accession number, if the sequence is publicly available in SRA. | |
| enterobase_barcode | text | Sample barcode, if the sequence is publicly available in EnteroBase. | |
| collection_date | date | Date of sample collection. | • Input formats: YYYY-MM • Output format: YYYY-MM |
| receipt_date | date | Date of receipt of the sample. | • Input formats: YYYY-MM • Output format: YYYY-MM |
| month | integer | Month the sample was collected, if available; otherwise the month the sample was received. | • Min value: 1 • Max value: 12 |
| sequence_org | choice | Laboratory, organisation or agency that sequenced the sample. | • Choices: APHA, FSA, FSS, OTHER, PHS, PHW, SEPA, SSSCDRL, UKHSA |
| sequence_org_other | text | Additional laboratory, organisation or agency that sequenced the sample. | • Requires: sequence_org • Required when sequence_org is: OTHER |
| data_steward_other | text | Additional laboratory, organisation or agency that holds the data for the sample. | • Requires: data_steward • Required when data_steward is: OTHER |
| county | choice | County in which the sample was collected, using the second-level subdivision codes of ISO-3166-2:GB (https://www.iso.org/obp/ui/#iso:code:3166:GB). | • Requires: country • Choices: GB-ABC, GB-ABD, GB-ABE, GB-AGB, GB-AGY, GB-AND, GB-ANN, GB-ANS, GB-BAS, GB-BBD, GB-BCP, GB-BDF, GB-BDG, GB-BEN, GB-BEX, GB-BFS, GB-BGE, GB-BGW, GB-BIR, GB-BKM, ... |
| sample_purpose_other | text | Additional purpose of the sample collection. | • Requires: sample_purpose • Required when sample_purpose is: other |
| sequencing_kit | text | The sequencing kit used. | |
| library_kit | text | The library kit used to prepare the sample. | |
| is_multiplexed | bool | Whether the sample was multiplexed. | |
| type_of_sample | choice | Type of sample used to produce the sequence. | • Default: genomic • Choices: genomic |
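Several optional fields depend on other fields. The sketch below shows one way these conditional rules, the YYYY-MM date format and the month range might be checked. It is hypothetical (`check_optional` is not part of the specification) and makes one assumption: "Requires: X" is read as "X must be populated whenever this field is populated". County choices are not checked because the list above is truncated.

```python
import re

# YYYY-MM with a month between 01 and 12.
YEAR_MONTH = re.compile(r"[0-9]{4}-(0[1-9]|1[0-2])")

# (dependent field, controlling field, value that makes the field mandatory)
REQUIRED_WHEN = [
    ("sequence_org_other", "sequence_org", "OTHER"),
    ("data_steward_other", "data_steward", "OTHER"),
    ("sample_purpose_other", "sample_purpose", "other"),
]


def check_optional(row: dict) -> list[str]:
    """Return a list of problems found in the optional fields of one row."""
    errors = []

    def get(field):
        # Treat absent keys, None and whitespace-only values as empty.
        return (row.get(field) or "").strip()

    for field, parent, trigger in REQUIRED_WHEN:
        # Assumed meaning of "Requires": the parent must be populated too.
        if get(field) and not get(parent):
            errors.append(f"{field}: requires {parent}")
        if get(parent) == trigger and not get(field):
            errors.append(f"{field}: required when {parent} is {trigger}")

    if get("county") and not get("country"):
        errors.append("county: requires country")

    for field in ("collection_date", "receipt_date"):
        if get(field) and not YEAR_MONTH.fullmatch(get(field)):
            errors.append(f"{field}: expected YYYY-MM")

    if get("month"):
        try:
            valid_month = 1 <= int(get("month")) <= 12
        except ValueError:
            valid_month = False
        if not valid_month:
            errors.append("month: must be an integer from 1 to 12")

    return errors


# Example with made-up values: data_steward is OTHER but no name is given.
print(check_optional({"data_steward": "OTHER", "collection_date": "2023-07"}))
```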