HPRU GRE TB Uploader Specification

Files to be provided

Suppliers must provide:

A VCF file containing the variant calls for the consensus sequence.
A FASTA file containing the consensus sequence in FASTA format.
A CSV file containing the metadata associated with sequencing the sample.

File naming convention

The base filenames should be of the form:

hprugretb.[run_index].[run_id].[extension]

where:

[run_index] is an identifier that is unique within a sequencing run, e.g. a sequencing barcode identifier, or a 96-well plate co-ordinate.
[run_id] is the name of the sequencing run as given by the supplier's sequencing instrument (not an internal identifier assigned by the supplier).
[extension] is the file extension indicating the file type.

ALL files must be uploaded to the root of the bucket; subdirectories cannot be used. Any file inside a subdirectory of a bucket will be ignored.
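A quick pre-upload check for the root-of-bucket rule might look like the sketch below. It assumes that the object store represents subdirectories as `/`-separated prefixes in the object key, which this specification does not state; `is_root_key` is a hypothetical helper name, and the example keys are illustrative only.

```python
def is_root_key(key: str) -> bool:
    """Return True if an object key sits at the bucket root.

    Assumption: subdirectories appear as '/'-separated prefixes in the key.
    """
    return bool(key) and "/" not in key


# Hypothetical example keys.
print(is_root_key("hprugretb.A01.run1.fasta"))         # True
print(is_root_key("batch1/hprugretb.A01.run1.fasta"))  # False
```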

File name extensions

The extensions ([extension]) should be:

vcf for the VCF file.
fasta for the FASTA file.
csv for the CSV metadata file.

Platforms

Because only consensus sequences are used in this project, the sequencing platform is less relevant, so there is only one "platform": noplatform.

Valid characters

The [run_index], [run_id] and [extension] must contain only:

Letters (A-Z, a-z).
Numbers (0-9).
Hyphens (-).
Underscores (_).
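The naming convention, the permitted extensions and the character rules above can be combined into a single check. The following is a minimal sketch, not the validation the service itself performs: it assumes that each component is non-empty and that extensions are lowercase, and `parse_filename` is a hypothetical helper name.

```python
import re

# Components may contain only letters, digits, hyphens and underscores.
# Assumption: each component must be at least one character long.
_COMPONENT = r"[A-Za-z0-9_-]+"
_FILENAME_RE = re.compile(
    rf"hprugretb\.(?P<run_index>{_COMPONENT})"
    rf"\.(?P<run_id>{_COMPONENT})"
    r"\.(?P<extension>vcf|fasta|csv)"
)


def parse_filename(name: str):
    """Return (run_index, run_id, extension), or None if the name is invalid."""
    match = _FILENAME_RE.fullmatch(name)
    if match is None:
        return None
    return match.group("run_index"), match.group("run_id"), match.group("extension")


# Hypothetical example names.
print(parse_filename("hprugretb.A01.run-1_x.vcf"))  # ('A01', 'run-1_x', 'vcf')
print(parse_filename("hprugretb.A 01.run1.csv"))    # None (space is not allowed)
print(parse_filename("hprugretb.A01.run1.txt"))     # None (unknown extension)
```

Because a dot is not a valid character inside a component, a name with extra dots fails the match instead of being split ambiguously.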

Buckets

Bucket names follow the general convention:

hprugretb-[sequencing_org]-noplatform-[test_flag]

If you upload your data to an incorrect bucket, it will not be processed or, in the worst case, may be processed incorrectly. It is your responsibility to ensure that your data is uploaded correctly!
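To reduce the risk of typing a bucket name by hand, the name can be assembled from its parts. This is only a sketch of the convention above: the permitted values for `[sequencing_org]` and `[test_flag]` are not listed in this specification, so the arguments in the example (`exampleorg`, `test`) are hypothetical placeholders, as is the helper name `bucket_name`.

```python
def bucket_name(sequencing_org: str, test_flag: str) -> str:
    """Assemble a bucket name following the documented convention."""
    if not sequencing_org or not test_flag:
        raise ValueError("sequencing_org and test_flag must be non-empty")
    return f"hprugretb-{sequencing_org}-noplatform-{test_flag}"


# Hypothetical values; use the ones assigned to your organisation.
print(bucket_name("exampleorg", "test"))  # hprugretb-exampleorg-noplatform-test
```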

Metadata specification

CSV Template

A CSV template for uploaders can be downloaded here: hprugretb-template.csv

Required fields

Field	Data type	Description	Restrictions
`run_index`	`text`	The sequencing provider's identifier for the position of a sample on a run.	• Max length: `50`
`run_id`	`text`	Unique identifier assigned to the run by the sequencing instrument.	• Max length: `100`
`platform`	`choice`	The platform used to sequence the data.	• Choices: `no_platform`
`guuid`	`text`	Sample ID assigned by Labkey.	• Max length: `50`
`organism`	`text`	The identified organism.	• Max length: `100`
`plate_name`	`text`	Name of the sequencing plate assigned by the laboratory scientist.	• Max length: `100`
`creation_date`	`date`	Date the sequencing associated record was created in the Labkey database.	• Input formats: `iso-8601` • Output format: `iso-8601`
`fasta_uri`	`text`	URI to the FASTA file in object storage.
`vcf_uri`	`text`	URI to the VCF file in object storage.
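The restrictions in the table above can be checked locally before uploading. The sketch below is an illustration, not the validation applied by the service: `validate_row` is a hypothetical helper, it treats `creation_date` as a calendar date parsed with `date.fromisoformat` (an assumption about which ISO 8601 forms are accepted), and the example values, including the URI scheme, are placeholders.

```python
from datetime import date

# Maximum lengths taken from the required-fields table.
MAX_LENGTHS = {
    "run_index": 50,
    "run_id": 100,
    "guuid": 50,
    "organism": 100,
    "plate_name": 100,
}
REQUIRED_FIELDS = [
    "run_index", "run_id", "platform", "guuid", "organism",
    "plate_name", "creation_date", "fasta_uri", "vcf_uri",
]


def validate_row(row: dict) -> list:
    """Return a list of human-readable problems found in one metadata row."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not row.get(field):
            problems.append(f"{field}: missing or empty")
    for field, limit in MAX_LENGTHS.items():
        if len(row.get(field) or "") > limit:
            problems.append(f"{field}: longer than {limit} characters")
    if row.get("platform") and row["platform"] != "no_platform":
        problems.append("platform: must be 'no_platform'")
    if row.get("creation_date"):
        try:
            # Assumption: the value is a plain calendar date such as 2024-01-15.
            date.fromisoformat(row["creation_date"])
        except ValueError:
            problems.append("creation_date: not an ISO 8601 date")
    return problems


# Hypothetical example row; every value here is a placeholder.
example = {
    "run_index": "A01",
    "run_id": "example_run",
    "platform": "no_platform",
    "guuid": "example-guuid",
    "organism": "Mycobacterium tuberculosis",
    "plate_name": "plate1",
    "creation_date": "2024-01-15",
    "fasta_uri": "example://bucket/hprugretb.A01.example_run.fasta",
    "vcf_uri": "example://bucket/hprugretb.A01.example_run.vcf",
}
print(validate_row(example))  # []
```

Returning a list of problems rather than raising on the first one lets an uploader fix every issue in a row in a single pass.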

Optional fields

Field	Data type	Description	Restrictions
`is_published`	`bool`	Indicator for whether an object has been published.	• Default: `True`
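When reading a metadata CSV, the default for `is_published` can be applied whenever the column is absent or left blank. This sketch assumes that booleans are written as the text `True` or `False` (compared case-insensitively), which this specification does not confirm; `read_metadata` is a hypothetical helper name.

```python
import csv
import io


def read_metadata(text: str) -> list:
    """Parse metadata CSV text, applying the documented default for is_published."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        raw = (row.get("is_published") or "").strip().lower()
        if raw == "":
            row["is_published"] = True  # documented default
        elif raw in ("true", "false"):  # assumed spellings
            row["is_published"] = raw == "true"
        else:
            raise ValueError(f"is_published: unrecognised value {raw!r}")
        rows.append(row)
    return rows


# Hypothetical two-column extract of a metadata file.
sample = "run_index,is_published\nA01,\nA02,False\n"
print([r["is_published"] for r in read_metadata(sample)])  # [True, False]
```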