All notable changes to CLIMB-TRE mSCAPE APIs, data or interchange formats that have impact to users or other pipelines should be documented in this file. Changes described here may only be a subset of all changes to a project as this log concerns itself only with changes that impact how data is provided or consumed by users or other pipelines. The following DIPI projects are routinely using this CHANGELOG.
Scylla
-- ingest analysis pipelineRoz
-- ingest managementOnyx
-- metadata databaseOnyx-client
-- API for interacting with metadata database
The format is based on Keep a Changelog.
Issues can be reported to the mSCAPE DIPI group.
2025-05-08¶
Scylla¶
Released version 2.0.0. Given the number of changes, they are grouped by category rather than Added/Changed etc.
HCID changes¶
- Add
min_coverage
parameter to HCID JSON - Update references in HCID JSON and reference file
- Update thresholds for HCID detection
- Drop requirement for classified reads at taxon/parent level for HCID to be detected (mapping sufficient)
- Output reads corresponding to HCIDs which have flagged a warning (NEW OUTPUT in
qc/<taxid>.reads.fq
) - Output read stats for HCID reads to the warning JSON (
mapped_mean_quality
andmapped_mean_length
) - Add coverage information for HCID found showing how many bases have coverage at each level - in HCID JSON
Extract taxa changes¶
- Reworked code to interact with kraken reports and assignment files during extract steps. Found a bug where some of the counts in the summary had previously been double counted (where both a S and S1 or S2 level taxa were extracted)
- Extract reads at different levels for different domains as specified by config (
F
for Viruses,G
for everything else) - Only extract reads at the specific level, not sublevels (e.g. S not S1 or S2)
- Add
total_len
calculated both for input and extracted output files in the summary JSON (NEW OUTPUTqc/total_length.json
) - Make extraction percentages domain-specific (e.g. 1% of bacterial reads rather than 1% of classified reads) to fix zepto example
- To extract a taxon, needs to pass count threshold OR the percentage threshold (previously both) and increase the count threshold for bacteria to 500
Workflow changes¶
- Add workflow to reclassify the viral+unclassified fraction with a second database
- In the process, the parameters associated with kraken databases have been restructured. Replace
--k2_host
with--kraken_database.default.host
,--k2_port
with--kraken_database.default.port
,--database
with--kraken_database.default.path
anddatabase_set
is nowkraken_database.default.name
. This allows a second dictionary of kraken parameters forkraken_database.virus
to be defined if/when necessary. - Add code to merge kraken assignment files, giving preference to second assignment file
- Add code to update kraken report, giving list of changes made to assignments
- Add a QC script to check the input file where a single fastq file is provided, so that it can warn if there are duplicate headers. This was seen in some example data and would cause big problems for the viral reclassification step when run, as read names need to be unique. If it finds duplicate/unexpectedly interleaved files, tries to correct them but then exists. The user can try rerunning with the fixed files. I considered silently handling but this approach seemed dangerous.
- Add messaging if paired reads provided and
--paired
not. - Add a workflow to run modules (use
--module <name>
) and remove workflow definitions from within these modules - Add a warning for incorrect Phred parsing as this is thought to be a resolved issue
Nextflow changes¶
- set
docker.userEmulation = true
Other changes¶
- Add to README more helpful
- Review all local test commands and make sure they run as expected.
2025-03-31¶
Onyx¶
Added¶
- Added
nuth
(Newcastle upon Tyne Hospitals NHS Foundation Trust) as an option in the mSCAPEsite
field.
2025-03-06¶
All¶
- Start of changelog