Scripts

Useful scripts for data analysis.

phenix

There is a single entry point to access all commands produced so far. Refer to this documentation or to --help on the command line.

usage: phenix [-h] [--debug] [--version]
              {run_snp_pipeline,filter_vcf,prepare_reference,vcf2fasta,vcf2distancematrix,vcf2json}
              ...

Positional Arguments

cmd Possible choices: run_snp_pipeline, filter_vcf, prepare_reference, vcf2fasta, vcf2distancematrix, vcf2json

Named Arguments

--debug

More verbose logging (default: turned off).

Default: False

--version show program’s version number and exit

Sub-commands:

run_snp_pipeline

Run the snp pipeline with specified mapper, variant caller and some filters.

Available mappers: [‘bwa’, ‘bowtie2’]

Available variant callers: [‘mpileup’, ‘gatk’]

Available filters: [‘gq_score’, ‘dp4_ratio’, ‘ad_ratio’, ‘min_depth’, ‘mq_score’, ‘mq0_ratio’, ‘uncall_gt’, ‘qual_score’, ‘mq0f_ratio’]

Available annotators: [‘coverage’]

phenix run_snp_pipeline [-h] [--workflow WORKFLOW] [--input INPUT] [-r1 R1]
                        [-r2 R2] [--reference REFERENCE]
                        [--sample-name SAMPLE_NAME] [--outdir OUTDIR]
                        [--config CONFIG] [--mapper MAPPER]
                        [--mapper-options MAPPER_OPTIONS] [--bam BAM]
                        [--variant VARIANT]
                        [--variant-options VARIANT_OPTIONS] [--vcf VCF]
                        [--filters FILTERS]
                        [--annotators ANNOTATORS [ANNOTATORS ...]]
                        [--keep-temp] [--json] [--json-info]
Named Arguments
--workflow, -w
--input, -i
-r1 R1/Forward read in Fastq format.
-r2 R2/Reverse read in Fastq format.
--reference, -r
 Reference to use for mapping.
--sample-name

Name of the sample for mapper to include as read groups.

Default: “test_sample”

--outdir, -o
--config, -c
--mapper, -m

Available mappers: [‘bwa’, ‘bowtie2’]

Default: “bwa”

--mapper-options
 Custom mapper options (advanced)
--bam
--variant, -v

Available variant callers: [‘mpileup’, ‘gatk’]

Default: “gatk”

--variant-options
 Custom variant options (advanced)
--vcf
--filters Filters to be applied to the VCF as key:value pairs, separated by commas (,). Available filters: [‘gq_score’, ‘dp4_ratio’, ‘ad_ratio’, ‘min_depth’, ‘mq_score’, ‘mq0_ratio’, ‘uncall_gt’, ‘qual_score’, ‘mq0f_ratio’]. Recommendations: GATK: mq_score:30,min_depth:10,ad_ratio:0.9; Mpileup: mq_score:30,min_depth:10,dp4_ratio:0.9
--annotators List of annotators to run before filters. Available: [‘coverage’]
--keep-temp

Keep intermediate files like BAMs and VCFs (default: False).

Default: False

--json

Also write variant positions in filtered vcf as json file (default: False).

Default: False

--json-info

When writing a json file, log some stats to stdout. (default: False).

Default: False
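
As a minimal sketch, a paired-end run_snp_pipeline call can be assembled from Python using only options listed above. The file names, the sample name and the helper build_pipeline_command are hypothetical; the default filter string is the GATK recommendation quoted for --filters. The command is only built and printed here, not executed.

```python
import shlex


def build_pipeline_command(r1, r2, reference, outdir, sample_name,
                           mapper="bwa", variant="gatk",
                           filters="mq_score:30,min_depth:10,ad_ratio:0.9"):
    """Return the argument list for a paired-end `phenix run_snp_pipeline` call.

    Hypothetical helper: it only arranges flags documented above.
    """
    return [
        "phenix", "run_snp_pipeline",
        "-r1", r1,
        "-r2", r2,
        "--reference", reference,
        "--sample-name", sample_name,
        "--mapper", mapper,
        "--variant", variant,
        "--filters", filters,
        "--outdir", outdir,
    ]


# Placeholder inputs; substitute real paths.
cmd = build_pipeline_command("sample_R1.fastq", "sample_R2.fastq",
                             "reference.fa", "results", "sample1")

# Print a shell-quoted version for inspection.
print(shlex.join(cmd))

# To actually run it (requires phenix on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.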

filter_vcf

Filter the VCF using provided filters.

phenix filter_vcf [-h] --vcf VCF [--filters FILTERS | --config CONFIG]
                  --output OUTPUT [--reference REFERENCE] [--only-good]
Named Arguments
--vcf, -v VCF file to (re)filter.
--filters, -f Filter(s) to apply as key:threshold pairs, separated by commas. Recommendations: GATK: mq_score:30,min_depth:10,ad_ratio:0.9; Mpileup: mq_score:30,min_depth:10,dp4_ratio:0.9
--config, -c Config with filters in YAML format. E.g. filters:-key:value
--output, -o Location for filtered VCF to be written.
--reference, -r
 mpileup versions <= 1.3 do not output all positions. The reference is required to fix the reference base in the VCF.
--only-good

Write only variants that PASS all filters (default all variants are written).

Default: False
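
The key:threshold syntax accepted by --filters can be checked before calling phenix. The sketch below is an assumption-laden illustration, not phenix's own parser: parse_filters is a hypothetical helper, the set of names is copied from the list of available filters in this documentation, and thresholds are kept as strings because how phenix interprets each value is not described here.

```python
# Filter names as listed in this documentation.
AVAILABLE_FILTERS = {
    "gq_score", "dp4_ratio", "ad_ratio", "min_depth", "mq_score",
    "mq0_ratio", "uncall_gt", "qual_score", "mq0f_ratio",
}


def parse_filters(spec):
    """Split 'key:value,key:value' into a dict, rejecting unknown names."""
    parsed = {}
    for item in spec.split(","):
        key, sep, value = item.strip().partition(":")
        if not sep or not value:
            raise ValueError("expected key:value, got %r" % item)
        if key not in AVAILABLE_FILTERS:
            raise ValueError("unknown filter %r" % key)
        parsed[key] = value  # kept as text; no type is assumed
    return parsed


# The GATK recommendation quoted above.
gatk_filters = parse_filters("mq_score:30,min_depth:10,ad_ratio:0.9")
print(gatk_filters)
```

A check like this catches misspelled filter names before a long pipeline run starts.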

prepare_reference

Prepare reference for SNP pipeline by generating required aux files.

phenix prepare_reference [-h] --reference REFERENCE [--mapper MAPPER]
                         [--variant VARIANT]
Named Arguments
--reference, -r
 Path to reference file to prepare.
--mapper Available mappers: [‘bwa’, ‘bowtie2’]
--variant Available variants: [‘mpileup’, ‘gatk’]

vcf2fasta

Combine multiple VCFs into a single FASTA file.

phenix vcf2fasta [-h] (--directory DIRECTORY | --input INPUT [INPUT ...])
                 [--regexp REGEXP] --out OUT [--with-mixtures WITH_MIXTURES]
                 [--column-Ns COLUMN_NS] [--column-gaps COLUMN_GAPS]
                 [--sample-Ns SAMPLE_NS] [--sample-gaps SAMPLE_GAPS]
                 [--sample-Ns-gaps-auto-factor SAMPLE_NS_GAPS_AUTO_FACTOR]
                 [--reference REFERENCE | --remove-invariant-npos]
                 [--reflength REFLENGTH]
                 [--include INCLUDE | --exclude EXCLUDE]
                 [--with-stats WITH_STATS]
Named Arguments
--directory, -d
 Path to the directory with .vcf files.
--input, -i List of VCF files to process.
--regexp Regular expression for finding VCFs in a directory.
--out, -o Path to the output FASTA file.
--with-mixtures
 Specify this option with a threshold to output mixtures above this threshold.
--column-Ns Keeps columns with fraction of Ns below specified threshold.
--column-gaps Keeps columns with fraction of gaps below specified threshold.
--sample-Ns Keeps samples with fraction of Ns below specified threshold or put ‘auto’. Fraction expressed as fraction of genome. Requires --reflength or --reference.
--sample-gaps Keeps samples with fraction of gaps below specified threshold or put ‘auto’. Fraction expressed as fraction of genome. Requires --reflength or --reference.
--sample-Ns-gaps-auto-factor
 

When using the ‘auto’ option for --sample-gaps or --sample-Ns, remove samples that have gaps or Ns this many times above the stddev of all samples. [Default: 2.0]

Default: 2.0

--reference If path to reference specified (FASTA), then whole genome will be written to alignment.
--remove-invariant-npos
 

Remove all positions that are invariant apart from N positions.

Default: False

--reflength Length of reference. Either as int or can be worked out from fasta file. Ignored if --reference is used.
--include Only include positions in BED file in the FASTA
--exclude Exclude any positions specified in the BED file.
--with-stats If a path is specified, then the positions of the output SNPs are stored in this file.
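
The column filter described for --column-Ns can be illustrated on a toy alignment. This is a sketch of the idea only, not phenix's implementation: keep_columns is a hypothetical helper, the alignment is invented, and "below" is assumed to mean strictly less than the threshold.

```python
def keep_columns(alignment, max_n_fraction):
    """Return indices of columns whose fraction of 'N' is below the threshold."""
    n_samples = len(alignment)
    kept = []
    for col in range(len(alignment[0])):
        n_count = sum(1 for seq in alignment if seq[col] == "N")
        # Assumption: strict comparison ("below" the threshold).
        if n_count / n_samples < max_n_fraction:
            kept.append(col)
    return kept


# Four samples, four alignment columns (invented data).
alignment = ["ACGT", "ANGT", "ANGN", "ANGT"]

kept = keep_columns(alignment, 0.5)
filtered = ["".join(seq[i] for i in kept) for seq in alignment]
print(kept)
print(filtered)
```

Here the second column is N in three of four samples, so it is dropped at a threshold of 0.5, while the last column (one N in four) is kept.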

vcf2distancematrix

Combine multiple VCFs into a distance matrix.

Distance measures according to five different models are available:

  • Number of differences (number_of_differences)
  • Jukes-Cantor distance (jc69)
  • Tajima-Nei distance (tn84)
  • Kimura 2-parameter distance (k80)
  • Tamura 3-parameter distance (t93)
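
To make the difference between a raw count and a model-corrected distance concrete, the sketch below computes the proportion of differing sites p between two sequences and the textbook Jukes-Cantor correction d = -3/4 ln(1 - 4p/3). It is an illustration under stated assumptions, not phenix's code: the helper names and sequences are invented, only A/C/G/T sites are assumed to be present, and how phenix scales or reports its distances is not covered here.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jc69_distance(p):
    """Jukes-Cantor corrected distance for a proportion of differences p."""
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Invented sequences differing at one site out of ten.
p = p_distance("ACGTACGTAC", "ACGTACGTAT")
d = jc69_distance(p)
print(p, round(d, 4))
```

The corrected distance is slightly larger than p because the model allows for multiple substitutions at the same site.
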
phenix vcf2distancematrix [-h]
                          (--directory DIRECTORY | --input INPUT [INPUT ...] | --alignment-input MULTI FASTA FILE)
                          --out OUT [--deletion STRING]
                          [--substitution STRING]
                          [--include BED FILE | --exclude BED FILE]
                          [--remove-recombination] [--refgenome FASTA FILE]
                          [--refgenomename STRING] [--threshold FLOAT]
                          [--threads INT] [--format STRING] [--tree FILE]
                          [--with-stats]
Named Arguments
--directory, -d
 Path to the directory with .vcf files. Input option 1.
--input, -i List of VCF files to process. Input option 2.
--alignment-input, -a
 Multi fasta file with whole genome input alignment. Input option 3.
--out, -o Path to the matrix output file in given format. [REQUIRED. default format is tab separated. use --format to change format]
--deletion

Possible choices: pairwise, complete

Method of recombination filtering. Either ‘pairwise’ or ‘complete’ [‘pairwise’]

Default: “pairwise”

--substitution

Possible choices: number_of_differences, jc69, k80, tn84, t93

Substitution model. Either ‘number_of_differences’, ‘jc69’, ‘k80’, ‘tn84’ or ‘t93’ [‘number_of_differences’]

Default: “number_of_differences”

--include Only include positions in BED file in the FASTA
--exclude Exclude any positions specified in the BED file.
--remove-recombination
 

Attempt to remove recombination from distance matrix. [don’t]

Default: False

--refgenome, -g
 Reference genome used for SNP calling [Required for recombination removal].
--refgenomename, -n
 Name of reference genome in input alignment [Required if input option 3 is used and reference is not named ‘reference’].
--threshold, -k
 

Density threshold above mean density for relevant pair. [1.0].

Default: 1.0

--threads

Number of threads to use. [1].

Default: 1

--format

Possible choices: tsv, csv, mega

Change format for output file. Available options csv, tsv and mega. [tsv]

Default: “tsv”

--tree, -t Make an NJ tree and write it to the given file in newick format. [Default: Don’t make tree, only matrix]
--with-stats

Write additional files with information on removed recombinant SNPs. [don’t]

Default: False

vcf2json

Converts the positions of variants and ignored/missing positions in either a ‘raw’ or filtered VCF file to a json string and writes it to a file. The json contains 6 arrays for each chromosome in the VCF file: g_positions, a_positions, t_positions, c_positions, gap_positions, n_positions
phenix vcf2json [-h] --input INPUT [--output_file_prefix OUTPUT_FILE_PREFIX]
                [--nozip] [--vcf_is_filtered] [--summary_info]
Named Arguments
--input, -i path to a VCF file
--output_file_prefix, -o
 

Path to the json output file (without file extension). Default: sample_name

Default: “sample_name”

--nozip, -z

Do not gzip json when writing file. (default: Yes, gzip it.)

Default: False

--vcf_is_filtered, -f
 

Required: Confirm that the input vcf is filtered. It is strongly recommended to filter the file with Phenix using the same parameters that are used throughout the database this json file is meant for.

Default: False

--summary_info, -s
 

Print summary of the json string

Default: False
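
The six arrays named in the description can be pictured with a small sketch. Everything beyond the array names is an assumption: the nesting of arrays under a chromosome key, the use of "-" for gaps, the chromosome name and the positions are all invented for illustration, and the real file layout written by vcf2json may differ. The data is gzip-compressed in memory to mirror the default behaviour described for --nozip.

```python
import gzip
import json

# Map a base call to the array it is assumed to belong to.
ARRAY_FOR_CALL = {
    "G": "g_positions",
    "A": "a_positions",
    "T": "t_positions",
    "C": "c_positions",
    "-": "gap_positions",  # assumption: '-' marks a gap
    "N": "n_positions",
}


def bucket_positions(calls):
    """Group a {position: call} mapping into the six named arrays."""
    arrays = {name: [] for name in ARRAY_FOR_CALL.values()}
    for position in sorted(calls):
        arrays[ARRAY_FOR_CALL[calls[position]]].append(position)
    return arrays


# Invented calls for a hypothetical chromosome "chr1".
payload = {"chr1": bucket_positions({12: "A", 40: "N", 41: "N", 97: "G", 130: "-"})}

# Compress and decompress in memory instead of touching the file system.
blob = gzip.compress(json.dumps(payload).encode("utf-8"))
restored = json.loads(gzip.decompress(blob).decode("utf-8"))
print(restored["chr1"]["n_positions"])
```

Storing only positions per call keeps the file small, since positions that match the reference do not need to be listed.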