Scripts¶
Useful scripts for data analysis.
phenix¶
There is a single entry point to access all commands we have produced so far. Please refer to this documentation or --help on the command line.
usage: phenix [-h] [--debug] [--version]
{run_snp_pipeline,filter_vcf,prepare_reference,vcf2fasta,vcf2distancematrix,vcf2json}
...
Positional Arguments¶
cmd | Possible choices: run_snp_pipeline, filter_vcf, prepare_reference, vcf2fasta, vcf2distancematrix, vcf2json |
Named Arguments¶
--debug | More verbose logging (default: turned off). Default: False |
--version | show program’s version number and exit |
Sub-commands:¶
run_snp_pipeline¶
Run the snp pipeline with specified mapper, variant caller and some filters.
Available mappers: [‘bwa’, ‘bowtie2’]
Available variant callers: [‘mpileup’, ‘gatk’]
Available filters: [‘gq_score’, ‘dp4_ratio’, ‘ad_ratio’, ‘min_depth’, ‘mq_score’, ‘mq0_ratio’, ‘uncall_gt’, ‘qual_score’, ‘mq0f_ratio’]
Available annotators: [‘coverage’]
phenix run_snp_pipeline [-h] [--workflow WORKFLOW] [--input INPUT] [-r1 R1]
[-r2 R2] [--reference REFERENCE]
[--sample-name SAMPLE_NAME] [--outdir OUTDIR]
[--config CONFIG] [--mapper MAPPER]
[--mapper-options MAPPER_OPTIONS] [--bam BAM]
[--variant VARIANT]
[--variant-options VARIANT_OPTIONS] [--vcf VCF]
[--filters FILTERS]
[--annotators ANNOTATORS [ANNOTATORS ...]]
[--keep-temp] [--json] [--json-info]
Named Arguments¶
--workflow, -w | |
--input, -i | |
-r1 | R1/Forward read in Fastq format. |
-r2 | R2/Reverse read in Fastq format. |
--reference, -r | Reference to use for mapping. |
--sample-name | Name of the sample for mapper to include as read groups. Default: “test_sample” |
--outdir, -o | |
--config, -c | |
--mapper, -m | Available mappers: [‘bwa’, ‘bowtie2’] Default: “bwa” |
--mapper-options | Custom mapper options (advanced) |
--bam | |
--variant, -v | Available variant callers: [‘mpileup’, ‘gatk’] Default: “gatk” |
--variant-options | Custom variant options (advanced) |
--vcf | |
--filters | Filters to be applied to the VCF in key:value pairs, separated by comma (,). Available_filters: [‘gq_score’, ‘dp4_ratio’, ‘ad_ratio’, ‘min_depth’, ‘mq_score’, ‘mq0_ratio’, ‘uncall_gt’, ‘qual_score’, ‘mq0f_ratio’]. Recommendations: GATK: mq_score:30,min_depth:10,ad_ratio:0.9 Mpileup: mq_score:30,min_depth:10,dp4_ratio:0.9 |
--annotators | List of annotators to run before filters. Available: [‘coverage’] |
--keep-temp | Keep intermediate files like BAMs and VCFs (default: False). Default: False |
--json | Also write variant positions in filtered vcf as json file (default: False). Default: False |
--json-info | When writing a json file, log some stats to stdout. (default: False). Default: False |
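As an illustration of the key:value filter syntax, the following Python sketch assembles a run_snp_pipeline command line from the options listed above. The helper names (format_filters, build_command) and all file paths are hypothetical; only the flags and the recommended GATK filter values come from this page. Nothing is executed; the sketch only prints the command.

```python
import shlex


def format_filters(filters):
    """Join a dict of filter thresholds into the key:value,key:value form."""
    return ",".join(f"{key}:{value}" for key, value in filters.items())


def build_command(r1, r2, reference, outdir, filters, mapper="bwa", variant="gatk"):
    """Return the argv list for a phenix run_snp_pipeline call (not executed here)."""
    return [
        "phenix", "run_snp_pipeline",
        "-r1", r1,
        "-r2", r2,
        "--reference", reference,
        "--outdir", outdir,
        "--mapper", mapper,
        "--variant", variant,
        "--filters", format_filters(filters),
    ]


# Recommended GATK thresholds quoted in the --filters help text.
gatk_filters = {"mq_score": 30, "min_depth": 10, "ad_ratio": 0.9}

# All paths below are placeholders for illustration.
command = build_command(
    "sample_R1.fastq", "sample_R2.fastq", "reference.fa", "results", gatk_filters
)
print(shlex.join(command))
```

The resulting list could be passed to subprocess.run once phenix is installed and the placeholder paths are replaced with real files.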
filter_vcf¶
Filter the VCF using provided filters.
phenix filter_vcf [-h] --vcf VCF [--filters FILTERS | --config CONFIG]
--output OUTPUT [--reference REFERENCE] [--only-good]
Named Arguments¶
--vcf, -v | VCF file to (re)filter. |
--filters, -f | Filter(s) to apply as key:threshold pairs, separated by comma. Recommendations: GATK: mq_score:30,min_depth:10,ad_ratio:0.9 Mpileup: mq_score:30,min_depth:10,dp4_ratio:0.9 |
--config, -c | Config with filters in YAML format. E.g. filters: - key: value |
--output, -o | Location for filtered VCF to be written. |
--reference, -r | mpileup versions <= 1.3 do not output all positions. The reference is required to fix the reference base in the VCF. |
--only-good | Write only variants that PASS all filters (default all variants are written). Default: False |
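The key:threshold syntax accepted by --filters can be checked before calling phenix. The sketch below is not PHEnix's own parser: parse_filters is a hypothetical helper, and it assumes every threshold is numeric, which may not hold for every filter. The set of filter names is copied from the run_snp_pipeline section.

```python
AVAILABLE_FILTERS = {
    "gq_score", "dp4_ratio", "ad_ratio", "min_depth", "mq_score",
    "mq0_ratio", "uncall_gt", "qual_score", "mq0f_ratio",
}


def parse_filters(spec):
    """Split 'key:threshold,key:threshold' into a dict of floats.

    Assumption: thresholds are numeric; malformed items and unknown
    filter names raise ValueError.
    """
    parsed = {}
    for item in spec.split(","):
        key, sep, value = item.strip().partition(":")
        if not sep:
            raise ValueError(f"expected key:threshold, got {item!r}")
        if key not in AVAILABLE_FILTERS:
            raise ValueError(f"unknown filter {key!r}")
        parsed[key] = float(value)
    return parsed


# Recommended Mpileup thresholds quoted in the --filters help text.
print(parse_filters("mq_score:30,min_depth:10,dp4_ratio:0.9"))
```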
prepare_reference¶
Prepare reference for SNP pipeline by generating required aux files.
phenix prepare_reference [-h] --reference REFERENCE [--mapper MAPPER]
[--variant VARIANT]
Named Arguments¶
--reference, -r | Path to reference file to prepare. |
--mapper | Available mappers: [‘bwa’, ‘bowtie2’] |
--variant | Available variant callers: [‘mpileup’, ‘gatk’] |
vcf2fasta¶
Combine multiple VCFs into a single FASTA file.
phenix vcf2fasta [-h] (--directory DIRECTORY | --input INPUT [INPUT ...])
[--regexp REGEXP] --out OUT [--with-mixtures WITH_MIXTURES]
[--column-Ns COLUMN_NS] [--column-gaps COLUMN_GAPS]
[--sample-Ns SAMPLE_NS] [--sample-gaps SAMPLE_GAPS]
[--sample-Ns-gaps-auto-factor SAMPLE_NS_GAPS_AUTO_FACTOR]
[--reference REFERENCE | --remove-invariant-npos]
[--reflength REFLENGTH]
[--include INCLUDE | --exclude EXCLUDE]
[--with-stats WITH_STATS]
Named Arguments¶
--directory, -d | Path to the directory with .vcf files. |
--input, -i | List of VCF files to process. |
--regexp | Regular expression for finding VCFs in a directory. |
--out, -o | Path to the output FASTA file. |
--with-mixtures | Specify this option with a threshold to output mixtures above this threshold. |
--column-Ns | Keeps columns with fraction of Ns below specified threshold. |
--column-gaps | Keeps columns with fraction of gaps below specified threshold. |
--sample-Ns | Keeps samples with fraction of Ns below specified threshold or put ‘auto’. Fraction expressed as fraction of genome. Requires --reflength or --reference. |
--sample-gaps | Keeps samples with fraction of gaps below specified threshold or put ‘auto’. Fraction expressed as fraction of genome. Requires --reflength or --reference. |
--sample-Ns-gaps-auto-factor | When using ‘auto’ option for --sample-gaps or --sample-Ns, remove samples that have gaps or Ns this many times above the stddev of all samples. Default: 2.0 |
--reference | If path to reference specified (FASTA), then whole genome will be written to alignment. |
--remove-invariant-npos | Remove all positions that are invariant apart from N positions. Default: False |
--reflength | Length of reference. Either as int or can be worked out from fasta file. Ignored if --reference is used. |
--include | Only include positions in BED file in the FASTA |
--exclude | Exclude any positions specified in the BED file. |
--with-stats | If a path is specified, then the positions of the output SNPs are stored in this file. |
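To make the --column-Ns threshold concrete, here is a minimal sketch of dropping alignment columns whose fraction of Ns is not below a threshold. It works on plain strings rather than VCFs; filter_columns is a hypothetical name, and treating the comparison as strictly "below" at the boundary is an assumption rather than documented phenix behaviour.

```python
def filter_columns(sequences, max_n_fraction):
    """Keep columns whose fraction of 'N' is strictly below max_n_fraction.

    sequences: dict of sample name -> aligned sequence (all equal length).
    """
    names = list(sequences)
    n_samples = len(names)
    length = len(sequences[names[0]])
    keep = [
        col for col in range(length)
        if sum(sequences[name][col] == "N" for name in names) / n_samples < max_n_fraction
    ]
    return {name: "".join(sequences[name][col] for col in keep) for name in names}


# Toy alignment: the third column is N in every sample.
alignment = {
    "sample_a": "ACNT",
    "sample_b": "ANNT",
    "sample_c": "ACNG",
}
print(filter_columns(alignment, 0.5))
```

With a threshold of 0.5 the all-N column is removed, while the column with a single N (one third of samples) is kept.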
vcf2distancematrix¶
Combine multiple VCFs into a distance matrix.
Distance measures according to five different models are available:
- Number of differences
- Jukes-Cantor distance (jc69)
- Tajima-Nei distance (tn84)
- Kimura 2-parameter distance (k80)
- Tamura 3-parameter distance (t93)
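For orientation, two of these models can be written down in a few lines. The sketch below uses the standard textbook formulas for the Jukes-Cantor (jc69) and Kimura 2-parameter (k80) distances; it is not taken from the PHEnix source. Skipping any site where either sequence is not A, C, G or T is an assumption about how missing data would be treated, and the function names are illustrative.

```python
import math


def proportions(seq_a, seq_b):
    """Return (transition fraction P, transversion fraction Q) over the
    sites where both sequences carry A, C, G or T."""
    purines = {"A", "G"}
    valid = transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps, Ns and other ambiguity codes
        valid += 1
        if a != b:
            if (a in purines) == (b in purines):
                transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
            else:
                transversions += 1
    if valid == 0:
        raise ValueError("no comparable sites")
    return transitions / valid, transversions / valid


def jc69(seq_a, seq_b):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3), p = fraction of differing sites."""
    p_ts, q_tv = proportions(seq_a, seq_b)
    p = p_ts + q_tv
    return -0.75 * math.log(1 - 4 * p / 3)


def k80(seq_a, seq_b):
    """Kimura 2-parameter distance: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    p_ts, q_tv = proportions(seq_a, seq_b)
    return -0.5 * math.log((1 - 2 * p_ts - q_tv) * math.sqrt(1 - 2 * q_tv))


# Toy pair: one transition (A->G) and one transversion (A->C) in ten sites.
first = "AAAAAAAAAA"
second = "GAAAAAAAAC"
print(jc69(first, second), k80(first, second))
```

Both corrections grow faster than the raw proportion of differences as sequences diverge, and both are undefined (the logarithm's argument becomes non-positive) for very distant pairs.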
phenix vcf2distancematrix [-h]
(--directory DIRECTORY | --input INPUT [INPUT ...] | --alignment-input MULTI FASTA FILE)
--out OUT [--deletion STRING]
[--substitution STRING]
[--include BED FILE | --exclude BED FILE]
[--remove-recombination] [--refgenome FASTA FILE]
[--refgenomename STRING] [--threshold FLOAT]
[--threads INT] [--format STRING] [--tree FILE]
[--with-stats]
Named Arguments¶
--directory, -d | Path to the directory with .vcf files. Input option 1. |
--input, -i | List of VCF files to process. Input option 2. |
--alignment-input, -a | Multi fasta file with whole genome input alignment. Input option 3. |
--out, -o | Path to the matrix output file in given format. [REQUIRED. default format is tab separated. use --format to change format] |
--deletion | Possible choices: pairwise, complete. Method of recombination filtering. Either ‘pairwise’ or ‘complete’ [‘pairwise’]. Default: “pairwise” |
--substitution | Possible choices: number_of_differences, jc69, k80, tn84, t93. Substitution model. Either ‘number_of_differences’, ‘jc69’, ‘k80’, ‘tn84’ or ‘t93’ [‘number_of_differences’]. Default: “number_of_differences” |
--include | Only include positions in BED file in the FASTA |
--exclude | Exclude any positions specified in the BED file. |
--remove-recombination | Attempt to remove recombination from distance matrix. [don’t] Default: False |
--refgenome, -g | Reference genome used for SNP calling [Required for recombination removal]. |
--refgenomename, -n | Name of reference genome in input alignment [Required if input option 3 is used and reference is not named ‘reference’]. |
--threshold, -k | Density threshold above mean density for relevant pair. [1.0]. Default: 1.0 |
--threads | Number of threads to use. [1]. Default: 1 |
--format | Possible choices: tsv, csv, mega. Change format for output file. Available options csv, tsv and mega. [tsv] Default: “tsv” |
--tree, -t | Make an NJ tree and write it to the given file in newick format. [Default: Don’t make tree, only matrix] |
--with-stats | Write additional files with information on removed recombinant SNPs. [don’t] Default: False |
vcf2json¶
Converts the positions of variants and ignored/missing positions in either a ‘raw’ or filtered VCF file to a json string and writes it to a file. The json contains 6 arrays for each chromosome in the VCF file: g_positions, a_positions, t_positions, c_positions, gap_positions, n_positions
phenix vcf2json [-h] --input INPUT [--output_file_prefix OUTPUT_FILE_PREFIX]
[--nozip] [--vcf_is_filtered] [--summary_info]
Named Arguments¶
--input, -i | path to a VCF file |
--output_file_prefix, -o | Path to the json output file (without file extension). Default: “sample_name” |
--nozip, -z | Do not gzip json when writing file. (default: Yes, gzip it.) Default: False |
--vcf_is_filtered, -f | Required: Confirm that the input vcf is filtered. It is strongly recommended to filter the file with Phenix using the same parameters that are used throughout the database this json file is meant for. Default: False |
--summary_info, -s | Print summary of the json string. Default: False |
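The shape of such a file can be sketched in Python. The six array names come from the description above; everything else is an assumption made for illustration, including the nesting (one object per chromosome keyed by chromosome name), the .json/.json.gz extensions, the example positions, and the helper names.

```python
import gzip
import json
import os
import tempfile

ARRAYS = ("g_positions", "a_positions", "t_positions", "c_positions",
          "gap_positions", "n_positions")


def empty_record():
    """One empty list per array name listed in the vcf2json description."""
    return {name: [] for name in ARRAYS}


def write_positions(data, prefix, compress=True):
    """Write data as JSON to prefix + '.json.gz', or prefix + '.json' if not compressed."""
    text = json.dumps(data)
    if compress:
        path = prefix + ".json.gz"
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.write(text)
    else:
        path = prefix + ".json"
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    return path


# Hypothetical chromosome name and positions.
record = {"chromosome_1": empty_record()}
record["chromosome_1"]["a_positions"].extend([105, 2210])
record["chromosome_1"]["n_positions"].append(42)

# Write to a temporary directory and read the file back to check the round trip.
with tempfile.TemporaryDirectory() as tmp:
    path = write_positions(record, os.path.join(tmp, "sample_name"))
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        restored = json.load(handle)
print(restored == record)
```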