phe.utils

Submodules

phe.utils.reader

Module contents

class BaseStats

Bases: object

Methods

update

update(position_data, sample, reference)

calculate_memory_for_sort()

Calculate available memory for samtools sort function. If there is enough memory, no temp files are created. Enough is defined as at least 1G per CPU.

Returns:	sort_memory: str or None String to use directly with -m option in sort, or None.
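The rule described above can be sketched as follows. This is a minimal illustration rather than the module's implementation: the name `sort_memory_sketch` and both of its arguments are hypothetical, the caller is assumed to supply the available memory in bytes and the CPU count, and reporting whole gigabytes with a `G` suffix is an assumption about the string format.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def sort_memory_sketch(available_bytes, n_cpus):
    """Return a per-CPU memory string such as '2G', or None.

    Hypothetical helper: follows the documented rule that at least
    1G per CPU must be available, otherwise None is returned.
    """
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    per_cpu_gib = available_bytes // n_cpus // GIB
    if per_cpu_gib < 1:
        return None  # not enough memory: caller falls back to defaults
    return "%dG" % per_cpu_gib
```

With 8 GiB available and 4 CPUs the sketch yields a 2G allowance per CPU; with 1 GiB shared between 4 CPUs it yields None.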

getTotalNofDiff_tn84(d)

Sum up the total number of differences for a nested dict of nucleotide pair counts like the example below.

Parameters:

d: dict

{‘A’: {‘A’: 1152.0, ‘C’: 114.0, ‘G’: 545.0, ‘T’: 35.0}, ‘C’: {‘A’: 0.0, ‘C’: 1233.0, ‘G’: 108.0, ‘T’: 467.0}, ‘G’: {‘A’: 0.0, ‘C’: 0.0, ‘G’: 1283.0, ‘T’: 100.0}, ‘T’: {‘A’: 0.0, ‘C’: 0.0, ‘G’: 0.0, ‘T’: 1177.0}}

Returns:

t: float

the sum of all differences (1369.0 for the example above)
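The figure of 1369.0 is the sum of the off-diagonal entries of the example, that is, every count whose two bases differ. A minimal sketch of that sum, using a hypothetical function name and not taken from the module source:

```python
def total_differences(d):
    """Sum every count whose two bases differ (off-diagonal entries)."""
    return sum(
        count
        for base_1, row in d.items()
        for base_2, count in row.items()
        if base_1 != base_2
    )


example = {
    "A": {"A": 1152.0, "C": 114.0, "G": 545.0, "T": 35.0},
    "C": {"A": 0.0, "C": 1233.0, "G": 108.0, "T": 467.0},
    "G": {"A": 0.0, "C": 0.0, "G": 1283.0, "T": 100.0},
    "T": {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1177.0},
}
print(total_differences(example))  # 1369.0
```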

get_difference_value(s1_base, s2_base, sSubs)

Get the difference value for a given pair of bases.

Parameters:

s1_base: str

a character

s2_base: str

a character

sSubs: str

distance model

Returns:

difference: float or list

depending on the distance model, either a float (1.0 or 0.0) or a list of two floats
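One way such a model-dependent return value could look is sketched below. Everything here is an assumption made for illustration: the function name, the model string `"k80"`, and in particular the reading of the two-float list as [transition, transversion]. The documentation above does not say which models return a list or what the two floats mean.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def difference_value_sketch(s1_base, s2_base, model):
    """Hypothetical stand-in for get_difference_value."""
    differ = s1_base != s2_base
    if model != "k80":
        # Simple count: 1.0 for a mismatch, 0.0 for a match.
        return 1.0 if differ else 0.0
    # Assumed two-float form: [transition, transversion].
    if not differ:
        return [0.0, 0.0]
    pair = {s1_base, s2_base}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return [1.0, 0.0] if same_class else [0.0, 1.0]
```

Under these assumptions an A/G mismatch is counted as a transition and an A/C mismatch as a transversion.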

get_dist_mat(aSampleNames, avail_pos, dArgs)

Calculate the distance matrix, optionally remove recombination from it, and optionally normalise it.

Parameters:

aSampleNames: list

list of sample names

avail_pos: dict

information on all available positions, for example:

{‘gi|194097589|ref|NC_011035.1|’: FastRBTree({
2329: {‘stats’: <vcf2distancematrix.BaseStats object at 0x40fb590>, ‘reference’: ‘A’, ‘211700_H15498026501’: ‘C’, ‘211701_H15510030401’: ‘C’, ‘211702_H15522021601’: ‘C’},
3837: {‘211700_H15498026501’: ‘G’, ‘stats’: <vcf2distancematrix.BaseStats object at 0x40fbf90>, ‘211701_H15510030401’: ‘G’, ‘reference’: ‘T’, ‘211702_H15522021601’: ‘G’},
4140: {‘211700_H15498026501’: ‘A’, ‘stats’: <vcf2distancematrix.BaseStats object at 0x40fb790>, ‘211701_H15510030401’: ‘A’, ‘reference’: ‘G’, ‘211702_H15522021601’: ‘A’}})}

dArgs: dict

input parameter dictionary as created by get_args()

Returns:

call to get_sample_pair_densities with parameters unpacked
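To make the shape of avail_pos concrete, the sketch below counts raw pairwise differences over a structure of that form. It is an illustration only: plain dicts stand in for the FastRBTree objects, the ‘stats’ entries are left out, the sample and contig names are invented, and recombination removal and normalisation are not attempted.

```python
from itertools import combinations


def pairwise_differences(avail_pos, sample_names):
    """Count differing positions for every pair of samples."""
    dist = {a: {b: 0 for b in sample_names} for a in sample_names}
    for contig_positions in avail_pos.values():
        for calls in contig_positions.values():
            for a, b in combinations(sample_names, 2):
                # Skip positions where either sample has no call.
                if a in calls and b in calls and calls[a] != calls[b]:
                    dist[a][b] += 1
                    dist[b][a] += 1
    return dist


# Invented example data in the documented layout.
avail_pos = {
    "contig_1": {
        2329: {"reference": "A", "s1": "C", "s2": "C", "s3": "A"},
        3837: {"reference": "T", "s1": "G", "s2": "T", "s3": "G"},
    }
}
matrix = pairwise_differences(avail_pos, ["s1", "s2", "s3"])
```

The result is a symmetric nested dict keyed by sample name, with zeros on the diagonal.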

get_ref_freqs(ref, len_only=False)

Get the length of the reference genome and, optionally, its nucleotide frequencies.

Parameters:

ref: str

reference genome filename

len_only: boolean

get genome length only [default False, in which case nucleotide frequencies are also computed]

Returns:

(dRefFreq, flGenLen): tuple

dRefFreq: dict: nucleotide frequencies, e.g. {‘A’: 0.25, ‘C’: 0.24, ‘G’: 0.26, ‘T’: 0.25}
flGenLen: float: genome length
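A minimal sketch of how such values could be derived from a FASTA file follows. It is not the module's code: the function name is hypothetical, every non-header character is assumed to count towards the genome length, and returning None in place of the frequency dict when len_only is set is an assumption.

```python
from collections import Counter


def ref_freqs_sketch(ref, len_only=False):
    """Return (frequencies or None, genome length) for a FASTA file."""
    counts = Counter()
    with open(ref) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # header line, not sequence
            counts.update(line.strip().upper())
    length = float(sum(counts.values()))
    if length == 0:
        raise ValueError("no sequence data found in %s" % ref)
    if len_only:
        return None, length  # assumed placeholder for the frequencies
    freqs = {base: counts[base] / length for base in "ACGT"}
    return freqs, length
```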

get_sample_pair_densities(sample_1, sample_2, oBT, flGenLen)

Calculate the number of differences in a window of a given size around each difference for a given pair of samples.

Parameters:

sample_1: str

name of sample 1

sample_2: str

name of sample 2

oBT: obj

bintree object that contains all information for all available positions for a given contig

flGenLen: float

reference genome length

Returns:

(diffs, d): tuple

diffs: int: total number of differences between the pair
d: dict: dict with position of difference as key and number of differences in window around it as value
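The windowed count can be sketched with a sorted list of difference positions and two binary searches per position. This is an illustration under stated assumptions: a plain dict stands in for the bintree, the `window` argument and its default are hypothetical (the signature above does not show where the window size comes from), each position is assumed to count itself, and flGenLen is not used.

```python
from bisect import bisect_left, bisect_right


def pair_densities_sketch(sample_1, sample_2, positions, window=1000):
    """Return (diffs, d) for one contig.

    positions: {pos: {sample_name: base, ...}}, a plain-dict stand-in
    for the bintree. window is a hypothetical parameter.
    """
    diff_pos = sorted(
        pos for pos, calls in positions.items()
        if sample_1 in calls and sample_2 in calls
        and calls[sample_1] != calls[sample_2]
    )
    half = window // 2
    d = {}
    for pos in diff_pos:
        lo = bisect_left(diff_pos, pos - half)
        hi = bisect_right(diff_pos, pos + half)
        d[pos] = hi - lo  # differences inside the window, including pos
    return len(diff_pos), d
```

Sorting once and bisecting keeps the cost at O(n log n) for n differences, instead of comparing every difference with every other one.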

is_uncallable(record)

Is the Record uncallable? Currently the record is uncallable iff:

GT field is ./.

LowQual is in the filter.

Returns:	uncall: bool True if any of the above items are true, False otherwise.
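The two conditions combine with a logical or. The sketch below uses a hypothetical two-field record instead of a real VCF record object, so the attribute names `gt` and `filters` are assumptions made for illustration only.

```python
from collections import namedtuple

# Hypothetical minimal record: a genotype string and a list of filters.
Record = namedtuple("Record", ["gt", "filters"])


def is_uncallable_sketch(record):
    """True if the genotype is ./. or LowQual is among the filters."""
    return record.gt == "./." or "LowQual" in (record.filters or [])
```

A record with gt "1/1" and no filters is callable in this sketch; adding "LowQual" to its filters or setting gt to "./." makes it uncallable.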

normalise_jc69(d, ref, names)

Normalise distance matrix according to the Jukes-Cantor distance model. See Nei and Zhang, Evolutionary Distance: Estimation, Encyclopedia of Life Sciences 2005, doi: 10.1038/npg.els.0005108, http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equation 7.

Parameters:

d: dict

distance matrix

ref: str

reference genome file name

names: list

list of sample names

Returns:

d: dict

normalised matrix
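For reference, the standard Jukes-Cantor correction maps a proportion p of differing sites to the distance -(3/4) ln(1 - 4p/3). A minimal sketch of the correction itself is given below; the function name is hypothetical, and how the module obtains p from the matrix (for instance by dividing difference counts by the reference genome length) is an assumption, not something stated above.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance for a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        # The logarithm is undefined once p reaches 3/4.
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected distance is always at least p and grows without bound as p approaches 0.75.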

normalise_k80(d, ref, names)

Normalise distance matrix according to the Kimura 2-parameter distance model. See Nei and Zhang, Evolutionary Distance: Estimation, Encyclopedia of Life Sciences 2005, doi: 10.1038/npg.els.0005108, http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equation 9.

Parameters:

d: dict

distance matrix

ref: str

reference genome file name

names: list

list of sample names

Returns:

d: dict

normalised matrix
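K80 is the usual abbreviation for Kimura's two-parameter model (Kimura, 1980), which treats transitions and transversions separately. With P the proportion of transitional differences and Q the proportion of transversional differences, the distance is -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). The sketch below implements only that formula; the function name is hypothetical and the way the module derives P and Q from its matrix is not shown in this documentation.

```python
import math


def k80_distance(P, Q):
    """Kimura two-parameter distance.

    P: proportion of transitional differences
    Q: proportion of transversional differences
    """
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("P and Q are too large for the K80 correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)
```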

normalise_t93(d, ref, names)

Normalise distance matrix according to the Tamura 3-parameter distance model. See Nei and Zhang, Evolutionary Distance: Estimation, Encyclopedia of Life Sciences 2005, doi: 10.1038/npg.els.0005108, http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equation 16.

Parameters:

d: dict

distance matrix

ref: str

reference genome file name

names: list

list of sample names

Returns:

d: dict

normalised matrix

normalise_tn84(d, ref, names)

Normalise distance matrix according to the Tajima-Nei distance model. See Nei and Zhang, Evolutionary Distance: Estimation, Encyclopedia of Life Sciences 2005, doi: 10.1038/npg.els.0005108, http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equations 13 and 14.

Parameters:

d: dict

distance matrix, for example:

d = {‘211701_H15510030401’: {‘211700_H15498026501’: {
‘A’: {‘A’: 1152.0, ‘C’: 114.0, ‘G’: 545.0, ‘T’: 35.0},
‘C’: {‘A’: 0.0, ‘C’: 1233.0, ‘G’: 108.0, ‘T’: 467.0},
‘G’: {‘A’: 0.0, ‘C’: 0.0, ‘G’: 1283.0, ‘T’: 100.0},
‘T’: {‘A’: 0.0, ‘C’: 0.0, ‘G’: 0.0, ‘T’: 1177.0}}, …}, …}

ref: str

reference genome file name

names: list

list of sample names

Returns:

d: dict

normalised matrix
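TN84 conventionally denotes the Tajima-Nei (1984) model, which is the one that needs a full 4x4 table of nucleotide pair counts like the dict shown above. Its distance is -b ln(1 - p/b), with b = (1/2)(1 - sum of q_i squared + p squared / c) and c the sum over unordered base pairs of x_ij squared / (2 q_i q_j), where p is the proportion of differing sites, q_i the average frequency of base i and x_ij the relative frequency of the pair (i, j). The sketch below applies that formula to one 4x4 table; the function name is hypothetical, and estimating the base frequencies from the table itself (instead of from the reference genome) is an assumption.

```python
import math

BASES = "ACGT"


def tn84_distance(counts):
    """Tajima-Nei distance from a 4x4 dict of nucleotide pair counts."""
    n = float(sum(counts[i][j] for i in BASES for j in BASES))
    # Unordered pair counts: fold the two triangles together.
    pair = {(i, j): counts[i][j] + counts[j][i]
            for a, i in enumerate(BASES) for j in BASES[a + 1:]}
    p = sum(pair.values()) / n
    if p == 0:
        return 0.0  # identical sequences
    # q[i]: average frequency of base i over both sequences.
    q = {i: (2 * counts[i][i]
             + sum(x for key, x in pair.items() if i in key)) / (2 * n)
         for i in BASES}
    c = sum((x / n) ** 2 / (2 * q[i] * q[j])
            for (i, j), x in pair.items() if x > 0)
    b = 0.5 * (1 - sum(v ** 2 for v in q.values()) + p ** 2 / c)
    return -b * math.log(1 - p / b)
```

Applied to the innermost table of the example above, the sketch returns a corrected distance somewhat larger than the raw proportion of differing sites, as expected for a multiple-hit correction.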

parse_vcf_files(dArgs, avail_pos, aSampleNames)

Parse VCF files to data structure.

Parameters:

dArgs: dict

input parameter dictionary as created by get_args()

avail_pos: dict

dict of bintrees for each contig

aSampleNames: list

list of sample names

Returns:

0; also writes all data to avail_pos

parse_wg_alignment(dArgs, avail_pos, aSampleNames)

Parse alignment to data structure.

Parameters:

dArgs: dict

input parameter dictionary as created by get_args()

avail_pos: dict

dict of bintrees for each contig

aSampleNames: list

list of sample names

Returns:

0; also writes all data to avail_pos

precompute_snp_densities(avail_pos, sample_names, args)

Precompute the number of differences around each difference between each pair of samples.

Parameters:

avail_pos: dict

data structure that contains the information on all available positions, like this:

{‘gi|194097589|ref|NC_011035.1|’: FastRBTree({
2329: {‘stats’: <vcf2distancematrix.BaseStats object at 0x40fb590>, ‘reference’: ‘A’, ‘211700_H15498026501’: ‘C’, ‘211701_H15510030401’: ‘C’, ‘211702_H15522021601’: ‘C’},
3837: {‘211700_H15498026501’: ‘G’, ‘stats’: <vcf2distancematrix.BaseStats object at 0x40fbf90>, ‘211701_H15510030401’: ‘G’, ‘reference’: ‘T’, ‘211702_H15522021601’: ‘G’},
4140: {‘211700_H15498026501’: ‘A’, ‘stats’: <vcf2distancematrix.BaseStats object at 0x40fb790>, ‘211701_H15510030401’: ‘A’, ‘reference’: ‘G’, ‘211702_H15522021601’: ‘A’}})}

sample_names: list

list of sample names

args: dict

input parameter dictionary as created by get_args()

Returns:

dDen: dict

contains the differences between a pair in a window of given size around each difference of the pair, for example:

{‘diffs’: {‘187534_H153520399-1’: {‘187534_H153520399-1’: 0, ‘187536_H154060132-1’: 1609, ‘189918_H154320283-2’: 295, ‘205683_H15352039901’: 0, ‘211698_H15464036401’: 298, ‘211700_H15498026501’: 298, ‘211701_H15510030401’: 1621, ‘211702_H15522021601’: 297, ‘211703_H15534021301’: 1632, ‘reference’: 4045},
‘187536_H154060132-1’: {‘187536_H154060132-1’: 0, ‘205683_H15352039901’: 1605, ‘211698_H15464036401’: 1353, ‘211701_H15510030401’: 1, ‘211702_H15522021601’: 1351, ‘211703_H15534021301’: 7, ‘reference’: 5041} …, },
‘gi|194097589|ref|NC_011035.1|’: {‘187534_H153520399-1’: {‘187536_H154060132-1’: {55959: 1, 56617: 1, 157165: 1, 279950: 3, 279957: 3, 279959: 3, 608494: 22, 608537: 23, 608551: 23, 608604: 23, 608617: 24, …,}
‘189918_H154320283-2’: {27696: 1, 55959: 1, 56617: 2, 56695: 2, 279950: 3, 279957: 3, 279959: 3, 520610: 1, 608494: 22, …,