phe.utils

Submodules

Module contents

class BaseStats[source]

Bases: object

Methods

update  
update(position_data, sample, reference)[source]
calculate_memory_for_sort()[source]

Calculate available memory for samtools sort function. If there is enough memory, no temp files are created. Enough is defined as at least 1G per CPU.

Returns:
sort_memory: str or None

String to use directly with -m option in sort, or None.
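
The "at least 1G per CPU" rule above can be sketched as follows. This is a minimal illustration, not the module's implementation: the function name `sort_memory_option`, its two arguments, and the exact format of the returned string are all assumptions. The only external fact relied on is that the `-m` option of samtools sort accepts a size with a `G` suffix.

```python
def sort_memory_option(available_bytes, n_cpus):
    """Return a per-CPU memory string for `sort -m`, or None.

    Hypothetical helper: the caller supplies the available memory in
    bytes and the number of CPUs; how the real function obtains these
    values is not shown on this page.
    """
    gib = 1024 ** 3
    per_cpu = available_bytes // n_cpus
    if per_cpu < gib:
        # Less than 1G per CPU: not "enough", so no option is returned.
        return None
    # Round down to whole gigabytes per CPU.
    return "%dG" % (per_cpu // gib)


print(sort_memory_option(8 * 1024 ** 3, 4))  # 2G
print(sort_memory_option(2 * 1024 ** 3, 4))  # None
```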

getTotalNofDiff_tn84(d)[source]

Sum up total number of differences for a dict like the one in the input

Parameters:
d: dict
{'A': {'A': 1152.0, 'C': 114.0, 'G': 545.0, 'T': 35.0},
 'C': {'A': 0.0, 'C': 1233.0, 'G': 108.0, 'T': 467.0},
 'G': {'A': 0.0, 'C': 0.0, 'G': 1283.0, 'T': 100.0},
 'T': {'A': 0.0, 'C': 0.0, 'G': 0.0, 'T': 1177.0}}

Returns:
t: float

the sum of all differences, i.e. of all entries where the two bases differ (1369.0 in the example above)
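
The summation can be sketched as adding up every off-diagonal entry of the nested dict. The helper name `total_differences` is hypothetical and the code is an illustration of the described behaviour, not the module's source.

```python
def total_differences(d):
    """Sum all counts whose outer and inner base keys differ."""
    return sum(
        count
        for base_1, row in d.items()
        for base_2, count in row.items()
        if base_1 != base_2
    )


example = {
    'A': {'A': 1152.0, 'C': 114.0, 'G': 545.0, 'T': 35.0},
    'C': {'A': 0.0, 'C': 1233.0, 'G': 108.0, 'T': 467.0},
    'G': {'A': 0.0, 'C': 0.0, 'G': 1283.0, 'T': 100.0},
    'T': {'A': 0.0, 'C': 0.0, 'G': 0.0, 'T': 1177.0},
}

print(total_differences(example))  # 1369.0
```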

get_difference_value(s1_base, s2_base, sSubs)[source]

Get difference value for a given set of bases.

Parameters:
s1_base: str

a character

s2_base: str

a character

sSubs: str

distance model

Returns:
difference: float or list

depending on the distance model, either a float (1.0 or 0.0) or a list of two floats

get_dist_mat(aSampleNames, avail_pos, dArgs)[source]

Calculates the distance matrix, optionally removes recombination from it and optionally normalises it

Parameters:
aSampleNames: list

list of sample names

avail_pos: dict

information on all available positions, for example:

{'gi|194097589|ref|NC_011035.1|':
    FastRBTree({2329: {'stats': <vcf2distancematrix.BaseStats object at 0x40fb590>,
                       'reference': 'A',
                       '211700_H15498026501': 'C',
                       '211701_H15510030401': 'C',
                       '211702_H15522021601': 'C'},
                3837: {'211700_H15498026501': 'G',
                       'stats': <vcf2distancematrix.BaseStats object at 0x40fbf90>,
                       '211701_H15510030401': 'G',
                       'reference': 'T',
                       '211702_H15522021601': 'G'},
                4140: {'211700_H15498026501': 'A',
                       'stats': <vcf2distancematrix.BaseStats object at 0x40fb790>,
                       '211701_H15510030401': 'A',
                       'reference': 'G',
                       '211702_H15522021601': 'A'}})}

dArgs: dict

input parameter dictionary as created by get_args()

Returns:
call to get_sample_pair_densities with parameters unpacked
get_ref_freqs(ref, len_only=False)[source]

Get the length of the reference genome and optionally the nucleotide frequencies in it

Parameters:
ref: str

reference genome filename

len_only: boolean

get the genome length only [default: False, i.e. the nucleotide frequencies are also computed]

Returns:
(dRefFreq, flGenLen): tuple
dRefFreq: dict

dRefFreq = {‘A’: 0.25, ‘C’: 0.24, ‘G’: 0.26, ‘T’: 0.25}

flGenLen: float

genome length
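
A minimal sketch of the computation described above, working on FASTA lines that are already in memory. The helper name `ref_freqs_from_fasta_lines`, the choice to compute frequencies over A, C, G and T only, and returning None in place of the frequency dict when `len_only` is set are assumptions made for illustration; the real function takes a filename and may behave differently.

```python
from collections import Counter


def ref_freqs_from_fasta_lines(lines, len_only=False):
    """Return (frequency dict or None, genome length as float)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            # Skip blank lines and FASTA header lines.
            continue
        counts.update(line.upper())

    gen_len = float(sum(counts.values()))
    if len_only:
        return None, gen_len

    total_acgt = sum(counts[base] for base in "ACGT")
    if total_acgt == 0:
        raise ValueError("no A, C, G or T found in the reference")
    freqs = {base: counts[base] / total_acgt for base in "ACGT"}
    return freqs, gen_len


fasta = [">contig_1", "AACC", "GGTT"]
print(ref_freqs_from_fasta_lines(fasta))
```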

get_sample_pair_densities(sample_1, sample_2, oBT, flGenLen)[source]

Calculate the number of differences in a window of a given size around each difference for a given pair of samples

Parameters:
sample_1: str

name of sample 1

sample_2: str

name of sample 2

oBT: obj

bintree object that contains all information for all available positions for a given contig

flGenLen: float

reference genome length

Returns:
(diffs, d): tuple
diffs: int

total number of differences between the pair

d: dict

dict with position of difference as key and number of differences in window around it as value
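
The windowed count can be sketched with a sorted list of difference positions and a binary search. Everything named here is hypothetical: `pair_densities`, the plain dict standing in for the bintree, and the `window` parameter (the page does not state the window size). The sketch counts the difference itself as part of its window and ignores the genome length, so it does not handle windows that wrap around a circular genome.

```python
from bisect import bisect_left, bisect_right


def pair_densities(positions, sample_1, sample_2, window=1000):
    """Return (number of differences, {position: differences in window})."""
    # Positions where both samples have a call and the calls differ.
    diff_pos = sorted(
        pos
        for pos, calls in positions.items()
        if sample_1 in calls
        and sample_2 in calls
        and calls[sample_1] != calls[sample_2]
    )

    half = window // 2
    d = {}
    for pos in diff_pos:
        # Count differences within [pos - half, pos + half], inclusive.
        lo = bisect_left(diff_pos, pos - half)
        hi = bisect_right(diff_pos, pos + half)
        d[pos] = hi - lo
    return len(diff_pos), d


positions = {
    100: {'s1': 'A', 's2': 'C'},
    200: {'s1': 'G', 's2': 'T'},
    5000: {'s1': 'A', 's2': 'G'},
    6000: {'s1': 'A', 's2': 'A'},
}
print(pair_densities(positions, 's1', 's2'))
# (3, {100: 2, 200: 2, 5000: 1})
```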

is_uncallable(record)[source]

Is the Record uncallable? Currently the record is uncallable iff:

  • GT field is ./.
  • LowQual is in the filter.
Returns:
uncall: bool

True if any of the above items are true, False otherwise.
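
The two conditions can be sketched on plain values rather than on a VCF record object. The helper name `is_uncallable_sketch` and its arguments (a genotype string and a list of filter names) are assumptions for illustration; how the real function reads these fields from `record` is not shown on this page.

```python
def is_uncallable_sketch(gt, filters):
    """True if the genotype is './.' or 'LowQual' is among the filters."""
    return gt == "./." or "LowQual" in (filters or [])


print(is_uncallable_sketch("./.", []))            # True
print(is_uncallable_sketch("1/1", ["LowQual"]))   # True
print(is_uncallable_sketch("1/1", None))          # False
```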

normalise_jc69(d, ref, names)[source]

Normalise distance matrix according to the Jukes-Cantor distance model.

See: Nei and Zhang: Evolutionary Distance: Estimation, Encyclopedia of Life Sciences 2005, doi: 10.1038/npg.els.0005108 http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equation 7

Parameters:
d: dict

distance matrix

ref: str

reference genome file name

names: list

list of sample names

Returns:
d: dict

normalised matrix
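
For a single pair of samples, the Jukes-Cantor correction in its commonly cited form is d = -(3/4) ln(1 - 4p/3), where p is the proportion of sites that differ. The sketch below applies that formula to one value; the helper name `jc69_distance` is hypothetical, and how the module derives p from the matrix entries and the reference genome is an assumption not confirmed by this page.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance for a proportion of differing sites p."""
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


print(jc69_distance(0.1))
```

The corrected distance is always at least as large as p, since it accounts for multiple substitutions at the same site.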

normalise_k80(d, ref, names)[source]

Normalise distance matrix according to the Kimura 2-parameter distance model.

See: Nei and Zhang: Evolutionary Distance: Estimation, Encyclopedia of Life Sciences 2005, doi: 10.1038/npg.els.0005108 http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equation 9

Parameters:
d: dict

distance matrix

ref: str

reference genome file name

names: list

list of sample names

Returns:
d: dict

normalised matrix

normalise_t93(d, ref, names)[source]

Normalise distance matrix according to the Tamura 3-parameter distance model.

See: Nei and Zhang: Evolutionary Distance: Estimation, Encyclopedia of Life Sciences 2005, doi: 10.1038/npg.els.0005108 http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equation 16

Parameters:
d: dict

distance matrix

ref: str

reference genome file name

names: list

list of sample names

Returns:
d: dict

normalised matrix
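
The Tamura 3-parameter correction is commonly written as d = -h ln(1 - P/h - Q) - (1/2)(1 - h) ln(1 - 2Q), with h = 2θ(1 - θ), where P and Q are the proportions of transitional and transversional differences and θ is the G+C content. The sketch below applies this form to one pair; it should be checked against equation 16 of the cited article. The helper name `t92_distance` is hypothetical, and taking θ from the reference genome is an assumption suggested by the `ref` parameter.

```python
import math


def t92_distance(P, Q, gc):
    """Tamura 3-parameter distance for one pair of sequences.

    P: proportion of transitions, Q: proportion of transversions,
    gc: G+C content as a fraction strictly between 0 and 1.
    """
    h = 2.0 * gc * (1.0 - gc)
    if h == 0.0:
        raise ValueError("gc must be strictly between 0 and 1")
    arg_1 = 1.0 - P / h - Q
    arg_2 = 1.0 - 2.0 * Q
    if arg_1 <= 0.0 or arg_2 <= 0.0:
        # The correction is undefined for saturated sequences.
        return float("inf")
    return -h * math.log(arg_1) - 0.5 * (1.0 - h) * math.log(arg_2)


print(t92_distance(0.05, 0.02, 0.5))
```

With a G+C content of 0.5 the expression reduces to the Kimura 2-parameter distance, which is a convenient sanity check.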

normalise_tn84(d, ref, names)[source]

Normalise distance matrix according to the Tajima-Nei distance model.

See: Nei and Zhang: Evolutionary Distance: Estimation, Encyclopedia of Life Sciences 2005, doi: 10.1038/npg.els.0005108 http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equations 13 and 14

Parameters:
d: dict

distance matrix, for example:

d = {'211701_H15510030401': {'211700_H15498026501': {'A': {'A': 1152.0, 'C': 114.0, 'G': 545.0, 'T': 35.0},
                                                     'C': {'A': 0.0, 'C': 1233.0, 'G': 108.0, 'T': 467.0},
                                                     'G': {'A': 0.0, 'C': 0.0, 'G': 1283.0, 'T': 100.0},
                                                     'T': {'A': 0.0, 'C': 0.0, 'G': 0.0, 'T': 1177.0}}, …}, …}

ref: str

reference genome file name

names: list

list of sample names

Returns:
d: dict

normalised matrix

parse_vcf_files(dArgs, avail_pos, aSampleNames)[source]

Parse vcf files to data structure

Parameters:
dArgs: dict

input parameter dictionary as created by get_args()

avail_pos: dict

dict of bintrees for each contig

aSampleNames: list

list of sample names

Returns:
0; also writes all data to avail_pos

parse_wg_alignment(dArgs, avail_pos, aSampleNames)[source]

Parse alignment to data structure

Parameters:
dArgs: dict

input parameter dictionary as created by get_args()

avail_pos: dict

dict of bintrees for each contig

aSampleNames: list

list of sample names

Returns:
0; also writes all data to avail_pos

precompute_snp_densities(avail_pos, sample_names, args)[source]

Precompute the number of differences around each difference between each pair of samples

Parameters:
avail_pos: dict

data structure that contains the information on all available positions, like this:

{'gi|194097589|ref|NC_011035.1|':
    FastRBTree({2329: {'stats': <vcf2distancematrix.BaseStats object at 0x40fb590>,
                       'reference': 'A',
                       '211700_H15498026501': 'C',
                       '211701_H15510030401': 'C',
                       '211702_H15522021601': 'C'},
                3837: {'211700_H15498026501': 'G',
                       'stats': <vcf2distancematrix.BaseStats object at 0x40fbf90>,
                       '211701_H15510030401': 'G',
                       'reference': 'T',
                       '211702_H15522021601': 'G'},
                4140: {'211700_H15498026501': 'A',
                       'stats': <vcf2distancematrix.BaseStats object at 0x40fb790>,
                       '211701_H15510030401': 'A',
                       'reference': 'G',
                       '211702_H15522021601': 'A'}})}

sample_names: list

list of sample names

args: dict

input parameter dictionary as created by get_args()

Returns:
dDen: dict

contains the differences between a pair in a window of given size around each difference of the pair, for example:

{'diffs': {'187534_H153520399-1': {'187534_H153520399-1': 0,
                                   '187536_H154060132-1': 1609,
                                   '189918_H154320283-2': 295,
                                   '205683_H15352039901': 0,
                                   '211698_H15464036401': 298,
                                   '211700_H15498026501': 298,
                                   '211701_H15510030401': 1621,
                                   '211702_H15522021601': 297,
                                   '211703_H15534021301': 1632,
                                   'reference': 4045},
           '187536_H154060132-1': {'187536_H154060132-1': 0,
                                   '205683_H15352039901': 1605,
                                   '211698_H15464036401': 1353,
                                   '211701_H15510030401': 1,
                                   '211702_H15522021601': 1351,
                                   '211703_H15534021301': 7,
                                   'reference': 5041} …, },
 'gi|194097589|ref|NC_011035.1|': {'187534_H153520399-1': {'187536_H154060132-1': {55959: 1,
                                                                                   56617: 1, 157165: 1, 279950: 3, 279957: 3, 279959: 3, 608494: 22, 608537: 23, 608551: 23, 608604: 23, 608617: 24, …,}
                                                           '189918_H154320283-2': {27696: 1,
                                                                                   55959: 1, 56617: 2, 56695: 2, 279950: 3, 279957: 3, 279959: 3, 520610: 1, 608494: 22, …,