phe.utils
Submodules
Module contents
-
calculate_memory_for_sort()
Calculate available memory for the samtools sort function. If there is enough memory, no temp files are created. Enough is defined as at least 1G per CPU.
Returns: - sort_memory: str or None
String to use directly with the -m option in sort, or None.
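The rule above can be sketched as follows. This is a minimal illustration, not the phe.utils implementation: the helper name `sort_memory_option`, the explicit `available_bytes` argument, and the exact format of the returned string are assumptions. The only details taken from the description are the "at least 1G per CPU" threshold and the `None` fallback.

```python
import os

GIB = 1024 ** 3  # bytes in 1 GiB


def sort_memory_option(available_bytes, n_cpus=None):
    """Return a value for `samtools sort -m`, or None if memory is short.

    Hypothetical helper: the threshold follows the "at least 1G per CPU"
    rule; the string format is an assumption, not the library's output.
    """
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    per_cpu_gib = available_bytes // (n_cpus * GIB)
    if per_cpu_gib < 1:
        return None  # not enough memory: let samtools fall back to temp files
    return "%dG" % per_cpu_gib


print(sort_memory_option(16 * GIB, n_cpus=4))  # 4G
print(sort_memory_option(2 * GIB, n_cpus=4))   # None
```

Passing the available memory in as an argument keeps the sketch independent of any particular way of querying the operating system.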
-
getTotalNofDiff_tn84(d)
Sum up the total number of differences for a dict like the one in the input.
Parameters: - d: dict
{'A': {'A': 1152.0, 'C': 114.0, 'G': 545.0, 'T': 35.0},
'C': {'A': 0.0, 'C': 1233.0, 'G': 108.0, 'T': 467.0},
'G': {'A': 0.0, 'C': 0.0, 'G': 1283.0, 'T': 100.0},
'T': {'A': 0.0, 'C': 0.0, 'G': 0.0, 'T': 1177.0}}
Returns: - t: float
the sum of all differences (1369.0 in the above example case)
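As a sketch of what this sum amounts to, the function below adds up every off-diagonal entry of the nested dict, that is, every count where the two bases differ. The name `total_differences` is hypothetical. Reading "differences" as the off-diagonal entries is an assumption, though it does reproduce the 1369.0 quoted for the example.

```python
def total_differences(d):
    """Sum every count whose row base differs from its column base."""
    return sum(
        count
        for base_1, row in d.items()
        for base_2, count in row.items()
        if base_1 != base_2
    )


example = {
    "A": {"A": 1152.0, "C": 114.0, "G": 545.0, "T": 35.0},
    "C": {"A": 0.0, "C": 1233.0, "G": 108.0, "T": 467.0},
    "G": {"A": 0.0, "C": 0.0, "G": 1283.0, "T": 100.0},
    "T": {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1177.0},
}

print(total_differences(example))  # 1369.0
```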
-
get_difference_value(s1_base, s2_base, sSubs)
Get difference value for a given set of bases.
Parameters: - s1_base: str
a character
- s2_base: str
a character
- sSubs: str
distance model
Returns: - difference: float or list
depending on the distance model, either a float (1.0 or 0.0) or a list of two floats
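One way such a function could behave is sketched below. Everything specific here is an assumption made for illustration: the model name "k80", the meaning of the two floats (transition, then transversion), and the helper name `difference_value_sketch`. The description above only states that the result is a float or a list of two floats.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def difference_value_sketch(s1_base, s2_base, model):
    """Return a difference value for two bases (illustrative only).

    Assumption: a two-parameter model separates transitions from
    transversions and returns [transition, transversion]; any other
    model returns a plain 1.0 / 0.0 flag.
    """
    if model == "k80":  # hypothetical model identifier
        if s1_base == s2_base:
            return [0.0, 0.0]
        pair = {s1_base, s2_base}
        is_transition = pair <= PURINES or pair <= PYRIMIDINES
        return [1.0, 0.0] if is_transition else [0.0, 1.0]
    return 0.0 if s1_base == s2_base else 1.0


print(difference_value_sketch("A", "G", "k80"))  # [1.0, 0.0]
print(difference_value_sketch("A", "C", "k80"))  # [0.0, 1.0]
print(difference_value_sketch("A", "C", "number_of_differences"))  # 1.0
```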
-
get_dist_mat(aSampleNames, avail_pos, dArgs)
Calculates the distance matrix, optionally removes recombination from it, and optionally normalises it.
Parameters: - aSampleNames: list
list of sample names
- avail_pos: dict
information on all available positions:
{'gi|194097589|ref|NC_011035.1|': FastRBTree({
2329: {'stats': <vcf2distancematrix.BaseStats object at 0x40fb590>, 'reference': 'A', '211700_H15498026501': 'C', '211701_H15510030401': 'C', '211702_H15522021601': 'C'},
3837: {'211700_H15498026501': 'G', 'stats': <vcf2distancematrix.BaseStats object at 0x40fbf90>, '211701_H15510030401': 'G', 'reference': 'T', '211702_H15522021601': 'G'},
4140: {'211700_H15498026501': 'A', 'stats': <vcf2distancematrix.BaseStats object at 0x40fb790>, '211701_H15510030401': 'A', 'reference': 'G', '211702_H15522021601': 'A'}})}
- dArgs: dict
input parameter dictionary as created by get_args()
Returns: - call to get_sample_pair_densities with parameters unpacked
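The core of a distance matrix over a structure like avail_pos can be sketched as a pairwise count of differing positions. This is a simplified illustration under stated assumptions: a plain dict stands in for the FastRBTree, the 'stats' entries are omitted, and neither recombination removal nor normalisation is attempted. The function name `pairwise_differences` is hypothetical.

```python
from itertools import combinations


def pairwise_differences(sample_names, positions):
    """Count positions at which each pair of samples carries different bases.

    `positions` maps position -> {sample name: base}.
    """
    dist = {name: {other: 0 for other in sample_names} for name in sample_names}
    for calls in positions.values():
        for s1, s2 in combinations(sample_names, 2):
            if calls[s1] != calls[s2]:
                dist[s1][s2] += 1
                dist[s2][s1] += 1  # keep the matrix symmetric
    return dist


positions = {
    2329: {"reference": "A", "211700_H15498026501": "C",
           "211701_H15510030401": "C", "211702_H15522021601": "C"},
    3837: {"reference": "T", "211700_H15498026501": "G",
           "211701_H15510030401": "G", "211702_H15522021601": "G"},
    4140: {"reference": "G", "211700_H15498026501": "A",
           "211701_H15510030401": "A", "211702_H15522021601": "A"},
}
names = ["reference", "211700_H15498026501",
         "211701_H15510030401", "211702_H15522021601"]

dist = pairwise_differences(names, positions)
print(dist["reference"]["211700_H15498026501"])            # 3
print(dist["211700_H15498026501"]["211701_H15510030401"])  # 0
```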
-
get_ref_freqs(ref, len_only=False)
Get the length of the reference genome and, optionally, the nucleotide frequencies in it.
Parameters: - ref: str
reference genome filename
- len_only: boolean
get genome length only [default False: also get nucleotide frequencies]
Returns: - (dRefFreq, flGenLen): tuple
- dRefFreq: dict
dRefFreq = {‘A’: 0.25, ‘C’: 0.24, ‘G’: 0.26, ‘T’: 0.25}
- flGenLen: float
genome length
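A minimal sketch of this kind of calculation is shown below. It reads FASTA text from an open handle rather than a filename so that it is self-contained. The function name `reference_frequencies`, the handling of non-ACGT symbols, and returning `None` for the frequencies when `len_only` is set are all assumptions.

```python
import io
from collections import Counter


def reference_frequencies(handle, len_only=False):
    """Return (frequencies, genome_length) for FASTA text read from `handle`.

    Frequencies are computed over A, C, G and T only; other symbols count
    towards the length but not towards the frequencies (an assumption).
    """
    counts = Counter()
    for line in handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        counts.update(line.upper())
    genome_length = float(sum(counts.values()))
    if len_only:
        return None, genome_length
    acgt_total = sum(counts[base] for base in "ACGT")
    freqs = {base: counts[base] / acgt_total for base in "ACGT"}
    return freqs, genome_length


freqs, length = reference_frequencies(io.StringIO(">contig_1\nAACC\nGGTT\n"))
print(freqs)   # {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
print(length)  # 8.0
```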
-
get_sample_pair_densities(sample_1, sample_2, oBT, flGenLen)
Calculate the number of differences in a window of a given size around each difference for a given pair of samples.
Parameters: - sample_1: str
name of sample 1
- sample_2: str
name of sample 2
- oBT: obj
bintree object that contains all information for all available positions for a given contig
- flGenLen: float
reference genome length
Returns: - (diffs, d): tuple
- diffs: int
total number of differences between the pair
- d: dict
dict with position of difference as key and number of differences in window around it as value
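The windowed count can be sketched with a sorted list of difference positions and a binary search. This is an illustration only, with several assumptions: a plain dict replaces the bintree, the window size of 1000 is arbitrary, the window is centred on each difference and includes that difference itself, and the genome length (for example for wrap-around on circular genomes) is ignored. The name `pair_densities` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def pair_densities(sample_1, sample_2, positions, window=1000):
    """Return (number of differences, {position: differences in window})."""
    diff_positions = sorted(
        pos for pos, calls in positions.items()
        if calls[sample_1] != calls[sample_2]
    )
    half = window // 2
    density = {}
    for pos in diff_positions:
        # Indices of the first and last difference inside [pos - half, pos + half].
        lo = bisect_left(diff_positions, pos - half)
        hi = bisect_right(diff_positions, pos + half)
        density[pos] = hi - lo
    return len(diff_positions), density


positions = {
    55959: {"s1": "A", "s2": "C"},
    56617: {"s1": "G", "s2": "T"},
    100000: {"s1": "A", "s2": "A"},  # identical call, not a difference
    157165: {"s1": "C", "s2": "T"},
    279950: {"s1": "A", "s2": "G"},
    279957: {"s1": "T", "s2": "C"},
    279959: {"s1": "G", "s2": "A"},
}

diffs, d = pair_densities("s1", "s2", positions)
print(diffs)  # 6
print(d)      # {55959: 1, 56617: 1, 157165: 1, 279950: 3, 279957: 3, 279959: 3}
```

Isolated differences get a count of 1 and the three clustered ones a count of 3, which is the pattern that makes dense regions stand out.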
-
is_uncallable(record)
Is the Record uncallable? Currently the record is uncallable iff:
- GT field is ./.
- LowQual is in the filter.
Returns: - uncall: bool
True if any of the above items are true, False otherwise.
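The two conditions can be expressed as a short predicate. The sketch below uses a small stand-in record type rather than a real VCF record object, so the attribute names `genotype` and `filters` are hypothetical and only mirror the GT field and the filter list mentioned above.

```python
from collections import namedtuple

# Stand-in for a VCF record; attribute names are illustrative assumptions.
RecordStub = namedtuple("RecordStub", ["genotype", "filters"])


def is_uncallable_sketch(record):
    """True if the genotype is ./. or LowQual is among the filters."""
    no_call = record.genotype == "./."
    low_qual = "LowQual" in (record.filters or [])
    return no_call or low_qual


print(is_uncallable_sketch(RecordStub("./.", [])))           # True
print(is_uncallable_sketch(RecordStub("1/1", ["LowQual"])))  # True
print(is_uncallable_sketch(RecordStub("1/1", None)))         # False
```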
-
normalise_jc69(d, ref, names)
Normalise distance matrix according to the Jukes-Cantor distance model.
See: Nei and Zhang: Evolutionary Distance: Estimation, ENCYCLOPEDIA OF LIFE SCIENCES 2005, doi: 10.1038/npg.els.0005108, http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equation 7
Parameters: - d: dict
distance matrix
- ref: str
reference genome file name
- names: list
list of sample names
Returns: - d: dict
normalised matrix
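For orientation, the Jukes-Cantor correction maps a proportion p of differing sites to a distance d = -(3/4) ln(1 - (4/3) p). The sketch below applies it to a matrix of raw difference counts. It is not the phe.utils implementation: the function names are hypothetical, the genome length is passed in directly instead of being read from the reference file, and the result is left as substitutions per site, which may differ from how the library scales its output.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def normalise_counts_jc69(counts, genome_length):
    """Apply the JC69 correction to every entry of a nested count dict."""
    return {
        s1: {s2: jc69_distance(n / genome_length) for s2, n in row.items()}
        for s1, row in counts.items()
    }


counts = {"a": {"a": 0, "b": 100}, "b": {"a": 100, "b": 0}}
corrected = normalise_counts_jc69(counts, genome_length=1000.0)
print(round(corrected["a"]["b"], 4))  # 0.1073
```

The corrected value is slightly larger than the raw proportion 0.1 because the model accounts for multiple substitutions at the same site.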
-
normalise_k80(d, ref, names)
Normalise distance matrix according to the Kimura 2-parameter distance model.
See: Nei and Zhang: Evolutionary Distance: Estimation, ENCYCLOPEDIA OF LIFE SCIENCES 2005, doi: 10.1038/npg.els.0005108, http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equation 9
Parameters: - d: dict
distance matrix
- ref: str
reference genome file name
- names: list
list of sample names
Returns: - d: dict
normalised matrix
-
normalise_t93(d, ref, names)
Normalise distance matrix according to the Tamura 3-parameter distance model.
See: Nei and Zhang: Evolutionary Distance: Estimation, ENCYCLOPEDIA OF LIFE SCIENCES 2005, doi: 10.1038/npg.els.0005108, http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equation 16
Parameters: - d: dict
distance matrix
- ref: str
reference genome file name
- names: list
list of sample names
Returns: - d: dict
normalised matrix
-
normalise_tn84(d, ref, names)
Normalise distance matrix according to the Tajima-Nei distance model.
See: Nei and Zhang: Evolutionary Distance: Estimation, ENCYCLOPEDIA OF LIFE SCIENCES 2005, doi: 10.1038/npg.els.0005108, http://www.umich.edu/~zhanglab/publications/2003/a0005108.pdf, equations 13 and 14
Parameters: - d: dict
distance matrix, e.g.
d = {'211701_H15510030401': {'211700_H15498026501': {
'A': {'A': 1152.0, 'C': 114.0, 'G': 545.0, 'T': 35.0},
'C': {'A': 0.0, 'C': 1233.0, 'G': 108.0, 'T': 467.0},
'G': {'A': 0.0, 'C': 0.0, 'G': 1283.0, 'T': 100.0},
'T': {'A': 0.0, 'C': 0.0, 'G': 0.0, 'T': 1177.0}}, …}, …}
- ref: str
reference genome file name
- names: list
list of sample names
Returns: - d: dict
normalised matrix
-
parse_vcf_files(dArgs, avail_pos, aSampleNames)
Parse VCF files to data structure.
Parameters: - dArgs: dict
input parameter dictionary as created by get_args()
- avail_pos: dict
dict of bintrees for each contig
- aSampleNames: list
list of sample names
Returns: - 0
also writes all data to avail_pos
-
parse_wg_alignment(dArgs, avail_pos, aSampleNames)
Parse alignment to data structure.
Parameters: - dArgs: dict
input parameter dictionary as created by get_args()
- avail_pos: dict
dict of bintrees for each contig
- aSampleNames: list
list of sample names
Returns: - 0
also writes all data to avail_pos
-
precompute_snp_densities(avail_pos, sample_names, args)
Precompute the number of differences around each difference between each pair of samples.
Parameters: - avail_pos: dict
data structure that contains the information on all available positions, like this:
{'gi|194097589|ref|NC_011035.1|': FastRBTree({
2329: {'stats': <vcf2distancematrix.BaseStats object at 0x40fb590>, 'reference': 'A', '211700_H15498026501': 'C', '211701_H15510030401': 'C', '211702_H15522021601': 'C'},
3837: {'211700_H15498026501': 'G', 'stats': <vcf2distancematrix.BaseStats object at 0x40fbf90>, '211701_H15510030401': 'G', 'reference': 'T', '211702_H15522021601': 'G'},
4140: {'211700_H15498026501': 'A', 'stats': <vcf2distancematrix.BaseStats object at 0x40fb790>, '211701_H15510030401': 'A', 'reference': 'G', '211702_H15522021601': 'A'}})}
- sample_names: list
list of sample names
- args: dict
input parameter dictionary as created by get_args()
Returns: - dDen: dict
contains the differences between a pair in a window of given size around each difference of the pair:
{'diffs': {'187534_H153520399-1': {'187534_H153520399-1': 0, '187536_H154060132-1': 1609, '189918_H154320283-2': 295, '205683_H15352039901': 0, '211698_H15464036401': 298, '211700_H15498026501': 298, '211701_H15510030401': 1621, '211702_H15522021601': 297, '211703_H15534021301': 1632, 'reference': 4045},
'187536_H154060132-1': {'187536_H154060132-1': 0, '205683_H15352039901': 1605, '211698_H15464036401': 1353, '211701_H15510030401': 1, '211702_H15522021601': 1351, '211703_H15534021301': 7, 'reference': 5041} …, },
'gi|194097589|ref|NC_011035.1|': {'187534_H153520399-1': {'187536_H154060132-1': {55959: 1, 56617: 1, 157165: 1, 279950: 3, 279957: 3, 279959: 3, 608494: 22, 608537: 23, 608551: 23, 608604: 23, 608617: 24, …,}
'189918_H154320283-2': {27696: 1, 55959: 1, 56617: 2, 56695: 2, 279950: 3, 279957: 3, 279959: 3, 520610: 1, 608494: 22, …,