Approximate matching - Department of Computer Science
Approximate matching
Ben Langmead
Department of Computer Science
You are free to use these slides. If you do, please sign the guestbook (teaching-materials), or email me (ben.langmead@) and tell me brie y how you're using them. For original Keynote les, email me.
Read alignment requires approximate matching
Read
CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
Reference
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA GCAATACACTGACCCGCTCAAACTCCTGGATTTTGGATCCACCCAGCGCCTTGGCCTAAA CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT CGTAACCTCAAACTCCTGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAAG AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT
Sequence differences occur because of...
1. Sequencing error 2. Natural variation
Approximate string matching
Looking for places where a P matches T with up to a certain number of mismatches or edits. Each such place is an approximate match.
A mismatch is a single-character substitution:
T: G G A A A A A G A G G T A G C G G C G T T T A A C A G T A G
P:
||| ||||| GTAACGGCG
An edit is a single-character substitution or gap (insertion or deletion):
T: G G A A A A A G A G G T A G C G G C G T T T A A C A G T A G
P:
||| ||||| GTAACGGCG
Gap in T
T: G G A A A A A G A G G T A G C - G C G T T T A A C A G T A G
P:
||||| ||| GTAGCGGCG
T: G G A A A A A G A G G T A G C G G C G T T T A A C A G T A G
P:
|| |||||| GT-GCGGCG
Gap in P
Hamming and edit distance
For two same-length strings X and Y, hamming distance is the minimum number of single-character substitutions needed to turn one into the other:
X: G A G G T A G C G G C G T T T A A C
| |||| ||| |||||||
Y: G T G G T A A C G G G G T T T A A C
Hamming distance = 3
Edit distance (Levenshtein distance): minimum number of edits required to turn one into the other:
X: T G G C C G C G C A A A A A C A G C
|| |||||||||| ||||
Y: T G A C C G C G C A A A A C- AC GA CG C
Edit distance = 2
X: G C G T A T G C G G C T A A- CA GC CG C
|| |||||||||| |||
Y: G C T- AT TA GT CG GC G CG TC AT TA AT CA GC CG C
Edit distance = 2
Approximate string matching
Adapting the naive algorithm to do approximate string matching within con gurable Hamming distance:
def
naiveApproximate(p,
t,
maxHammingDistance=1):
occurrences
=
[]
for
i
in
xrange(0,
len(t)
--
len(p)
+
1):
#
for
all
alignments
nmm
=
0
for
j
in
xrange(0,
len(p)):
#
for
all
characters
if
t[i+j]
!=
p[j]:
#
does
it
match?
nmm
+=
1
#
mismatch
if
nmm
>
maxHammingDistance:
break
#
exceeded
maximum
distance
if
nmm
................
................
In order to avoid copyright disputes, this page is only a partial summary.
To fulfill the demand for quickly locating and searching documents.
It is intelligent file search solution for home and business.
Related download
- string manipulation in python renan moura
- pexpect documentation read the docs
- python regular expressions picone press
- python xml unittest documentation read the docs
- strings and pattern matching purdue university
- partial match retrieval using indexed descriptor files
- flowstring partial streamline matching using shape invariant
- partial string matching algorithm ijert
- on hash coding algorithms for partial match retrieval
- ensemble prediction by partial matching byron knoll
Related searches
- igcse computer science workbooks pdf
- igcse computer science workbook
- list of computer science topics
- benefits of computer science degree
- history of computer science pdf
- fundamentals of computer science pdf
- benefits of computer science career
- benefits of computer science education
- doctor of computer science salary
- examples of computer science math
- list of computer science journals
- computer science projects for science fair