Cluster short, homogeneous strings (DNA) by common subpatterns and extract a consensus for each class

Objective:
Group a large pool of short DNA fragments into classes that share common subsequence patterns, and find the consensus sequence of each class.

  • Pool: approx. 300 sequence fragments
  • 8-20 letters per fragment
  • 4 possible letters: a, g, t, c
  • each fragment is structured in three regions:
    • 5 arbitrary letters (any of a, g, t, c)
    • 8 or more positions that are g or c
    • 5 arbitrary letters (any of a, g, t, c)
      (As a regular expression: [gcta]{5}[gc]{8,}[gcta]{5})
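
The three-region structure can be checked directly with the regular expression given above. This is a minimal sketch, not part of the original post: the function name `split_regions` and the example fragment are made up for illustration. Note that the pattern needs at least 18 letters, so any shorter fragment returns None.

```python
import re

# Pattern from the question: 5 letters, 8 or more g/c letters, 5 letters.
FRAGMENT = re.compile(r"([gcta]{5})([gc]{8,})([gcta]{5})")

def split_regions(fragment):
    """Return (left, middle, right), or None if the fragment does not fit."""
    match = FRAGMENT.fullmatch(fragment.lower())
    return match.groups() if match else None

# Hypothetical fragment, not taken from the real pool.
example = "atgcaggccggccttagc"
print(split_regions(example))
```

Capturing the middle group separately makes it easy to restrict any later comparison to region 2 only.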

Plan:
Perform a multiple alignment (e.g. with ClustalW2) to find classes that share common sequences in region 2, together with their consensus sequences.
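
Once a class of sequences has been aligned, its consensus can be read off column by column. The sketch below is only an illustration of that last step and is not ClustalW2: it assumes the region-2 sequences are already aligned to equal length, and the function name `consensus` and the sample class are hypothetical.

```python
from collections import Counter

def consensus(aligned):
    """Majority letter per column of equal-length, already aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*aligned) yields one tuple of letters per alignment column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

# Hypothetical members of one class (region 2 only).
group = ["ggccggcc", "ggccgccc", "gcccggcc"]
print(consensus(group))
```

A plain majority vote ignores gaps and ties; a real alignment tool reports those explicitly.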

Questions:

  • ?
  • 2 , , ?
  • ?

,

+5
2

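300 is FAR TOO FEW for 8-mers. There are 65 536 possible 8-mers, against 3 000 000 000 bases. Restricted to G/C, that gives 3 000 000 000 / 65 536 * 2^8 = ~12 000 000 (a rough figure; CpG content, for one, will shift it). Why 300?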
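
The figures in this estimate can be checked with a few lines of arithmetic. The sketch assumes, as the calculation itself seems to, that all 8-mers are equally likely; the variable names are made up for illustration.

```python
GENOME_BASES = 3_000_000_000  # total bases, as quoted in the answer
K = 8                         # k-mer length

all_kmers = 4 ** K            # every 8-mer over a, g, t, c
gc_kmers = 2 ** K             # 8-mers made of g and c only

# Expected number of positions holding a G/C-only 8-mer,
# assuming every 8-mer is equally likely (a simplification).
expected_hits = GENOME_BASES / all_kmers * gc_kmers

print(all_kmers, gc_kmers, round(expected_hits))
```

The result, 11 718 750, matches the rounded ~12 000 000 in the text.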

. 1, CG GC , -G--C. , ( ). .

Clustal - , . GC, :

  • G/C 8-mers (2^8 = 256 possible).
  • GC- , , 8- .
  • GC- .
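
One possible reading of the list above is to enumerate all 256 G/C 8-mers and count how many fragments contain each one. The following is a sketch under that assumption, not the answer's own code; `count_gc_8mers` and the sample fragments are hypothetical.

```python
from collections import Counter
from itertools import product

# All 8-mers over the two letters g and c: 2**8 = 256 of them.
GC_8MERS = ["".join(p) for p in product("gc", repeat=8)]

def count_gc_8mers(fragments):
    """Count, for each G/C-only 8-mer, how many fragments contain it."""
    counts = Counter()
    for frag in fragments:
        frag = frag.lower()
        # Collect distinct 8-letter windows so each fragment counts once per 8-mer.
        windows = {frag[i:i + 8] for i in range(len(frag) - 7)}
        counts.update(w for w in windows if set(w) <= {"g", "c"})
    return counts

# Hypothetical fragments, not taken from the real pool.
pool = ["atgcaggccggccttagc", "ttttaggccggccaaaaa"]
print(count_gc_8mers(pool).most_common(3))
```

Fragments that share a frequent 8-mer are natural candidates for the same class, and the 8-mer itself serves as a first guess at the consensus.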

8-mer , . .

+2

, , , (, ).

+1
