V

The Building of Multiple Alignment Using the Method of Iterative Dynamic Improvement of the Initial “Motif” Alignment

V.K. Nikolaev¹, A.M. Leontovich², V.A. Drachev², L.I. Brodsky^1,3
¹Gendalf Ltd., Moscow, Russia; E-mail: [email protected]
²Belozersky Institute of Physico-Chemical Biology,
Lomonosov Moscow State University,
Vorobievy gori, MSU, Moscow, 119899 Russia;
fax: (7-095) 939-31-81; E-mail: [email protected]
³ To whom the correspondence should be addressed.

In the article there were studied the optimality and the quality of restoration of the true alignment’s columns depending on initial alignment with it’s iterative dynamic improvement [1]. Even with weak similarity of the aligned sequences, iterative improvement of initial multiple alignment, built using the motive method [2], leads to alignment, close to the optimum one. Moreover it is important that the constructed alignment, by it’s “strong” positions, (with good internal homology) is a lot like the “true” one for the given group of sequences. When verification was done, as a “true” one either a biologically correct alignment, or the one, made using computer mutation of “ancestral” alignment, was taken. Iterative algorithm of dynamic improvement of motive alignment was realized as the program “Multiple Alignment” on the WWW server of Belozersky Institute MSU (http://www.genebee.msu.su).

KEYWORDS: alignment of sequences, "motif” alignment, dynamic programming, iterative method, WWW.

In the article [1] there were described methods of iterative dynamic improvement of initial alignment (see below #2). In it, as a “priming” (the initial alignment) was taken either a trivial alignment, for which all sequences were moved to the left, or an alignment, created by a cluster algorithm Clastal V [3].

In the article [2] we offered an algorithm of building the “motif” multiple alignment of biological sequences. This article states that the motif alignment acts like a good “priming” with iterative dynamic improvement of an alignment. Most likely, this happens because in the motif alignment there are “embryo” of sub-optimal positions present [4]. The use of iterative algorithm in case of motif alignment transforms these “embryos” to sup-optional positions, at the same time destroying the false homology columns positions, that compete with the sub-optional ones.

In the beginning we described the methods of motif alignment building and iterative dynamic improvement of the “priming”. As the article goes on, the work of the algorithm will be demonstrated using it as applied to “true” alignments restoration. As “true” alignments there were taken ether biologically-based ones, or alignments built using an artificial computer “mutation” process. There will be introduced values, describing the quality of “true” alignment’ restoration and there will be mentioned results, showing how this quality depends on the kind of the “true” alignment itself, and the chosen “priming”.

MATERIALS AND METHODS

Motif alignment. In the motif alignment method in the beginning there are “MOTIFS” [2]; each of these “motifs” consists of fragments of the same length (without gaps) from different sequences of source set. These fragment are similar and are put against each other. After that from these “motifs” is an alignment composed, which is a chain of matched (looking on the order of letters in the sequence) motifs, in between which there is a needed a number of gaps.

The quality of an alignment could be valued either by percent of homologies - by the sum of weights of residue exchange per one alignment position or so-called POWER [5,6] ( with of power definition, coming from probabilities theory arguments, the sum of weights is divided not by length, like in the case of homology percent, but by the square root of the length).

Method of iterative dynamic improvement of multiple alignment. In this article we are using the simplest method of iterative dynamic improvement of an alignment, that does the following [7]. Say there is a certain amount of sequences and there is a certain initial alignment (“priming”) combined out of those. The lines of the priming alignment (sequences) are taken by turns. On each step the next sequence is removed from an alignment and then, using the method of pair dynamic programming, this sequence is aligned against a part-that-is-left of the alignment. The resulted alignment is used in the next step. This is done until the quality of an alignment doesn’t increase, when going through all the lines of an alignment. Here the quality of an alignment is measured by two parameters: it’s power and the homology percent, so that on each step both parameters should increase: simultaneously.

In [1] more complicated variants of iterative dynamic improvement of an alignment are described. They are concerned with dividing a set of sequences of the initial alignment to two sub-sets at each step. These methods are much more labor-consuming. At the same time, even the simplest variant works satisfactorily in case of right-built priming.

“True” alignments. As it was mentioned before, in the restoration procedure as “true” alignments, the “biologically right” ones, as well as, ones got using modeling of mutation process were taken.

Biologically right alignments were considered the ones that either are considered of that sort by expert-biologists, or correspond with alignment of tertiary protein structures.

Artificial examples, on which there was this method tested, were built using the following computer “mutation” process. There was one sequence selected ( say it was selected randomly). Then this sequence was repeated several times. Thus the alignment was created. It was composed of a certain number of the same sequences. Then, with this identical alignment the mutation procedure was made. It is composed of the following: in some probabilities letter substitutions and deletions were made, deletion is substituting a letter with space. Altogether talking, these probabilities of substitutions depended on the column number and sequence number several of such dependencies were investigated. The important characteristic of the mutational process is it’s duration - the amount of mutations/deletions calculated per one position of an alignment.

The quality of restoration. As the main characteristic of restoration quality taken the following value was taken: standardized number of pairs of letters differences of the “true” and built alignments.

Except for this value, estimating the quality of restoration of the true alignment, we also took into an account how good the “sub-optimal strong” columns restored. “Strong columns” means that to these columns corresponds with a big sum of weights letter substitutions. “Sub-optimal” means the following [4]. We fix the given column and make the best possible alignment to the left of this column, that is those sequence parts are aligned, that are to the left of this column. The same is done to the right of this column. As a result we get a full alignment, containing the given column as a whole. (The approximation to this best alignment is iterative improvement of the corresponding “motif” alignment). After this a small vicinity of the studied column of the alignment is taken. Parts of an alignment, to the right and to the left of the vicinity are fixed, and the vicinity itself is improved by dynamic, iterative method (this is also an approximation of the best alignment within the vicinity). If the studied column was not destroyed (or almost saved), then this column is called sub-optimal.

The quality of restoration of an alignment, that was described before, we calculated not only by all the columns, but also separately by sub-optimal, strong columns (it seems, the last value is more accurate).

Dependence of the “true” alignment’s restoration from the initial alignment (“priming”). As it was said before, both the homology percent of the resulting alignment and the quality of the “true” alignment’s restoration with the procedure of iterative dynamic improvement strongly depend on the priming.

To check this dependence we used primings of several types. One of them - trivial priming alignment - move of all sequences to the least left position so that all gaps would be on the right. The second type of priming - “motif” aligning. The third type of priming is a result of the “true” alignment damage. It consists of the following: with certain possibilities the fragments of gaps of different length are included in or excluded out of the sequences, involved in the true alignment,. The smaller are these possibilities, the smaller is the “damage”.

RESULTS

The results of the work are shown on the tables 1-6 and pictures 1-2. They point out the restoration quality in the procedure of iterative dynamic improvement of an alignment for different kinds of priming.

Each table refers to definite series of computer experiments. Usually, series are made of 30 calculations of the same kind.

The results of computer mutations are shown on the tables 1-4.

On the tables 1-3 the scheme of mutational process is as following. There are 10 sequences of length 50 in a resulting alignment. All of the alignment positions (columns) were divided into two zones. First zone consists of the first 40 positions, the second one - of the last 10. In the first zone probability of mutations (that is letter changes) is equal to p₁, for the second one - p₂ , and probability of deletion (that is crossing out of letters) is, accordingly, q₁ and q₂. For each tables 1-3 there is a composition of probabilities p₁, p₂, q₁, q₂.

Each of the lines of the tables1-3 has its definite kind of priming. For the first line of the table the priming coincides with the true alignment. For the next 6 lines the priming is “damaged” true alignments. The “damage” consists of cutting out or adding , randomly, some amount of gaps. In the second line of each table all gaps are excluded(that is those are trivial alignments) , in the third line part of the gaps that were excluded is equal to about 0.66, in the forth- 0.33. In the lines 5, 6 and 7 gaps were included. The last line of the tables corresponds with the case when the priming was taken as the motive alignment.

Table 1. Results corresponding with 1-st pattern of mutations with parameters:

p₁ = 0.1, q₁ = 0.01, p₂ = 0.6, q₂ = 0.1

Priming kind	Pr₁	Pr₂	Pr₃	H₀	H₁	H₂	H₃	L₀	L₁
True	99.7	45.9	63.3	0.45	1.56	0.25	0.50	50.0	49.9
Schema 1	99.7	33.1	54.8	0.04	1.58	0.26	0.49	47.3	48.7
Schema 2	99.4	36.4	56.8	0.07	1.70	0.28	0.51	48.5	48.0
Schema 3	99.2	36.0	56.3	0.09	1.68	0.26	0.51	49.5	48.7
Schema 4	99.9	43.1	61.3	0.14	1.61	0.25	0.50	57.0	49.5
Schema 5	99.6	33.3	54.8	0.05	1.47	0.27	0.49	59.0	50.0
Schema 6	99.2	33.2	54.5	0.004	1.45	0.27	0.48	82.0	50.0
Motif	99.3	27.2	50.5	0.26	1.21	0.31	0.47	77.2	52.7

Note: Here and in all following tables the following signs are used:

Pr₁ = quality of restoration in strong zone ( zone 2 of table ), that is in percents.
Pr₂= quality of restoration in weak zone.
Pr₃ = quality of restoration in the whole alignment.
H₀ = percent of homology in the priming, that is measured at mean square deviations (in SD measure)
H₁= percent of homology in strong zone of the resulting alignment.
H₂ = percent of homology in weak zone of resulting alignment.
H₃ = percent of homology in the whole resulting alignment.
L₀ = the length of the priming.
L₁= the length of the resulting alignment.
Table 2. Results corresponding with 2-nd pattern of mutations with parameters:

p₁ = 0.2, q₁ = 0.015, p₂ = 0.7, q₂ = 0.15

Priming kind	Pr₁	Pr₂	Pr₃	H₀	H₁	H₂	H₃	L₀	L₁
True	92.3	25.9	51.4	0.27	1.00	0.27	0.37	50.0	46.8
Schema 1	90.8	22.2	48.7	-0.002	1.10	0.28	0.36	43.1	44.6
Schema 2	79.8	23.5	45.1	0.02	0.83	0.26	0.37	46.8	45.6
Schema 3	87.4	23.2	47.9	0.03	0.94	0.27	0.34	48.8	45.2
Schema 4	88.7	25.5	49.5	0.06	1.00	0.22	0.34	57.0	47.6
Schema 5	77.2	21.6	42.7	0.01	0.93	0.22	0.35	59.0	48.9
Schema 6	83.4	20.9	44.9	-0.002	0.88	0.27	0.35	82.0	46.6
Motif	88.4	13.7	42.4	0.18	0.87	0.25	0.36	78.7	52.2

Table 3. Results corresponding with 3-rd pattern of mutations with parameters:

p₁ = 0.3, q₁= 0.02, p₂= 0.8, q₂ = 0.2

Priming kind	Pr₁	Pr₂	Pr₃	H₀	H₁	H₂	H₃	L₀	L₁
True	71.3	21.9	44.6	0.15	0.63	0.23	0.27	50.0	44.6
Schema 1	36.6	14.4	24.3	-0.02	0.28	0.17	0.17	40.4	43.1
Schema 2	39.5	17.4	27.4	-0.004	0.20	0.20	0.20	45.8	44.7
Schema 3	49.9	18.6	32.7	0.001	0.28	0.23	0.23	48.1	44.3
Schema 4	61.1	99.7	38.3	0.02	0.45	0.21	0.24	57.0	45.7
Schema 5	43.3	15.6	27.7	0.001	0.28	0.20	0.22	59.0	48.4
Schema 6	51.9	16.0	32.4	-0.004	0.30	0.23	0.24	82.0	44.9
Motif	62.6	10.4	34.3	0.11	0.43	0.21	0.24	70.9	50.7

Tables 4, 5 and 6 consist of 3 lines , first line corresponds with the case of true alignment as the priming , second line - damaged true alignment ( extra gaps were included) , third line - when priming is a motive alignment.

For the table 4 the other scheme of mutational process with 4 zones, that are shown on the picture 1. Here the overall number of consequences is equal to 12. Each of the zones 1, 2, 3 consist of 4 consequences and 40 positions . Zone 4 consists of 10 last positions in each of the 12 consequences. Each zone is characterized by its composition of mutational parameters p_i, q_i = 1, 2, 3, 4, where p_i - possibility of mutation, and q_i - possibility of deletion ( see picture1 ).

Table 4. Results corresponding with 4-th pattern of mutations with parameters as at picture 1

Priming kind	Pr₁	Pr₂	Pr₃	H₀	H₁	H₂	H₃	L₀	L₁
True	99.7	28.0	70.4	0.33	1.57	0.09	0.37	50.0	49.7
Mutation	43.2	17.8	32.8	0.03	0.40	0.15	0.20	58.2	52.2
Motif	99.8	23.3	68.6	0.28	1.39	0.13	0.36	71.8	50.9

The composition shows that motive alignment as priming is the winner in the quality of the restoration of the true alignment.

Tables 5-6 are analogues to tables 1-3, however they relate to the accident, where as true alignments the biologically approved alignments were taken. In the table 5 that kind of alignment was constructed by biologist-expert. In the table 6 the true alignment corresponds with the alignment of tertiary structure of proteins.

Table 5. The true alignment defined by biologist-expert

Priming kind	Pr₁	Pr₂	Pr₃	H₀	H₁	H₂	H₃	L₀	L₁
True	84.9	60.3	64.5	0.50	0.88	0.48	0.55	337	348
Mutation	69.2	41.1	46.0	-0.001	0.69	0.49	0.52	387	353
Motif	70.2	38.6	44.0	0.26	0.75	0.46	0.51	384	369

Table 6. The true alignment corresponds with the alignment of tertiary structure of proteins

Priming kind	Pr₁	Pr₂	Pr₃	H₀	H₁	H₂	H₃	L₀	L₁
True	64.9	55.4	57.0	0.38	0.49	0.49	0.49	329	314
Mutation	73.6	50.9	54.7	0.01	0.66	0.46	0.48	379	310
Motif	72.3	46.5	50.9	0.23	0.68	0.47	0.49	309	309

Picture 1. Zones for mutational procedure, that corresponds with tables 4 .Here: p_i, q_i - possibilities of mutations and deletions in zone i.

On the pic. 2 the graphs that characterize the quality of restoration of the alignment, according to the quality of restoration of the columns of priming are shown.

Picture 2. Graphs that characterize the quality of restoration of the resulting alignment according to the quality of restoration of the priming’ columns (see paragraph 4). On the abscissa axis for all 3 graphs the quality of true alignment column restoration in priming is measured. On the ordinate axis of graph 2a the quality of resulting alignment, on the graph 2b - root-mean-square deviation, on graph 2c - functions of distribution of columns of the priming restoration quality. The solid line shows the case, when motif alignment is the priming, dotted line - one of the cases of damaged true alignment (when gaps are included).

Picture 2 corresponds with Table 3 .

The outcome of the work is as follows:

The methodology how to estimate the quality of restoration of the columns of the true alignment is proposed. This methodology can be used to compare different methods of alignment construction. At given work it was used for investigating the iterative dynamic improvement of the motive alignment.

From the results we can see that the usage of the priming motive alignment improves the quality of restoration. Especially this is true for the “good” columns of priming - those that do not strongly differ from the corresponding columns of the true alignment . This is characterized , for example, by the picture 2a. The graph there corresponds with the motive alignment. On the right part of the picture it is higher than the graph for damaged priming and even higher than the diagonal. Vice versa , on the left side of the picture ( that response to small level of correspondence of the columns of the priming and the true alignment) the quality of the restoration of the motive priming is lower then for the damaged. However this, from our point of view, is not a minus , because the method is not claiming to restore strongly destroyed columns.

It’s necessary to mention that the restoration of so called “suboptimal” columns of the true alignment [4] is better than also strong, but not “suboptimal” columns.

Authors highly appreciate to Dr. Alexander E. Gorbalenya for the help in choosing biological examples.

The work was supported by grant 95-07-19298 of Russian Foundation for Basic Research.

LITERATURE CITED

1. Hirosawa M., Totoki Y., Hoshida M., and Ishikawa M. (1995) CABIOS, 11, 13-18.

2. Brodsky L.I., Drachev A.L., Leontovich A.M., and Feranchuk S.I. (1993) Biosystems, 30, 65-79.

3. Higgins D.G., Bleasby A.J, and Fuchs R. (1992) Comput. Applic. Biosci., 8, 189-191.

4. Vingron M. (1996) Current Opinion in Structural Biology, 6, 346-352.

5. Leontovich A.M., Brodsky L.I., and Gorbalenya A.E (1991) Biopolimery I kletka, 7, 10-14 (in Russian).

6. Leontovich A.M., Brodsky L.I., and Gorbalenya A.E. (1993) Biosystems, 30, 57-63.

7. Barton J.G., and Sternberg M.J.E. (1987) J. Mol.Biol., 198, 327-337.