2D GRAPHICAL REPRESENTATION OF DNA SEQUENCE BASED ON HORIZON LINES FROM A PROBABILISTIC VIEW

In this study, we propose a new two-dimensional graphical representation of DNA sequences based on a choice of four horizon lines. The 2D representation is constructed in a probabilistic framework. Following the new approach, we perform a similarity analysis among the coding sequences of the first exon of the beta-globin gene from eleven species. Our results coincide with current biological analyses. We also compare our method with some existing DNA sequence comparison algorithms and find that ours is more intuitive and effective.


INTRODUCTION
With the rapid development of sequencing technology, more and more DNA data have been acquired. Analyzing DNA sequences quickly and effectively has become a major challenge for scientists. One important step is to represent a DNA sequence graphically in a way that preserves as much of the information in the primary data as possible. Many biologists, computer scientists and mathematicians have applied computational tools to represent biological sequences such as DNA, RNA and protein. Among these tools, graphical representation provides a simple and efficient means of visualization, and it has been used to compare biological sequences numerically.
More than thirty years ago, Hamori and Ruskin (HAMORI; RUSKIN, 1983; HAMORI, 1985) introduced the H-curve, the first 3D graphical representation of a DNA sequence. Since then, many other multi-dimensional graphical representations of DNA sequences have followed (NANDY et al., 2006; JIN et al., 2017), including Z-curves (ZHANG; ZHANG, 1994), 4D graphical representations (CHI; LIAO et al., 2005; TANG et al., 2010) and a 6D model (LIAO; WANG, 2004). In particular, Gates (1985), Nandy (1994) and Leong and Morgenthaler (1995) mapped a DNA sequence to a random walk in the (x, y) plane, using four unit vectors along the axis directions to represent the four bases. However, those representations may suffer from degeneracy and loss of information. To overcome these two problems and to analyze genes, many other 2D or 3D representations have been studied (GUO et al., 2001; RANDIC et al., 2003a; b; YAU et al., 2003; YAO et al., 2005; ZHANG et al., 2005; FAN, 2007; CAO et al., 2008; YU et al., 2009; ZHANG, 2009; CAO et al., 2010; XIE; MO, 2011; WANG, 2012; YU et al., 2013; YU et al., 2014; ZHANG et al., 2014; ZOU et al., 2014). Specifically, Randic et al. (2003a) introduced a 2D graphical representation of DNA sequences based on four horizontal lines and performed similarity analysis using a proposed L/L matrix. Furthermore, probabilistic methods have been applied with the help of graphical representations to compare DNA sequences. Motivated by these works, we assign each nucleotide of a DNA sequence to a point on one of the horizon lines. Then, using this representation in a probabilistic framework as a new descriptor, we study the similarity/dissimilarity of the first exon of the beta-globin gene of eleven species.

MATERIAL AND METHODS
Yu et al. proposed a 2D graphical representation of a DNA sequence by defining a probability distribution on it, such that each nucleotide of the sequence is assigned a number as its probability and the probabilities sum to one. In their setting, the same nucleotide may have different probabilities, depending on its location in the DNA sequence. In this section, we explain a 2D probabilistic representation based on horizon lines. In contrast to the representation in Yu et al.'s work, each nucleotide is indicated by a fixed number that does not depend on its position in the sequence.
Here A and T correspond to the same number up to a sign, which distinguishes them in the graph, and likewise C and G, since A-T and C-G are the two base pairs. As in (RANDIC et al., 2003a), we use four horizon lines parallel to the x-axis, with y values 0.3, -0.3, 0.2 and -0.2 respectively. Each nucleotide of a DNA sequence is mapped to a point on one of these lines, and the representation curve is obtained by connecting these points one by one. For example, the representation of the sequence TGCAC is given in Table 1, and the corresponding graph is drawn in Figure 1. This representation has neither loss of information nor degeneracy; that is, the curve has no circuits and the mapping between DNA sequences and representation curves is one-to-one (YAU et al., 2003).
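The mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text gives the four line heights and the sign symmetry within each base pair, but not which base sits on which line, so the dictionary `BASE_TO_Y` is a hypothetical assignment. Taking x as the 1-based position of the nucleotide is also an assumption, and the function names `to_curve` and `from_curve` are introduced here for illustration only.

```python
# Hypothetical assignment: the text states only that A/T and C/G share a
# magnitude with opposite signs and that the lines lie at
# y = 0.3, -0.3, 0.2, -0.2. Which base goes on which line is assumed here.
BASE_TO_Y = {"A": 0.3, "T": -0.3, "C": 0.2, "G": -0.2}


def to_curve(sequence, base_to_y=BASE_TO_Y):
    """Map a DNA sequence to the points (i, y_i) of its zigzag curve.

    x is taken to be the 1-based position of the nucleotide (an assumption).
    """
    points = []
    for i, base in enumerate(sequence.upper(), start=1):
        if base not in base_to_y:
            raise ValueError(f"unexpected symbol {base!r} at position {i}")
        points.append((i, base_to_y[base]))
    return points


def from_curve(points, base_to_y=BASE_TO_Y):
    """Recover the sequence from its curve (the mapping is one-to-one)."""
    y_to_base = {y: base for base, y in base_to_y.items()}
    return "".join(y_to_base[y] for _, y in points)
```

Because the four heights are distinct, every point identifies its base, so `from_curve(to_curve("TGCAC"))` returns `"TGCAC"`. This is the sense in which the representation loses no information.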

RESULTS
Based on the representation of a DNA sequence, Randic et al. (2003a) used matrices to provide numerical characterizations of DNA sequences, which can be used for similarity analysis. They proposed an L/L matrix to characterize a DNA sequence by 12-component vectors, and obtained the similarity result by computing the Euclidean length of the difference of the vectors. This technique of mapping a zigzag curve to a vector is very effective for DNA similarity analysis. In this section, we characterize a sequence as a 4D vector. For two DNA sequences, we likewise compute the Euclidean length of the difference of the two 4D vectors derived from them, which reflects the similarity/dissimilarity between them.
For a DNA sequence of length n, we have a corresponding zigzag curve based on the assignment of bases described in Table 1. Let $(x_i, y_i)$ be the point corresponding to the i-th nucleotide of the sequence. We then define the elements of a matrix E involving the y-coordinates only. The matrix E is symmetric, so its eigenvalues are real and a maximal eigenvalue exists. For a DNA sequence, we obtain four different graphs by interchanging the values assigned to bases A and T, and similarly those assigned to bases C and G. These four curves are symmetric with respect to the x-axis. For each choice of assignment for the four nucleotides, we get a number by taking the maximal eigenvalue of E. We thus obtain four numbers $d_1, d_2, d_3, d_4$, and hence a 4D vector $\vec{P} = (d_1, d_2, d_3, d_4)$.
If two sequences have the descriptors $\vec{P}_1$ and $\vec{P}_2$ respectively, then the similarity index between them is defined as the Euclidean distance between these two vectors:
$$d = \lVert \vec{P}_1 - \vec{P}_2 \rVert$$
Thus a smaller d indicates that the two DNA sequences are more similar.
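The pipeline from sequence to 4D descriptor to similarity index can be sketched as follows. Several parts are assumptions rather than the method of this paper. The entries of E are not available in this text, so `standin_matrix` uses a placeholder, E[i][j] = |y_i - y_j|, chosen only because it is symmetric and depends on the y-coordinates alone. The magnitudes 0.3 for A/T and 0.2 for C/G are one possible reading of the four line heights. The maximal eigenvalue is approximated by a shifted power iteration to keep the sketch dependency-free; a library routine such as `numpy.linalg.eigvalsh` would normally be preferred. All function names are hypothetical.

```python
import math

# Assumed magnitudes for the two base pairs (one reading of 0.3/-0.3/0.2/-0.2).
A_T, C_G = 0.3, 0.2


def assignments(at=A_T, cg=C_G):
    """Four assignments obtained by swapping the values of A/T and of C/G."""
    return [
        {"A": sa * at, "T": -sa * at, "C": sc * cg, "G": -sc * cg}
        for sa in (1, -1)
        for sc in (1, -1)
    ]


def standin_matrix(ys):
    """Placeholder for E (NOT the paper's definition): E[i][j] = |y_i - y_j|."""
    return [[abs(yi - yj) for yj in ys] for yi in ys]


def max_eigenvalue(m, iters=500):
    """Approximate the largest eigenvalue of a symmetric matrix.

    Power iteration on m + shift*I, where shift is the largest absolute row
    sum, so the shifted matrix has no negative eigenvalues. The all-ones
    start vector is adequate for matrices with non-negative entries.
    """
    n = len(m)
    shift = max(sum(abs(v) for v in row) for row in m)
    vec = [1.0] * n
    for _ in range(iters):
        nxt = [
            sum(m[i][j] * vec[j] for j in range(n)) + shift * vec[i]
            for i in range(n)
        ]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # zero matrix: every eigenvalue is 0
            return 0.0
        vec = [v / norm for v in nxt]
    # Rayleigh quotient of the original (unshifted) matrix.
    return sum(vec[i] * m[i][j] * vec[j] for i in range(n) for j in range(n))


def descriptor(sequence, build=standin_matrix):
    """4D vector of maximal eigenvalues, one per assignment."""
    return [
        max_eigenvalue(build([amap[base] for base in sequence]))
        for amap in assignments()
    ]


def distance(p1, p2):
    """Similarity index: Euclidean distance between two descriptors."""
    return math.dist(p1, p2)
```

A call such as `distance(descriptor(seq_a), descriptor(seq_b))` then gives the index d for two sequences. With this placeholder matrix, assignments that differ only by a global sign flip produce the same eigenvalue, so the sketch shows the structure of the computation and should not be expected to reproduce the values in Table 2.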
Now we use this method to study the similarities among the coding sequences of the first exon of the beta-globin genes of eleven species, based on the GenBank data listed in (JIN et al., 2017). The result is shown in Table 2. From this table, we can see that (1) Human-Gorilla, Gorilla-Chimpanzee and Human-Chimpanzee are the most similar pairs. (2) Among the indices between Human and the other ten species, Human-Gorilla is the minimum, while Human-Gallus is the maximum. (3) Opossum, Lemur, Mouse, Rabbit, Rat, Gorilla, Bovine and Chimpanzee each attain their maximum index with Human or Gallus. To compare our method with others, we list several highly cited results on the similarity of Human to the other ten species in Table 3. As in (PENG; LIU, 2015), we normalize each index by the Human-Goat value so that the results can be compared more easily. From Table 3, most results indicate that Gorilla and Chimpanzee are more similar to Human than the other eight species are, which is consistent with our result.
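The normalization by the Human-Goat value amounts to dividing every Human-to-species index produced by one method by that method's Human-Goat index, so that Human-Goat becomes 1 in every column. A minimal sketch, assuming the indices of one method are held in a dictionary keyed by species name; the function name `normalize_by_reference` and the numbers in the usage note are illustrative only and are not taken from Table 3:

```python
def normalize_by_reference(indices, reference="Goat"):
    """Divide every index by the index of the reference species.

    `indices` maps a species name to its similarity index with Human for
    one method. After normalization the reference species has index 1.
    """
    ref = indices[reference]
    if ref == 0:
        raise ValueError("reference index is zero; cannot normalize")
    return {species: value / ref for species, value in indices.items()}
```

For example, with made-up values `{"Goat": 2.0, "Gorilla": 0.5}`, the call `normalize_by_reference({"Goat": 2.0, "Gorilla": 0.5})` returns `{"Goat": 1.0, "Gorilla": 0.25}`. Normalized columns from different methods can then be read side by side even when their raw indices are on different scales.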

DISCUSSION
Based on the fact that A-T and C-G are the two base pairs, we choose four horizon lines on which the points corresponding to a DNA sequence lie. We then use a new matrix E to analyze the similarity/dissimilarity among eleven species through the coding sequences of the first exon of the beta-globin gene. We use the same probability for a base even in different species, which differs from the methods used in (ZHANG; CHEN, 2011). Roughly speaking, they regard a DNA sequence as a union of disjoint events whose probabilities sum to one, while we regard a DNA sequence as a random outcome, that is, a sequence of probabilities. Moreover, we focus on the change of the probability along the sequence, and the matrix E involves only the y-coordinates, independent of the x-coordinates. Thus the expression of E is much simpler than the L/L matrix in (RANDIC et al., 2003a), so the computational cost is reduced and our method is faster. In Table 3, one can see that, by our method, (1) the index of Human-Gallus is the maximum; (2) the indices of Human-Gorilla and Human-Chimpanzee are the two smallest. In fact, (2) is consistent with the results in (RANDIC et al., 2003a; YU et al., 2009; XIE; MO, 2011), but (1) does not hold for some other methods, such as the results of (PENG; LIU, 2015). Among these eleven species, only Gallus is not a mammal, while Human, Gorilla and Chimpanzee belong to the Primates, so our result for Human-Gallus is the more convincing one. This indicates the power of our new method.
In this study, we develop a new method that combines geometrical and probabilistic information to analyze and compare DNA sequences. Similar work on protein sequences has also been proposed (YAU et al., 2008). Our approach can also be extended to other biological sequences such as protein sequences. For example, we may consider the 20 amino acids as 20 vectors instead of the 4 nucleotides as 4 vectors used here. Further studies are needed to decide which combination of 20 amino acid vectors is suitable for comparing protein sequences. Furthermore, our sequence-based method may be extended to study the two-dimensional structures of DNA or RNA, since sequence determines structure and structure determines function. Our study provides an intuitive and efficient tool for DNA sequence comparison, which will be used to study more biological data in the near future.