In Silico Analysis of Structural Photosynthetic Genes of Arabidopsis thaliana for Unique Mirror Repeats

Objectives: This work explores mirror sequences in the photosynthetic genes of Arabidopsis thaliana. The origin, distribution and function of these sequences in plants remain largely unexplored. Methods: FPCB, a recently developed bioinformatics approach, was utilized for identification of mirror sequences. It is a three-step strategy based on pattern matching of the alignments produced by aligning a gene sequence with its complement using megaBLAST. The approach was quick and efficient enough to characterize a range of mirror sequences. Findings: All the analyzed genes were found to harbor a great variety of mirror sequences at high frequencies. The LHCA1 gene has the highest total count of these sequences and the ATPB gene has the lowest. A total of 401 unique mirror sequences of different lengths and compositions were found in the twelve selected genes. Promoter regions were found to be greatly enriched with these repeats. Eleven mirror sequences of significant length were also identified using the above approach. Novelty: This work is the very first attempt to characterize photosynthetic genes of Arabidopsis thaliana for mirror repeats. It should further stimulate efforts to develop fingerprinting techniques based on these unique mirror sequences, which are powerful tools to study taxonomic and evolutionary relationships. Mirror sequences are also potential candidates for drug delivery systems and molecular medicine.


Introduction
The DNA code is the template that directs the organization and function of a complete organism by directing the synthesis of a range of macromolecules (1,2). Repetitive DNA sequences gradually accumulate in genomes during the course of evolution because of non-repairable variations arising from events like mutations, slipped strand mispairing and unequal crossing over (3,4). The term repetitive DNA sequences collectively denotes the DNA fragments that are present in multiple copies in a genome (5). Despite their conspicuous presence, they were regarded as junk for a while, but they now have well-established roles in a number of basic genetic processes (6)(7)(8). Eukaryotic genomes are differentially enriched in a variety of repetitive DNA sequences (9). When these repetitive elements are clustered together at nearly the same location, they are termed tandem repeats; if they are dispersed throughout the genome at irregular intervals, they are termed dispersed repeats. Dispersed repeats, being more common, can be further classified as LINEs (Long Interspersed Elements, with length ≈1-7 kbp) and SINEs (Short Interspersed Elements, with length ≈100-400 bp) (10,11).
If one considers the symmetry or arrangement of nucleotides as a classifying criterion, three prominent types of DNA repeats can be observed, namely direct, inverted and mirror repeats. When a simple core repeat unit is duplicated downstream without interruption, it represents a direct repeat. When a genomic segment is duplicated downstream but on the opposite strand, it represents an inverted repeat (12). According to Mirkin and his coworkers, mirror repeats are DNA segments in which the bases that are equidistant from the symmetry center are identical to each other.
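The difference between mirror and inverted symmetry can be made concrete with a minimal Python sketch. The function names and the two example sequences are illustrative assumptions, not taken from the cited works, and only perfect repeats (no interrupting nucleotides) are considered.

```python
# Illustrative helpers; names and example sequences are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_mirror_repeat(seq: str) -> bool:
    """True if bases equidistant from the centre are identical on ONE strand."""
    seq = seq.upper()
    return seq == seq[::-1]


def is_inverted_repeat(seq: str) -> bool:
    """True if the sequence equals its own reverse complement (a palindrome)."""
    seq = seq.upper()
    return seq == seq.translate(COMPLEMENT)[::-1]


for example in ("AGGAAGGA", "GAATTC"):
    print(example, is_mirror_repeat(example), is_inverted_repeat(example))
```

The first test compares a strand with itself read backwards, whereas the second compares it with the opposite strand, which is why a sequence can satisfy one definition without satisfying the other.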
All of these symmetrical DNA repeats are capable of adopting a wide range of alternative, or non-B, DNA conformations (13). At inverted repeat motifs, cruciform and hairpin-like structures are formed, depending upon the DNA characteristics (14). Direct repeats, being the simplest ones, are associated with the formation of a diversity of alternative structures such as Z-DNA at sequences of alternating purines and pyrimidines, slippage structures, H-DNA and G-quartets, under varying compositions and topological conditions (15).
Homo-purine:homo-pyrimidine rich DNA tracts with mirror symmetry (or H-palindromes) are reported to form a significant triplex structure known as H-DNA in the negatively supercoiled state (16). H-palindromes are over-represented in eukaryotic genomes but occur at chance frequencies in prokaryotic genomes (9). In the light of earlier investigations, it was hypothesized that H-DNA could act as a molecular switch to modulate gene expression and that many cellular proteins might stabilize it specifically (17). These structures are known to act as regulatory elements in transcription, site-specific mutagenesis or recombination, replication and other DNA-metabolism events in a structure-dependent manner (18). They also act as mutational hotspots, and repeat-induced mutagenesis is one of the well-exploited mechanisms in the evolution of species (19,20).
The above lines of evidence support the notion that H-DNA structures are functionally important to the genome and are hence maintained in it by positive selection pressure during evolution. However, the genomic instability they cause in coding and regulatory regions results in the development of a number of disorders such as Autosomal Dominant Polycystic Kidney Disease (ADPKD), Tuberous Sclerosis Complex (TSC), Lymphangioleiomyomatosis (LAM), Friedreich's ataxia, Follicular Lymphoma and Hereditary Persistence of Fetal Hemoglobin (HPFH) (21)(22)(23). Hence, it is tempting to speculate about their origin and function in the genome. As a first step, however, the mirror sequence motifs that can form H-DNA need to be identified thoroughly in genomes. This can also help in understanding the molecular mechanisms that result in disease development in human beings.
DNA tracts with mirror symmetry have been reported in many eukaryotic species ranging from yeast to humans (24,25), in the chloroplast genome of Nicotiana tabacum plants (26)(27)(28) and recently in the deadly SARS-CoV-2 viral strain (29). Our previous work has also reported the presence of mirror repeats in the flowering genes of Arabidopsis thaliana (30) and in the genome of HIV (31). Together, these reports point to their ubiquitous presence across eukaryotic kingdoms and in their parasites as well. The functional aspect of H-DNA has still not been investigated in plants. This study is the first attempt to characterize mirror repeats in the photosynthetic genes of a plant. For this purpose, Arabidopsis thaliana, the most widely used model plant for genetics and molecular bioinformatics, was chosen. It is well suited to this study since it has a small genome that is completely sequenced. FPCB (Fasta-Parallel Complement-BLAST), a simple, accurate and swift manual bioinformatics-based approach, was deployed to carry out the present investigation. It is a three-step strategy based on pattern matching of the alignments produced by aligning a gene sequence with its complement using megaBLAST, to extract mirror repeats (32). This work should prove significant for further study of the function and evolution of mirror repeats in plants and for the development of a range of fingerprinting technologies based on them. These may serve as key tools in taxonomic and evolutionary studies. If it becomes possible in the near future to selectively modulate site-specific gene expression and H-DNA formation through these sequences, this would be very significant for molecular medicine and drug-delivery systems.

Material & Methods
A total of 12 structural genes involved in the formation of the major subunits of the photosynthetic apparatus in Arabidopsis thaliana were selected, with at least one from each complex (PSII, PSI, ATP synthase, Cytochrome complex) (33).

[Table of selected genes; column headings: S.N., Gene, Symbol]

The gene sequences listed above were analyzed for the presence of mirror repeats using the FPCB approach. Originally developed by Vikash et al, FPCB stands for Fasta-Parallel Complement-BLAST. It is a simple three-step bioinformatics-based strategy for extracting both short and long mirror sequences through a pattern matching algorithm. The strategy is based on the principle that during parallel DNA PCR with a parallel primer in the reaction mixture, the resulting product is the original template DNA but with a reversed orientation. Following this basic idea, if the original gene sequence is aligned with its parallel complement using BLAST analysis, many of the alignments produced show mirror symmetry (34). Please refer to Fig 2 for a pictorial representation of the methodology.

STEP 1: [...] (35). The complete gene was divided into smaller regions of 1000 bp each, to extract the maximum number of repeats. Each such region is termed a query sequence in this paper.

STEP 2: CONVERTING QUERY SEQUENCE TO PARALLEL COMPLEMENT
The parallel complement of the query sequence was then obtained using the online Reverse Complement Tool (https://www.bioinformatics.org/sms/rev_comp.html) (36). The parallel complement then serves as the subject sequence.

STEP 3: ALIGNMENT OF QUERY AND SUBJECT SEQUENCE
The query sequence and subject sequence from the above steps were aligned for homology using the widely used Align Sequences Nucleotide BLAST tool (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=MegaBlast&PROGRAM=blastn&BLAST_PROGRAMS=megaBlast&PAGE_TYPE=BlastSearch&BLAST_SPEC=blast2seq&DATABASE=n/a&QUERY=&SUBJECTS) (37). For the present study, the program was optimized for somewhat similar sequences, with a word size of 7 and an expect threshold (E-value) of 100.
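The sequence preparation described above (splitting a gene into 1000 bp regions and taking the parallel complement) can be sketched in Python as follows. This is only an illustration under stated assumptions: the study itself used online tools, and the function names and the toy sequence are hypothetical. The final check shows one way to see why the alignment can reveal mirror symmetry: the reverse complement of the parallel complement (the subject's opposite strand, which BLAST also searches) is simply the query read backwards.

```python
# Hypothetical helper names; the toy sequence is made up for illustration.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def split_regions(gene: str, size: int = 1000) -> list:
    """Cut a gene sequence into consecutive regions of at most `size` bases."""
    return [gene[i:i + size] for i in range(0, len(gene), size)]


def parallel_complement(seq: str) -> str:
    """Complement every base WITHOUT reversing the order of the sequence."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement every base and reverse the order (the opposite strand)."""
    return seq.translate(COMPLEMENT)[::-1]


query = "ATGCCGTTAGGAAGGATTC"          # toy query region
subject = parallel_complement(query)   # would be the BLAST subject sequence

# The opposite strand of the subject equals the query read backwards, so a
# segment that matches itself at the same coordinates is mirror-symmetric.
print(reverse_complement(subject) == query[::-1])
```

Keeping the complement and the reversal as separate functions makes the distinction between the parallel complement and the ordinary reverse complement explicit.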

Identification of mirror repeats
Identification of mirror repeats was carried out on the basis of a pattern matching algorithm: if the position numbers of an alignment are exactly reversed between the subject and query sequences, the aligned segment is a mirror sequence. The total number of mirror sequences present in the query sequence or gene sequence can also be confirmed through a dot matrix plot. In such plots, alignments are represented as dots over a matrix. The number of dots present on the diagonal gives the total number of mirror alignments present in the analyzed sequence.
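The pattern matching rule can be expressed as a short Python sketch. It assumes that each alignment has already been reduced to its four coordinates (query start and end, subject start and end); the `Hit` type, the function names and the example coordinates are hypothetical and do not come from the study's data.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One alignment, reduced to its 1-based query and subject coordinates."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def is_mirror_hit(hit: Hit) -> bool:
    """Positions are exactly reversed between query and subject."""
    return hit.q_start == hit.s_end and hit.q_end == hit.s_start


def mirror_hits(hits: List[Hit]) -> List[Hit]:
    """Keep only the alignments that satisfy the reversed-position rule."""
    return [h for h in hits if is_mirror_hit(h)]


# Made-up coordinates for illustration only.
hits = [Hit(12, 25, 25, 12), Hit(40, 52, 310, 298), Hit(101, 110, 110, 101)]
mirrors = mirror_hits(hits)
for h in mirrors:
    length = h.q_end - h.q_start + 1
    print(f"mirror repeat at query {h.q_start}-{h.q_end}, length {length} bp")
```

A hit that passes this rule has the same midpoint in query and subject, so it is centred on the main diagonal of a dot plot, which is consistent with counting mirror alignments along the diagonal.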

Results & Discussion
The present study is the very first attempt to study mirror repeat sequences in the coding regions of Arabidopsis thaliana using a simple, recently developed bioinformatics strategy. Many previous studies have discussed the predominance of mirror repeats in many organisms including viruses, animals and plants (38). However, none of them targeted the model plant organism, which is also a valuable bioinformatics resource. In line with recent studies, this work also confirms the over-representation of mirror repeats in the photosynthetic genes of Arabidopsis. Identification of such sequences will pave the way to study their roles in the plant kingdom and how they evolve inside genomes.

Distribution of Mirror repeats in the photosynthetic genes
The present decade has firmly established the roles of many different DNA repeats, including mirror repeats. Mirror repeats that are rich in purines or pyrimidines are capable of adopting an alternative DNA secondary structure, H-DNA. This structure is proving to be a promising candidate for drug-delivery systems in molecular medicine because of its ability to modulate site-specific gene expression. Hence, the present investigation addressed the identification and distribution of mirror repeats in the model plant, as a basis for further research on the functional aspects of mirror repeats in plants. Arabidopsis thaliana has a compact genome of ~135 Mbp, which is relatively poor in repetitive DNA segments compared to other plant genomes (39). Nevertheless, our study has confirmed the overabundance of mirror repeats in all the studied genes of Arabidopsis. This may indicate how important these repeats are to the genome, as they are well conserved despite the strong evolutionary pressure for compact genome size and a shorter cell cycle in Arabidopsis. For the first time, we report the occurrence of a total of 401 unique mirror repeat sequences of varying lengths and compositions in the photosynthetic genes of Arabidopsis. Studies have already documented that in Arabidopsis the circular chloroplast DNA is composed of 154,478 bp and contains inverted repeats (40). Since many photosynthetic genes are located on the chloroplast DNA itself, the present study confirms the presence of mirror sequences on it as well.

Fig 3. Distribution of MRs (mirror repeats) in all the selected genes
As per our analysis, LHCA1 harbors the maximum number of MRs and ATPB the minimum, even though the two genes are of similar size (the gene length of LHCA1 is 1.6 kb and that of ATPB is 1.5 kb). Hence, gene size is not a determining factor for the abundance of MRs. The difference might instead reflect their functional differences, because LHCA1 is a very prominent protein subunit of the light harvesting complex associated with photosystem I, whereas ATPB encodes the beta subunit of the ATP synthase machinery. Irrespective of their location, whether on chromosomes or on chloroplast DNA, all the genes have a comparable number of mirror sequences. This is in line with the previously stated results about the enrichment of mirror sequences in eukaryotic genomes. All the mirror sequences were randomly distributed within the gene sequences, with no definite pattern. The novel strategy deployed in exploring mirror symmetries, FPCB, yielded a combination of both short and long mirror sequences with sizes ranging from 7-50 bp. These sequences were characterized on the basis of the presence of spacer nucleotides in them. Imperfect mirror sequences were more common than perfect sequences (41). Single-spacer, double-spacer and multi-spacer mirror sequences belong to the imperfect group.
The highest density of mirror symmetries was present in the region spanning 1-1000 bp, which typically corresponds to the promoter domains of the genes. Their enrichment in the promoter regions of the genes supports the notion that they are involved in replication and transcriptional regulation (42). Moving downstream into the gene, their abundance generally decreases.
The identified mirror sequences of significant length can be examined for their taxonomic distribution and evolutionary relevance. Only sequences of length ≥28 bp were considered significant for further study, a threshold taken from the default parameters (word size value) of megaBLAST analysis (43). Eleven such sequences were identified in six genes, as depicted in Table 2.
Out of the twelve genes studied, only six genes, namely atpA, LHCA1, LHCB2.1, petB, psaA and psaB, were found to have MRs of significant length. The remaining genes have shorter sequences of length less than 28 bp. The longest mirror sequence was identified in the LHCB2.1 gene, with a length of 54 bp. It was an imperfect mirror repeat with multi-spacer nucleotides. Perfect symmetries were common only among the shorter mirror sequences. None of the identified sequences of significant length was a perfect mirror repeat; all of them are imperfect mirror repeats. These sequences can be used effectively for profiling genes and even whole genomes, which paves the way towards a new criterion for the classification of genomes. Many questions are still open to research, like whether a particular mirror sequence is only located over a certain region or it is dispersed throughout the

Conclusion
The present work has reported the occurrence of mirror repeats in the photosynthetic genes of Arabidopsis thaliana for the very first time. Direct and inverted repeats were noted earlier in the genome of Arabidopsis. The presence of all of these simple DNA repeats in the genome of Arabidopsis signifies their importance in plant genomes, but the mechanisms through which they act are still an unanswered question. The present in silico analysis, using the FPCB technique, found a total of 401 mirror sequences of different lengths in the twelve photosynthetic genes. Amongst them, eleven mirror sequences of significant length were traced in six genes. Shorter repeats of length 7-12 bp were quite abundant. Irrespective of size, these sequences were present in each of the studied genes. The recently developed bioinformatics strategy FPCB was quite efficient in extracting a large number and variety of mirror repeats. New bioinformatics software based on the algorithm used here could be developed in the near future. Their predominance points towards their conserved nature in genes and hence suggests important roles in the evolution of genome and chromosome architecture. Some of these sequences, which are enriched in purines or pyrimidines, might adopt the H-DNA conformation and represent a plausible target for gene editing and drug delivery. Many questions relating to the role of mirror sequences and H-DNA in plants remain open to plant biologists. It is still to be explored whether all of the mirror sequences reported above are capable of adopting the H-DNA conformation and under what specific conditions. How do they modulate gene function? This study opens a new front for plant biotechnologists to study mirror repeats in Arabidopsis thaliana.

Acknowledgement
We would like to thank Shri Mohinder Singh ji (Hon'ble Chancellor, Starex University) for providing the facilities to perform the present research. We also want to express our gratitude to Prof. (Dr.) M. M. Goel (Hon'ble Vice-Chancellor, Starex University) for his constant encouragement and support. We also thank Dr. Vikash Bhardwaj for introducing the concept of FPCB to us.