Genome-wide Landscaping of Repetitive Elements in the Genome of Neurospora crassa

Objectives: Repetitive elements are ubiquitous in eukaryotes and contribute to gene expression and to maintaining genome architecture. Although fungi are eukaryotes, their repeat content is typically limited by small genome size and by genome defense mechanisms that counter the expansion of repeats and keep the genome streamlined. Methods/Findings: Despite the presence of such mechanisms, the Tad element was found to be active in the Adiopodoume strain of N. crassa, one of the most enigmatic discoveries made in Neurospora. From this perspective, we analyzed the genome of N. crassa OR74A (NC12) with in silico tools to identify repeats. The analysis suggests that six DNA transposons (PuntRIP) and two Tad1 (non-Long Terminal Repeat) elements share the highest homology with known repeats from RepBase. The presence of DNA-binding and transposase domains in the protein sequence of NCU02991 indicates that the element identified is a DNA transposon. Application: The results also suggest that four of the six identified PuntRIP sequences share homology with one another, which indicates that a similar sequence is repeated four times within the genome. The promoters identified in the repeats located in upstream and downstream regions indicate that these repeats may play a role in the regulation of genes.


Keywords: DNA Transposons, Non-LTR, Repetitive Elements, Tad1

Introduction
Non-coding DNA was once regarded as useless, a material remnant of the variation produced during evolution. Current developments, however, suggest that non-protein-coding DNA contributes to genome structure, function, gene regulation, rapid speciation and genome defense systems. These non-coding regions include repetitive elements, which are nucleotide sequences present in multiple copies throughout the genome. Classically, repeats are classified into three main types: terminal repeats, tandem repeats and interspersed repeats, of which interspersed repeats are the best studied. Completion of the Human Genome Project and some 150 other eukaryotic genome sequencing projects opened a new era at the dawn of the 21st century, one that emphasizes the study of the function and importance of repeats. According to the selfish or parasitic DNA theory, non-coding DNA persists only because of its power to replicate itself, or because it has mutated into a form advantageous to the cell. Eukaryotic genomes contain millions of copies of repeats 1 that might play a crucial role in various metabolic processes. Understanding and analyzing such a large body of uncharacterized repeats in the genome may reveal new mechanisms that play a crucial role in the regulation of genes in eukaryotes 2 .
Rapid advances in genetics, genomics and molecular biology, supported by bioinformatics, have allowed researchers to examine the many features of repetitive elements in detail. The idea of transposable elements proposed by the Nobel laureate Barbara McClintock 3 fundamentally changed the thinking of researchers. Since the 1950s, much research has been carried out on repetitive elements, for example the recombination of Alu elements, which played an important role in the evolution of the human glycophorin gene family 4,5 , the presence of an Alu element in the third intron of ornithine aminotransferase 6 and the presence of a SINE in the coding region of the mRNA of the bovine prostaglandin E2 receptor 7 . Similarly, the ENCODE (Encyclopedia of DNA Elements) project has deduced that non-coding regions of DNA play an important role in regulatory mechanisms 8 . Taken together, this research suggests that repetitive elements can no longer be regarded as junk.
Repetitive elements are capable of self-replication, but certain genomes counter their expansion through three known gene-silencing pathways: Quelling, RIP (Repeat-Induced Point mutation) 9 and MSUD (Meiotic Silencing by Unpaired DNA) 10 . These mechanisms are well established in filamentous fungi, among which Neurospora provided the first example of a eukaryotic genome defense system. In Neurospora, Quelling co-operates with RIP and MSUD in controlling the expansion of transposons within the genome. Quelling is a Post-Transcriptional Gene Silencing (PTGS) mechanism observed in Neurospora that is similar to co-suppression in plants and RNA interference (RNAi) in animals 11 .
Despite these strong defense mechanisms, an active non-LTR element known as Tad was identified in the Adiopodoume strain of Neurospora 12 , and it is capable of escaping the RIP pathway 13 . An inactive DNA transposon known as Punt RIP , which has been inactivated by RIP, was also identified in N. crassa in a methylated pseudogene 14 . On this basis, the present work was carried out to identify repetitive elements in the N. crassa OR74A (NC12) strain with the help of in silico tools. This study sheds light on the presence of non-coding elements in N. crassa and should help in understanding the role of non-coding DNA and its ability to escape the strong genome defense mechanisms of N. crassa.

Materials and Methods
The genome sequence of N. crassa was obtained from the sequencing consortium Fungal Genome Initiative (BROAD Institute) 15 ; the sequence information is now available at FungiDB 16 . All sequences were publicly available and were retrieved before August 26th, 2013. Two programs were applied to the genome sequence of N. crassa to detect sequences corresponding to repetitive elements. A combination of two tools was used because there are no tools specifically designed to identify repetitive elements in fungal genomes. The complete genome of N. crassa was analyzed for repetitive elements using RepeatMasker 17 and Censor 18 with the RepBase library, a manually curated repetitive sequence library, as the reference dataset. The size and homology criteria for repeat identification were a low cutoff threshold for Censor (which uses the DASHER3 algorithm) (4.5), a low ratio of mismatches to transitions (2:1) and a relatively high LOCAL alignment score (30.0). In both programs the sequence source was set to fungi. RepeatMasker uses HMMER, whereas Censor uses WU-BLAST (Washington University Basic Local Alignment Search Tool), for the prediction of repetitive elements.
The results from the two programs were merged, yielding a set of repetitive elements. Nucleotide sequences with the lowest scores and low similarity to the reference sets were excluded from further analysis. The remaining nucleotide sequences were subjected to BLAST to predict their function and were then translated into protein sequences. For each translated sequence, the respective domains were identified with ProDom 19 . Figure 1 represents the step-by-step process for identification of the repetitive elements in N. crassa.
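The merging and filtering step can be illustrated with a minimal Python sketch. The text does not state how hits from the two programs were combined, so the rule used here (overlapping hits to the same repeat family on the same sequence are collapsed into one interval, after dropping hits below a score threshold) is an assumption, as are the `Hit` record, the `merge_hits` function, the `min_score` parameter and all values in the example.

```python
from collections import namedtuple

# Hypothetical record for one repeat hit reported by either program.
Hit = namedtuple("Hit", "seq_id start end family score source")


def merge_hits(hits, min_score=0.0):
    """Drop hits scoring below min_score, then merge overlapping hits
    that lie on the same sequence and match the same repeat family."""
    kept = sorted(
        (h for h in hits if h.score >= min_score),
        key=lambda h: (h.seq_id, h.family, h.start, h.end),
    )
    merged = []
    for h in kept:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last["seq_id"] == h.seq_id
            and last["family"] == h.family
            and h.start <= last["end"]
        ):
            # Overlap: extend the interval, keep the best score,
            # and remember which program(s) reported it.
            last["end"] = max(last["end"], h.end)
            last["score"] = max(last["score"], h.score)
            last["sources"].add(h.source)
        else:
            merged.append(
                {
                    "seq_id": h.seq_id,
                    "family": h.family,
                    "start": h.start,
                    "end": h.end,
                    "score": h.score,
                    "sources": {h.source},
                }
            )
    return merged


# Toy input with made-up identifiers, coordinates and scores.
example_hits = [
    Hit("seqA", 1, 900, "PuntRIP", 500.0, "RepeatMasker"),
    Hit("seqA", 50, 975, "PuntRIP", 450.0, "Censor"),
    Hit("seqA", 2000, 2050, "Tad1", 20.0, "Censor"),
]
example_merged = merge_hits(example_hits, min_score=100.0)
```

In this toy input the two overlapping PuntRIP hits collapse into a single record supported by both programs, while the low-scoring Tad1 hit is discarded. A real pipeline would also need parsers for each program's output format, which are not shown.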

Results and Discussion
In the arena of DNA sequencing, faster and cheaper technologies have created the challenge of annotating sequence data, which includes both coding and non-coding regions. The availability of the complete genome sequence of N. crassa, together with the identification of the Tad element and the RIP-sensitive Punt RIP element, motivated the present study on the identification of repetitive elements in N. crassa.

General Genome Features of Neurospora crassa
The N. crassa genome was assembled into 21 scaffolds with an N50 of approximately 6.07 Mb, encompassing 39 Mb in total.
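For readers unfamiliar with the N50 statistic, the following Python sketch shows how it is commonly computed from a list of scaffold lengths. The function name `n50` and the toy lengths are illustrative assumptions; they are not the scaffold lengths of the N. crassa assembly.

```python
def n50(lengths):
    """Return the N50 of an assembly: the length L such that scaffolds
    of length >= L together cover at least half of the total size."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy scaffold lengths (arbitrary units), for illustration only.
toy_scaffolds = [10, 8, 6, 4, 2]
toy_n50 = n50(toy_scaffolds)
```

With the toy lengths the total is 30, and the two longest scaffolds (10 and 8) are the first to cover at least half of it, so the N50 is the length of the second one.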

Identification of Repetitive Elements
In the present study, repetitive elements in N. crassa were identified through a series of steps. A total of 9730 genes comprising 23,369,813 bp of nucleotide sequence were retrieved from the BROAD Institute. For each gene, 1,000 bp of upstream sequence and 1,000 bp of downstream sequence were also retrieved.
Since there are no precise tools for predicting fungal-specific repetitive elements, a combined approach using Censor and RepeatMasker was employed to identify the repeats in the genome of N. crassa. The resulting data were mined according to their occurrence in different regions (genes, upstream and downstream) of the genome. The in silico predictions retrieved two different types of elements in N. crassa, based on sequence similarity to the elements present in the RepBase database.
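The extraction of 1,000 bp flanking regions can be sketched as follows. This is a minimal illustration that assumes 1-based inclusive coordinates and strand-aware flanks; the text does not describe how the flanks were actually retrieved, and the function `flanking_regions`, its parameters and the example coordinates are hypothetical.

```python
def flanking_regions(start, end, strand, seq_length, flank=1000):
    """Return the upstream and downstream flank coordinates of a gene.

    Coordinates are 1-based and inclusive. Flanks are clipped at the
    ends of the sequence; a flank that would be empty is returned as None.
    """
    if not (1 <= start <= end <= seq_length):
        raise ValueError("gene coordinates out of range")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")

    # Region to the left and to the right of the gene on the forward strand.
    left = (max(1, start - flank), start - 1) if start > 1 else None
    right = (end + 1, min(seq_length, end + flank)) if end < seq_length else None

    # On the reverse strand, upstream lies to the right of the gene.
    if strand == "+":
        return {"upstream": left, "downstream": right}
    return {"upstream": right, "downstream": left}


# Made-up coordinates, for illustration only.
forward = flanking_regions(5000, 6000, "+", 100000)
reverse = flanking_regions(5000, 6000, "-", 100000)
clipped = flanking_regions(500, 900, "+", 1200)
```

Clipping matters for genes that lie close to a scaffold end, where fewer than 1,000 bp of flanking sequence are available.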

Annotation of Identified Repeats
The identified repeat families were categorized into different datasets, which we reviewed manually on the basis of their similarity scores against known repeats. Nucleotide sequences with the lowest scores and little or no similarity to the reference datasets were excluded from further analysis. The repetitive elements identified in the different regions of the genome were similar to the known active non-LTR element Tad1 and the non-active DNA transposon Punt RIP of N. crassa (Table 2). The nucleotide sequences of the gene NCU02991, the upstream regions of NCU05159 and NCU07945, and the downstream regions of NCU02990, NCU07896 and NCU09994 were found to be similar to the non-active DNA transposon Punt RIP , which was originally found in a 5S rRNA pseudogene of N. crassa. The only active repetitive element found in N. crassa is Tad1, a LINE-like non-LTR element 20 . By in silico analysis we retrieved similar nucleotide sequences in the upstream region of NCU016528 and in the downstream region of NCU03846.

These nucleotide sequences were further analyzed for sequence similarity to known repeats by local alignment, and the masked sequences were obtained. The alignment files and the masked regions of the sequences are given in additional file 1. From the sequence similarity, we also identified the start and end positions of the repeats in the nucleotide sequences relative to the reference repeats (Table 3). The sequence analysis indicates that the identified repeats are more than 70% similar to known repeats.
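As an illustration of how a percentage similarity between an identified repeat and a reference repeat can be derived from a pairwise alignment, the sketch below computes percent identity over aligned columns. This is one common definition and is an assumption here: the similarity measures reported by Censor and RepeatMasker may be defined differently, and the function name `percent_identity` and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    Identical non-gap residues count as matches. Columns in which both
    sequences have a gap are ignored; all other columns are counted.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    columns = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # uninformative column
        columns += 1
        if a == b:
            matches += 1
    if columns == 0:
        raise ValueError("alignment has no informative columns")
    return 100.0 * matches / columns


# Toy alignment, for illustration only: 4 matches over 6 columns.
toy_identity = percent_identity("ACGT-A", "ACCTTA")
```

Under this definition a repeat would pass a 70% threshold only if at least seven in ten aligned columns match, with gaps opposite a residue counted as mismatches.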
The nucleotide sequences of these identified repeats were subjected to functional analysis by BLAST. Being a gene, the nucleotide sequence of NCU02991 retrieved homology with itself; it is annotated as a hypothetical protein with a DNA-binding function. Interestingly, the nucleotide sequences upstream of NCU07945 and downstream of NCU02990 and NCU07896 were also observed to share homology with NCU02991. DNA binding is known to be one of the integral domains of DNA transposons 21 ... hypothetical protein, whereas no homologies were observed for the sequences upstream of NCU09994 and downstream of NCU03846.

Table 3. The start and end regions of repeats identified. Legend: The start and end positions of the identified repeats relative to the reference repeats.
The results also suggest that a similar Punt RIP sequence is repeated four times within the genome. These four regions are the gene NCU02991, the region upstream of NCU07945 and the regions downstream of NCU02990 and NCU07896, whose sequences share homology. These sequences were further subjected to multiple sequence alignment using T-COFFEE 22 . On the bad-to-good scale, the overall alignment of the four sequences was average, with a total score of 310. The multiple alignment results also reveal that most of the ...

In the gene sequence of NCU02991, the ORF and exon regions were predicted to span the region 1-813 bp. Furthermore, the ORF and exon regions were observed to lie within the repeat region of 1-975 bp. This indicates that the repeat region plays a crucial role for the protein encoded by this gene. Hence, to understand the function of its translated sequence, the protein sequence of NCU02991 was retrieved and subjected to ProDom for domain analysis. A DNA-binding homeodomain and a Tc5 transposase domain were the two domains predicted for NCU02991. These domains are represented schematically according to their positions in the protein sequence of NCU02991 (Figure 2). The DNA-binding domain uses a helix-turn-helix structure for DNA recognition 23 and the transposase domain acts as a catalyst for the cut-and-paste mechanism 24 . These are basic characteristics of DNA transposons, and their prediction in the protein of NCU02991 substantiates that NCU02991 belongs to the class of DNA transposons.

Similarly, promoters and transcription start sites (TSS) were predicted for the repeats identified upstream or downstream of genes. The promoter and TSS could not be predicted only for the nucleotide sequence downstream of NCU02990, whereas both promoters and TSS were predicted in the other six repeats (Table 4).

Table 4. Promoters predicted in the dataset of identified repeats in upstream and downstream regions

The prediction of promoters and transcription start sites indicates that these repetitive elements may have a role in the regulation of genes.

Conclusion
Repetitive elements were initially considered to be part of junk DNA, but continued advances in DNA sequencing and the ENCODE project suggest that repetitive elements have a role in regulation. A systematic screening for repetitive elements in the genomic sequence of the non-pathogenic fungus N. crassa was carried out. The screening strategy predicted eight repetitive elements in the genome of N. crassa, of which six are DNA transposons (Punt RIP type) and two are non-LTR elements (Tad1 type). The DNA transposon Punt RIP was identified in the gene [FGSC: NCU02991], which encodes a hypothetical protein; domain analysis of this protein revealed the presence of DNA-binding and transposase domains, which are common domains in transposons. Promoters with transcription start sites were identified in the repeats located in upstream and downstream regions, which indicates that the identified repeats may play a role in the regulation of certain genes.

Acknowledgement
This work was supported and funded by the University Grants Commission (UGC) with grant number F.No.42-669/2013 to PK. The computational part of this work was executed using the Sun Workstation provided by the Bioinformatics facility of the Department of Biochemistry and Bioinformatics, Institute of Science, GITAM University. The authors would also like to thank the Department of Biotechnology, Institute of Science, GITAM University for providing all the necessary facilities. PK acknowledges the support from a Project operated within the UGC-Major Research Project Program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.