In silico identification of novel cis-regulatory elements in Mesorhizobium loti

Feroz Khan (M.Tech. Biotech)1, Shipra Agrawal (Ph.D)2 and B. N. Mishra (Ph.D.)3§

1,3Department of Biotechnology, Institute of Engineering & Technology, Lucknow

2Jawahar Lal Nehru Center of Advanced Research, Bangalore, India

E-mail: 1ferozkhan306@, 2shipra_ag@yahoo.co.uk &

3biotechiet@

§Corresponding author

Dr. B.N. Mishra

Assistant Professor & Head

Department of Biotechnology

Institute of Engineering & Technology,

Sitapur Road, Lucknow-226012 (U.P.), India

Phone: +91 522 2363220, 2733148 Ext. 206 (O), +91 522 2731636 (Fax)

Email: biotechiet@

ABSTRACT

A computational approach was designed to detect over-represented hexanucleotides located within the -400 bp upstream sequences of four gene sets belonging to related functional categories, viz. nitrogen fixation, symbiosis, nitrogen metabolism and the glutamate family, in Mesorhizobium loti, a symbiont of the model legume Lotus japonicus. The upstream sequences of these genes were analyzed for known transcription factor (TF) binding sites, and the identified sites were then verified statistically, and compared with experimental data, using over-represented hexanucleotide frequencies. Eight families of known TF/binding sites were recognized across the sets, together with high-affinity novel motif patterns. The genome-wide occurrence of the detected patterns was verified; the matching genes included several nif genes, nod genes, nitrogen metabolism related genes and amino acid biosynthetic genes. These findings in the genome of M. loti may lead to a more detailed analysis of the regulatory network involved in the symbiotic interaction with the host plant L. japonicus.

Keywords: Regulatory binding motifs, TF binding sites in M. loti, Hexanucleotides motifs, Nitrogen fixing bacteria.

INTRODUCTION

The majority of computational analyses have been carried out on coding sequences or on proteins, and fewer on non-coding sequences [van Helden, J, 2003]. Non-coding regions are of interest because they govern the regulation of gene expression. Regulatory profiles of known and unknown genes are already being determined experimentally at a genomic scale by DNA microarray technology [Schena et al., 1995; Schena et al., 1996; Schena M, 1996; Goffeau et al., 1997; Lashkari et al., 1997; DeRisi et al., 1997]. In addition, several programs have been developed to isolate unknown patterns shared by sets of functionally related DNA sequences [Waterman et al., 1984; Galas et al., 1985; Mengeritsky G and Smith TF, 1987; Stormo GD & Hartzell GW, 1989; Hertz et al., 1990; Lawrence C & Reilly A, 1990; Cardon LR and Stormo GD 1992; Lawrence et al., 1993; Neuwald et al., 1995; Hertz G and Stormo G, 1995; Wolfertstetter et al., 1996; Bulyk et al., 2004]. Each of these programs was inspired by a particular type of signal and is generally highly efficient at detecting elements of that type.

In the present work, a simple analysis method was designed to detect over-represented novel hexanucleotides in the upstream sequences of four sets of genes belonging to related functional categories, viz. nitrogen fixation, symbiosis, nitrogen metabolism and the glutamate family, in M. loti [Kaneko et al., 2000]. The upstream regions of these genes were analyzed for known transcription factor (TF) binding sites using the prokaryotic transcription factor database (ooTFD), and the identified motifs were further verified by statistical oligonucleotide analysis.

MATERIALS AND METHODS

Retrieval of genes

The sequences of the M. loti gene sets [Kaneko et al., 2000] belonging to the functional categories symbiosis, nitrogen metabolism, glutamic acid family and nitrogen fixation were retrieved from the Rhizobase server (). These genes were selected because they relate to different nitrogen metabolic pathways (Tables 1, 2, 3 and 4).

Retrieval of upstream sequences without overlapping ORF upstream

Because bacterial genes are organized into operons, it is preferable to prevent the upstream region from overlapping upstream open reading frames (ORFs), so that it does not include too much coding sequence. For example, in Escherichia coli, 25% of the genes have an upstream neighbour closer than 50 bp, which may suggest that they belong to the same operon [van Helden J, 2003; Siegele et al., 1989; Bulyk et al., 2004].
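The clipping step can be illustrated with a minimal Python sketch. It assumes a forward-strand gene and 0-based, half-open coordinates; the function name, its parameters and the toy sequence are hypothetical and are not taken from the retrieval tool used in the study.

```python
def clip_upstream(gene_start, upstream_orf_end=None, max_len=400):
    """Upstream region of a forward-strand gene as 0-based, half-open (start, end).

    gene_start       -- index of the first base of the gene
    upstream_orf_end -- index just past the last base of the nearest upstream ORF,
                        or None if there is no upstream neighbour
    max_len          -- maximum upstream length to keep (400 bp in this study)
    """
    start = max(gene_start - max_len, 0)
    if upstream_orf_end is not None:
        # Never reach back into the neighbouring ORF; an overlapping
        # neighbour leaves an empty region.
        start = min(max(start, upstream_orf_end), gene_start)
    return start, gene_start


genome = "ACGT" * 500  # toy 2000 bp sequence, for illustration only
start, end = clip_upstream(1000, upstream_orf_end=950)
upstream_seq = genome[start:end]
```

A gene whose upstream neighbour ends 50 bp away therefore contributes only those 50 bp, while an isolated gene contributes the full 400 bp.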

Identification of known regulatory factor/ binding site

The upstream sequences of all four gene sets were analyzed for known TF-binding sites with the ‘Tfsitescan’ search tool at the object-oriented transcription factors database (ooTFD) server [Ghosh D, 2000] (Tables 5, 6, 7 and 8).

Clustering of genes on the basis of identified TF-binding site family: All gene sets studied were categorized into regulatory families on the basis of the known transcription factor/binding sites identified. Each family was recorded with its corresponding known sites, their lengths and the known transcription factor that binds them, if any (Table 9). The details of each known transcription factor/binding site, with references, are summarized in Table 10.

Hexanucleotide analysis

In M. loti, potential upstream hexanucleotide patterns were detected in each set, and genome-wide pattern identification was also performed. The modules for oligonucleotide analysis can be accessed through the Regulatory Sequence Analysis Tools (RSAT) server [van Helden et al., 2000; van Helden J, 2003] at web interface . In the analysis, several statistical parameters were used to validate true positive predictions. Following van Helden et al. (1998), these parameters are defined as follows:

Expected occurrences (exp_occ): the number of occurrences expected for the considered oligonucleotide within the set of sequences. The calculation of this value depends on the probabilistic model.

Occurrence probability (occ_pro): the probability to have N or more occurrences, given the expected number of occurrences (where N is the observed number of occurrences).

Expected matching sequences (exp_ms): the expected number of sequences with at least one occurrence.

Matching sequence probability (ms_pro): the probability of having L or more sequences with at least one occurrence of the oligonucleotide, given the probabilistic model (where L is the observed number of matching sequences).

Significance index (sig): this is a conversion of the occurrence probability that takes into account the number of possible oligonucleotides (which varies with oligo size) and applies a logarithmic transformation. The highest sig values correspond to the most over-represented oligonucleotides. A sig value higher than zero indicates over-representation.

PROBABILITIES

Various calibration models were used to estimate the probability of each oligonucleotide. From these probabilities, the expected number of occurrences was calculated and compared with the observed number of occurrences. The significance of the observed number of occurrences was calculated with the binomial formulae (van Helden et al., 1998):

Expected occurrences
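Consistent with the symbols defined in this section, and assuming that occurrences are counted on a single strand (an assumption made here; counting on both strands would double T), the expected number of occurrences can be written as:

```latex
\mathrm{exp\_occ} = p \, T, \qquad T = \sum_{j=1}^{S} \left( L_j - k + 1 \right)
```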

Where, p = probability of the pattern

S = number of sequences in the sequence set.

Lj = length of the jth regulatory region

k = length of oligomer

T = the number of possible matching positions.

Probability of sequence matching

The probability to find at least one occurrence of the pattern within a single sequence is:
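Treating each of the possible positions in sequence j as an independent trial with success probability p, this probability can be sketched as follows. The symbols q_j and T_j are introduced here for illustration, and single-strand counting is assumed:

```latex
q_j = 1 - (1 - p)^{T_j}, \qquad T_j = L_j - k + 1
```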

with the same abbreviations as above.

Expected number of matching sequences

In this counting mode, only the first occurrence in each sequence is taken into consideration. We thus have to calculate a probability of first occurrence.
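By linearity of expectation, the expected number of matching sequences is the sum, over sequences, of the probability of at least one occurrence. Under the same independence and single-strand assumptions (which are made here for illustration), this gives:

```latex
\mathrm{exp\_ms} = \sum_{j=1}^{S} \left[ 1 - (1 - p)^{L_j - k + 1} \right]
```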

with the same abbreviations as above.

Correction for autocorrelation (from Mireille Regnier)

Where,

a = the coefficient of autocorrelation

Probability of the observed number of occurrences (binomial)

The probability to observe exactly ‘obs’ occurrences in the whole family of sequences is calculated by the binomial as:
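In the notation of this section, the standard binomial probability of exactly obs occurrences in T positions is:

```latex
P(X = \mathrm{obs}) = \binom{T}{\mathrm{obs}} \, p^{\mathrm{obs}} \, (1 - p)^{T - \mathrm{obs}}
```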

Where,

obs = the observed number of occurrences,

p = the expected frequency for the pattern,

T = the number of possible matching positions.

The probability to observe ‘obs’ or more occurrences in the whole family of sequences is calculated by the sum of binomials:
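With obs, p and T as defined for the binomial probability, the upper tail is the sum of the binomial terms from obs to T:

```latex
P(X \ge \mathrm{obs}) = \sum_{i=\mathrm{obs}}^{T} \binom{T}{i} \, p^{i} \, (1 - p)^{T - i}
```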

E-value

The probability of occurrence by itself is not fully informative, because the threshold must be adapted to the number of patterns considered. The E-value represents the expected number of patterns that would be returned at random for a given P-value (probability).
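Read as a correction for the number of patterns tested, this corresponds to the following relation, in which the P-value is the probability of observing obs or more occurrences:

```latex
E\text{-value} = \mathrm{NPO} \times P\text{-value}
```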

Where,

NPO = the number of possible oligomers of the chosen length.

Significance index or coefficient (“sig”)

The significance index is simply a negative logarithm conversion of the E-value (in base 10). The significance indexes are calculated as follows:
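In symbols, the conversion described here is:

```latex
\mathrm{sig} = -\log_{10}(E\text{-value})
```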

This index is convenient to interpret: the highest values correspond to the most exceptional patterns.
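The chain from observed count to significance index can be sketched in Python as follows. This is an illustration under stated assumptions, not the RSAT implementation: occurrences are counted on a single strand with overlaps allowed, the oligonucleotide probability p is supplied by the caller (the usage line assumes equiprobable nucleotides), NPO is taken as 4^k, and all function names are hypothetical.

```python
import math


def log_binom_pmf(i, T, p):
    """Log of the binomial probability of exactly i successes in T trials."""
    return (math.lgamma(T + 1) - math.lgamma(i + 1) - math.lgamma(T - i + 1)
            + i * math.log(p) + (T - i) * math.log1p(-p))


def binomial_tail(obs, T, p):
    """P(X >= obs) for X ~ Binomial(T, p), summed term by term (needs 0 < p < 1)."""
    total = sum(math.exp(log_binom_pmf(i, T, p)) for i in range(obs, T + 1))
    return min(1.0, total)


def oligo_stats(seqs, oligo, p):
    """Occurrences, expected occurrences, P-value, E-value and sig for one oligo."""
    k = len(oligo)
    # T: number of possible matching positions over all sequences (single strand).
    T = sum(max(len(s) - k + 1, 0) for s in seqs)
    # Observed occurrences, overlapping matches included.
    obs = sum(s[i:i + k] == oligo for s in seqs for i in range(len(s) - k + 1))
    exp_occ = p * T
    p_value = binomial_tail(obs, T, p)
    e_value = (4 ** k) * p_value  # NPO assumed to be 4^k
    sig = -math.log10(e_value) if e_value > 0 else math.inf
    return {"occ": obs, "exp_occ": exp_occ, "p_value": p_value,
            "e_value": e_value, "sig": sig}


# Toy input for illustration only; p assumes equiprobable nucleotides.
stats = oligo_stats(["ATAAAACGT", "GGATAAAAT"], "ATAAAA", p=1 / 4 ** 6)
print(stats)
```

The binomial terms are evaluated in log space so that the sum stays stable when T is in the tens of thousands, as it would be for a few hundred 400 bp upstream regions.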

RESULTS

Bacterial metabolism has been widely studied and provides many examples of known regulons [Fisher et al., 1988; Hovey AK & Frank DW, 1995; Householder et al., 1999; Bulyk et al., 2004]. In many cases the TF involved in the common response is known, as well as its binding site [Ow DW et al., 1983]. These families of co-regulated genes provide ideal data sets for calibrating an analysis method that can then be extended to families whose regulatory elements are unknown. In the present work, several transcription factor families were predicted on the basis of DNA motif conservation. Genes were grouped according to related nitrogen metabolic pathways, without a priori consideration of the content of their upstream regions. They were then analyzed for potential known TFs and their binding sites (Table 9), and the results were verified by comparison with the hexanucleotides detected in each gene set (Tables 11, 12 and 13). Finally, noise was reduced by retaining only the genes that had both known TF binding sites and hexanucleotide patterns (Table 14). The results suggest that the genes finally identified, which carry known binding sites together with novel hexanucleotide patterns, play an important role in the regulation of nitrogen metabolic pathways, and hence in the regulation of the nitrogen-fixing symbiosis between the bacteria and their leguminous host plants. The results are described below:

Clusters of overlapping hexanucleotides reveal wider regulatory sequences

For each data set (Tables 1, 2, 3 and 4) we extracted the -400 bp upstream sequences relative to the transcription start site and performed hexanucleotide analyses as described in the methodology. To avoid false positive predictions, hexanucleotide patterns were retained only if they met standard statistical thresholds, i.e. significance coefficient (briefly “sig”) > 0 and number of occurrences (briefly “occ”) ≥ 4. With these cut-offs, few upstream sequences showed hexanucleotide patterns: 37 out of a total of 167 gene upstream regions across the four sets (Tables 11 and 12). The significant statistical values for each hexanucleotide pattern are summarized in Table 13. Finally, upstream regions with both hexanucleotide patterns and known TF-binding sites were retained: 20 of the 37 genes, which showed 8 known TF-binding sites (Table 14).

In most pattern cluster families, the hexanucleotides with the higher significance coefficient and number of occurrences were assumed to be the novel regulatory binding sites. Highly significant patterns generally appeared in clusters with a few additional overlapping hexanucleotide patterns of weaker significance. In the MalT family, for example, the hexanucleotide GGCAGA (sig=0.32) can be grouped with another strongly overlapping sequence, CCCCAC (sig=0.62). When the two patterns are combined, the most significant pattern corresponds to the 7 bp conserved sequence CGGCAGA, which closely matched the 6 bp known TF binding consensus sequence of Malt_Cs. Similarly, in the ExsA family the most salient hexanucleotide was ATAAAA (sig=2.70), which can be grouped with another strongly overlapping sequence, AAACGT (sig=1.12). When the two patterns are combined, the most significant pattern corresponds to the 9 bp conserved sequence ATAAAACGT, which closely matched the 8 bp known TF binding consensus sequence of the ExsA_Cs transcription factor (TNAAAANA).

In most families, viz. MalT, PhoP, ExsA and MalT_malPp, the overlapping clusters reflect the fact that the recognition domain of the transcription factor is wider than 6 nucleotides, with a conserved hexanucleotide core. The maximum significance coefficient indicates the most conserved hexanucleotide core, which usually corresponds to the bases that interact directly with the transcription factor. The significance decreases for the lateral overlaps because these positions are less crucial for TF binding.
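The way two overlapping hexanucleotides combine into a wider pattern can be sketched in Python. This is a minimal illustration that assumes a simple suffix/prefix overlap on the same strand; the function name and the minimum overlap of 3 bases are hypothetical choices, not parameters reported in the study.

```python
def merge_overlapping(left, right, min_overlap=3):
    """Merge two oligos when a suffix of `left` equals a prefix of `right`.

    Returns the combined pattern, or None when the longest shared overlap
    is shorter than min_overlap.
    """
    longest = min(len(left), len(right)) - 1
    for n in range(longest, min_overlap - 1, -1):
        if left[-n:] == right[:n]:
            return left + right[n:]
    return None


# The ExsA-family pair from the text overlaps by three bases (AAA).
combined = merge_overlapping("ATAAAA", "AAACGT")
print(combined)
```

Applied to the ExsA-family pair ATAAAA and AAACGT, the shared AAA yields the 9 bp pattern quoted in the text.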

Transcription Factor Families covering clusters of variable patterns

On the basis of pattern clusters, three TF families were categorized, viz. MalT, PhoP and ExsA (Table 14). Each family and its related genes are described below:

1. MalT family

In the study, two independent clusters of binding sites (hexanucleotide patterns) for the Malt_Cs transcription factor [Raibaud et al., 1985] were detected. Cluster I showed the higher-affinity pattern GGCAGA (sig=0.32 & occ=8), corresponding to the known binding site (e.g. GGAKGA) for the Malt_Cs TF, and cluster II, with the pattern ATAAAA (sig=2.70 & occ=6), defines a low-affinity consensus for the same TF.

2. PhoP family

For the PhoP factor, it is reported that the PhoP-PhoR two-component regulatory system controls the phosphate deficiency response in Bacillus subtilis. A number of pho regulon genes that require PhoP for activation or repression have been identified [Eder et al., 1999]. A similar situation was observed here in the PhoP family, where two independent clusters of binding sites for the PhoP TF were detected. Cluster I showed the higher-affinity pattern CGATCG (sig=1.39 & occ=6), corresponding to the known binding consensus (e.g. TTHACA) for the PhoP TF, and cluster II, with the pattern ATAAAA (sig=2.70 & occ=6), showed low affinity for the same TF.

3. ExsA family

ExsA has been implicated as a central regulator of exoenzyme production by Pseudomonas aeruginosa [Hovey AK & Frank DW, 1995]. In this study we identified hexanucleotide patterns for the same TF. In the ExsA family, cluster I showed the high-affinity pattern ATAAAA (sig=2.70 & occ=6), corresponding to the known binding consensus (e.g. TNAAAANA) for the ExsA_Cs transcription factor, while cluster II defines the low-affinity consensus GGGATA (sig=0.42 & occ=4) for the same TF.

Putative unknown regulatory sites

Besides the known regulatory sites, a few additional unknown hexanucleotides were observed within the families (Table 14). From the results for the known regulatory sites, one can infer that an ideal unknown site should appear as a cluster of overlapping hexanucleotides with a high significance coefficient. Several unknown regulatory patterns that fit these criteria were extracted from the hexanucleotide analysis and were considered good candidates for new regulatory sites on the basis of conservation. The putative families of unknown hexanucleotides are as follows:

A. Families with cluster of unknown hexanucleotide patterns

We analysed four putative regulatory families, viz. MalT, PhoP, ExsA and MalT_malPp. A similar cluster of two hexanucleotides, ATAAAA and AAACGT, appeared in the MalT, PhoP and MalT_malPp regulatory families; the pattern ATAAAA was considered highly significant because of its higher significance coefficient and number of occurrences (sig=2.70 & occ=6). Similarly, a cluster of two hexanucleotides, GGGATA and GGCAGA, appeared in the ExsA family; the pattern GGGATA was considered highly significant because of its higher significance and occurrence values (sig=0.42 & occ=6). Varying the oligonucleotide size is expected to recover the same patterns together with their flanking nucleotides.

B. Families with single unknown hexanucleotide pattern

A total of four putative regulatory families, viz. MomR/oxyR, CAP/CRP, Nitrogen regulatory and Lambda, were detected with a single unknown hexanucleotide pattern. They are described below:

1. MomR / oxyR family

The MomR protein is reported to be identical to OxyR, a regulatory protein that responds to oxidative stress [Bolker M & Kahmann R, 1989]. In the MomR/oxyR family, a single hexanucleotide pattern, AGCTTG, appeared in the upstream sequence of the symbiosis-related gene Mlr6175, which encodes chitooligosaccharide deacetylase (nodulation protein NodB). The pattern had low significance and occurrence values (sig=0.19 & occ=7) and thus showed low affinity to the known binding consensus sequence ATGCATCRW of the MomR/oxyR_Cs transcription factor.

2. CAP/CRP family

Cyclic AMP (cAMP) and its receptor protein (CRP) have a dual role in regulating the two promoters that control the galactose (gal) operon of E. coli [Taniguchi et al., 1979]. In the CAP/CRP family, a single hexanucleotide pattern, AATTCG, was detected in the upstream sequence of the nitrogen fixation related gene Mll4698, which encodes a two-component system histidine protein kinase (FixL-like). The pattern had a significance coefficient of 0.28 and 7 occurrences, and thus showed low affinity to the known binding consensus ACACTTT of the known TF (CAP/CRP-lac).

3. Nitrogen regulatory site family

In the Nitrogen regulatory family, a single hexanucleotide pattern, GGCAGA, was detected upstream of the glutamate family gene Mlr1730, which encodes histidine ammonia-lyase, with a significance coefficient of 0.32 and 8 occurrences. This indicates low affinity to the known binding consensus TTTTGCA [Ow DW et al., 1983].

4. Lambda site family

In the Lambda family, a single hexanucleotide pattern, ATTACC, was detected upstream of genes in the symbiosis functional category, e.g. gene Mll9683, which encodes the protein AtsE, with a significance coefficient of 0.15 and 4 occurrences. It shows low affinity to the known binding consensus GGYGTRYG and is therefore expected to be an unknown regulatory site. For the lambda protein, it is reported that transcription anti-termination by the bacteriophage lambda N protein is stimulated in vitro by the E. coli NusG protein [Zhou et al., 2002].

Known regulatory sites not detected through hexanucleotide analysis

Hexanucleotide analysis enabled us to detect 16 known regulatory sites out of 12 classified regulatory families in M. loti. Four known sites escaped detection through hexanucleotide analysis: (i) the binding site for Nod-factor in the Nod family, (ii) the TATA-box for the RNA polymerase sigma factor in the TATA-box family, (iii) the binding site for the NR(I) factor in the NR(I) family and (iv) the binding site for the NarL/NarP-Cs factor in the NarL/NarP family. In contrast to all other families, not a single hexanucleotide had a positive significance coefficient in these families.

DISCUSSION

It is well established that the nitrogen-fixing symbiosis between rhizobia and legumes is important for sustainable agricultural practices and contributes significantly to the global nitrogen cycle. M. loti, a bacterium of the rhizobia group, forms a symbiosis with the model legume L. japonicus. The genome of M. loti has been completely sequenced [Kaneko et al., 2000] and the genome sequencing project of L. japonicus is under way. Once the L. japonicus genome sequencing project is complete, the molecular basis of the symbiotic relationship between the two organisms can be studied more readily. In particular, emphasis needs to be given to the functional and regulatory genomics of M. loti. The genome sequence of M. loti has made available annotated genes classified under specific metabolic processes [Kaneko et al., 2000]. From the nitrogen fixation point of view, we initially took four sets of genes facilitating symbiosis, nodulation, nitrogen fixation and glutamic acid metabolism. It is pertinent to mention that M. loti, with a genome size of 7.03 Mb, carries a 500 kb transposable symbiotic island comprising clusters of genes responsible for symbiosis and nodulation in L. japonicus [Kaneko et al., 2000]. The genes involved in these processes have been annotated and are being verified experimentally, whereas the DNA motifs involved in regulating the expression of functional genes have not been well studied. In the present work, a simple computational method has been optimized to identify upstream motifs relevant to gene regulation.

A total of 57 genes were grouped in the first (symbiosis) set, and their upstream sequences were searched for known regulatory TF binding sites with a TF-binding site detection tool (Tfsitescan at the ooTFD server). The accuracy of true positive prediction was assessed by the significance coefficient (sig) measured for each TF binding site; a higher sig value and a larger number of occurrences indicate a significant prediction. TF binding sites were identified for only 21 genes. All of them showed a single occurrence of a TF binding site except two genes, which showed two occurrences: the known PhoP-consensus site, responsible for regulating expression of the nodulation protein NoeK or phosphomannomutase (mll7567), and the known MalT-CS site, responsible for regulating expression of the glutamine fructose-6-phosphate transaminase nodulation protein NodM (mlr6386). True positive predictions were statistically supported by the higher range of expectation values (expec), e.g. from 1.20e-02 to 7.26e-02.

Similarly, in the second set (nitrogen metabolism), a total of 23 genes were analyzed for known regulatory TF binding sites. Only 8 genes showed known binding sites, and all but one had a single site occurrence in their upstream sequences. The known PhoP-consensus site, responsible for regulating expression of the nitrogen regulatory protein P-II, i.e. GlnK (mll4247), had two occurrences. In this set the expectation values ranged from 1.06e-02 to 5.14e-02.

In the third set (glutamate family), a total of 35 genes were analyzed for known regulatory TF binding sites, of which only 10 showed known binding sites. Two of these 10 genes, glutamate synthetase-I (mll0313) and glutamate synthetase beta-subunit (mll1646), showed two occurrences of known binding sites, MalT-CS and ExsACS respectively. The expectation values ranged from 1.45e-01 to 5.46e-02.

Finally, a total of 54 genes were grouped in the fourth set (nitrogen fixation) and analyzed for known TF binding sites through the known database, i.e. TFD. The TF binding sites identified for 21 genes mostly showed binding patterns with single occurrences, but gene mll5857, the Nif-specific regulatory protein NifA, showed 2 occurrences for its 2 types of binding pattern, ExsACS_(1) and ExsACS_(2). Here the expectation values of the predicted occurrences ranged from 1.08e-01 to 7.56e-02. All the known binding sites identified in the above four functional categories were further verified and confirmed by oligonucleotide analysis. In addition, genes with multiple patterns of known binding sites showed a significant number of occurrences, and these sites are therefore considered the most probable binding sites.

Moreover, hexanucleotide analysis of the symbiosis gene set (Table 12) showed four additional hexanucleotide patterns, each predicted to regulate a group of genes: CCCCCA (mll1107 and mll4979), CCCCAC (mll4979*), ATTACC (mll9683*, mlr2437 and mlr5801) and AGCTTG (mlr6175*, mlr6341, mlr6622 and mlr7575), where * marks a putative significant gene with a known TF/site in its upstream sequence. In the nitrogen metabolism gene set, only one additional unknown hexanucleotide pattern, GAGCAC, was detected; it is predicted to be a potential binding site regulating three genes (mll1423, mll4247 and mlr1320). The nitrogen fixation gene set showed three additional unknown hexanucleotide patterns that were statistically verified as high-affinity binding sites controlling gene expression: AATTCG (mll3694* and mll4698*), CAGGGA (mlr3659*, mlr5871, mlr5906 and mlr5907) and CGATCG (msl6623* and msr6418). The glutamate family gene set showed four additional unknown binding sites predicted to regulate gene expression: ATAAAA (mll3030*, mll1646*, mll0343* and mll0601), AAACGT (mll3030*, mll1560 and mll1646*), GGGATA (mll3040, mll3074 and mll7254*) and GGCAGA (mll9226, mlr0039*, mlr0339, mlr1730*, mlr3506 and mlr6209*).

Finally, eight consensus oligonucleotide patterns were identified and further verified by matching with the known binding sites ACACTTT, GGAKGA, TTHACA, TNAAAANA, TCCTCC, ATGCATCRW, TTTTGCA and GGYGTRYG. These were reported earlier as binding sites of the known transcription factors/TF binding sites CAP/CRP, MalT, PhoP, ExsA, MalT_MalPp site*, MomR/oxyR, Nitrogen_regulatory site* and Lambda site* respectively (* denotes a known site with an unknown TF) (Table 14).
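Because several of these consensus sequences use degenerate IUPAC symbols (K, H, N, R, W, Y), comparing a detected pattern with a known site amounts to a degenerate match. The Python sketch below is one simple way to score such a comparison, assuming an ungapped, same-strand alignment; the function name and the scoring by number of matching positions are illustrative assumptions, not the procedure used in the study.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def best_match(consensus, seq):
    """Slide `consensus` along `seq` and return (matches, offset) of the best window.

    Returns (0, -1) when seq is shorter than the consensus or nothing matches.
    Ties keep the leftmost window.
    """
    best = (0, -1)
    width = len(consensus)
    for offset in range(len(seq) - width + 1):
        window = seq[offset:offset + width]
        score = sum(base in IUPAC[sym] for sym, base in zip(consensus, window))
        if score > best[0]:
            best = (score, offset)
    return best


# Toy sequence containing TTCACA, which satisfies the PhoP consensus TTHACA.
score, offset = best_match("TTHACA", "GGTTCACAGG")
print(score, offset)
```

A score equal to the consensus length indicates a perfect degenerate match; lower scores give a rough measure of how far a detected pattern departs from the known site.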

CONCLUSIONS

The present study covered a wide range of annotated genes participating in different nitrogen-related metabolic pathways, which were analyzed statistically for their co-regulation coherence and for the potential of TFs to bind the related cis-elements. The predicted TF binding sites were supported by significant statistical values and further verified by matching with known TF binding sites. Eight families of known TF binding sites were recognized, along with high-affinity new hexanucleotide patterns for the gene sets studied. Such findings in the genome of M. loti may lead to a more detailed analysis of the regulatory network involved in the symbiosis between rhizobia and the host plant L. japonicus.

ACKNOWLEDGEMENTS

We acknowledge the Council of Scientific & Industrial Research (CSIR), New Delhi, for financial support in the form of an SRF (Biotechnology), and the All India Council for Technical Education (AICTE), New Delhi, for financial support of the M.Tech. Biotechnology Teaching Programme at the Department of Biotechnology (A Centre of Excellence in Biotechnology), Institute of Engineering and Technology, Lucknow (U.P.), India.

REFERENCES

| |Bolker M and Kahmann R (1989). The Escherichia coli regulatory protein oxyr discriminates between methylated and |

| |unmethylated states of the phage Mu mom promoter. EMBOJ, 8(8):2403-10 |

| |Bulyk ML, McGuire AM, Masuda N and Church GM (2004). A motif co-occurrence approach for genome-wide prediction of |

| |transcription-factor-binding sites in Escherichia coli. Genome Research, 14 (2):201-208. |

| |Cardon LR and Stormo GD (1992). Expectation maximization algorithm for identifying protein-binding sites with variable |

| |lengths from unaligned DNA fragments. Journal of Molecular Biology, 223:159-170. |

| |DeRisi JL, Iyer VR and Brown PO (1997). Exploring the metabolic and genetic control of gene expression on a genomic |

| |scale. Science, 278:680-686. |

| |Eder S, Liu W and Hulett FM (1999). Mutational analysis of the phod promoter in Bacillus subtilis: implications for phop|

| |binding and promoter activation of Pho regulon promoters. Journal of Bacteriology, 181(7):2017-25. |

| |Fisher RF, Egelhoff TT, Mulligan JT and Long SR (1988). Specific binding of proteins from Rhizobium meliloti cell free |

| |extracts containing nodD to DNA sequences upstream of inducible nodulation genes. Genes Dev., 2(3):282-93. |

| |Galas DJ, Eggert M and Waterman MS (1985). Rigorous pattern-recognition methods for DNA sequences: Analysis of promoter |

| |sequences from E. Coli. Journal of Molecular Biology, 186(1):117—128. |

| |Ghosh D (2000). Object-oriented transcription factors database (ooTFD). Nucleic Acids Research, 1; 28(1): 308-10. |

| |Goffeau A, Park J, Paulsen IT, Jonniaux JL, Dinh T, Mordant P, and Saier MH Jr. (1997). Multidrug-resistant transport |

| |proteins in yeast: complete inventory and phylogenetic characterization of yeast open reading frames within the major |

| |facilitator superfamily. Yeast, 13:43-54. |

| |Hertz G and Stormo G (1995). Identification of Consensus Patterns in Un- aligned DNA and Protein Sequences: A |

| |Large-Deviation Statistical Basis for Penalizing Gaps. Proceedings of the 3rd International Conference on Bioinformatics|

| |and Genome Research, p201-216. |

| |Hertz GZ, Hartzell GW and Stormo GD (1990). Identification of consensus patterns in unaligned DNA sequences known to be |

| |functionally related. Computer Applications in the Biosciences, 6:81-92. |

| |Householder TC, Belli WA, Lissenden S, Cole JA and Clark VL, (1999). Cis- and trans-acting elements involved in |

| |regulation of ania, the gene encoding the major anaerobically induced outer membrane protein in Neisseria gonorrhoeae. |

| |Journal of Bacteriology, 181(2):541-51. |

| |Hovey AK and Frank DW (1995). Analyses of the DNA binding and transcriptional activation properties of exsa, the |

| |transcriptional activator of the Pseudomonas aeruginosa exoenzyme S regulon. Journal of Bacteriology, 177(15):4427-36. |

| |Kaneko T. et al. (2000). Complete genome structure of the nitrogen-fixing symbiotic bacterium Mesorhizobium loti. DNA |

| |Research, 7:331-338 |

| |Lashkari DA, DeRisi LJ, McCusker JH, Namath AF, Gentile C, Hwang SY, Brown PO and Davis RW (1997). Yeast microarrays for|

| |genome wide parallel genetic and gene expression analysis. Proceedings of National Academy of Sciences, USA (PNAS), |

| |94:13057-13062. |

| |Lawrence C and Reilly A (1990). An expectation maximization (EM) algorithm for the identification and characterization |

| |of common sites in unaligned biopolymer sequences. Proteins, 7 (1):41-51. |

| |Lawrence C, Altschul S, Boguski M, Liu J, Neuwald A and Wootton J (1993). Detecting subtle sequence signals: a Gibbs |

| |sampling strategy for multiple alignment. Science, 262 (5131):208-14. |

| |Mengeritsky G and Smith TF (1987). Recognition of characteristic patterns in sets of functionally equivalent DNA |

| |sequences. Bioinformatics, Vol 3:223-227. |

| |Neuwald AF, Liu JS and Lawrence CE (1995). Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. |

| |Protein Science, 4: 1618-1632. |

| |Ow DW, Sundaresan V, Rothstein DM, Brown SE and Ausubel FM (1983). Promoters regulated by the glng (ntrc) and nifa gene |

| |products share a heptameric consensus sequence in the –15 region. Proceedings of National Academy of Sciences, USA, |

| |80(9): 2524-8. |

| |Raibaud O, Gutierrez C and Schwartz M (1985). Essential and nonessential sequences in malpp, a positively controlled |

| |promoter in Escherichia coli. Journal of Bacteriology, 161(3):1201-8. |

| |Reitzer LJ and Magasanik B (1986). Transcription of glna in Escherichia coli is stimulated by activator bound to sites |

| |far from the promoter. Cell, 20; 45(6):785-92. |

| |Schena M (1996). Genome analysis with gene expression microarrays. Bioassays, 18:427-431. |

| |Schena M, Shalon D, Davis RW and Brown PO (1995) Quantitative monitoring of gene expression patterns with a |

| |complimentary DNA microarray. Science, 270:467-470. |

| |Schena M, Shalon D, Heller R, Chai A, Brown PO and Davis RW (1996) Parallel human genome analysis: microarray-based |

| |expression monitoring of 1000 genes. Proceedings of National Academy of Sciences, USA, 93:10614-10619. |

| |Siegele DA, Hu JC, Walter WA and Gross CA (1989). Altered promoter recognition by mutant forms of the sigma 70 subunit |

| |of Escherichia coli RNA polymerase. Journal of Molecular Biology, 20; 206(4):591-603. |

| |Stormo GD and Hartzell GW (1989). Identifying protein-binding sites from unaligned DNA fragments. Proceedings of |

| |National Academy of Sciences, USA, 86:1183-1187. |

| |Taniguchi T, O’Neill M and de Crombrugghe B (1979). Interaction site of Escherichia coli cyclic AMP receptor protein on |

| |DNA of galactose operon promoters. Proceedings of National Academy of Sciences, USA, 76 (10): 5090-4. |

| |van Helden J (2003). Prediction of transcriptional regulation by analysis of the non-coding genome. Current Genomics, |

| |4(3):217-224. |

| |van Helden J (2003). Regulatory sequence analysis tools. Nucleic Acids Research, 31(13):3593-6. |

| |van Helden J, Andre B and Collado-Vides J (1998). Extracting regulatory sites from the upstream region of yeast genes by|

| |computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281(5):827-42. |

| |van Helden J, Andre B and Collado-Vides J (2000). A web site for the computational analysis of yeast regulatory |

| |sequences. Yeast, 16(2):177-187. |

| |Waterman MS, Smith TF and Katcher HL (1984). Algorithms for restriction map comparisons. Nucleic Acids Research, Vol. |

| |12, Issue 1:237-242. |

| |Wolfertstetter F, Frech K, Herrmann G and Werner T (1996). Identification of functional elements in unaligned nucleic |

| |acid sequences by a novel tuple search algorithm. Computer Applications in the Biosciences, 12(1):71-80. |

| |Zhou Y, Filter JJ, Court DL, Gottesman ME and Friedman DL (2002). Requirement for NusG for transcription antitermination|

| |in vivo by the lambda N protein. Journal of Bacteriology, 184(12):3416-8. |

Annexure-Tables

Table 1. Set of genes belonging to the symbiosis functional category in M. loti.

Table 2. Set of genes belonging to the nitrogen metabolism functional category in M. loti.

Table 3. Set of genes belonging to the glutamate functional category in M. loti.

Table 4. Set of genes belonging to the nitrogen fixation functional category in M. loti.

Table 5. Known transcription factors/binding sites identified in the upstream sequences of the M. loti symbiosis gene set.

|S.No. |TF/Site |Reference |Consensus sequence |Length (bp) |Description |

|1 |CAP/CRP |Taniguchi et al. (1979) |AAGATGCGAAA |11 |Binding site for the CAP/CRP complex in the lac operon of E. coli. cAMP-CRP plays an important role in transcription initiation. |

| | | |AAAGTGTGACA |11 | |

| | | |TCCATGTCACA |11 | |

| | | |AAAGCGCTACA |11 | |

| | | |ACACTTT |7 | |

|2 |MalT-CS |Raibaud et al. (1985) |GGAKGA |6 |Binding site of MalT on the malPp promoter in E. coli. |

|3 |PhoP |Eder et al. (1999) |TTHACA |6 |Controls the phosphate deficiency response in Bacillus subtilis and is required for activation or repression of Pho regulon genes. |

|4 |ExsA |Hovey AK & Frank DW (1995) |TNAAAANA |8 |A central regulator of exoenzyme S production by Pseudomonas aeruginosa. |

|5 |NOD-Box |Fisher et al. (1988) |ATCCAAACAATCRATTTTACCAATC |25 |Binding site of the NodD protein, upstream of inducible nodulation genes in Rhizobium meliloti. |

|6 |MomR/oxyR CS |Bolker M & Kahmann R (1989) |ATGCATCRW |9 |The E. coli regulatory protein OxyR discriminates between methylated and unmethylated states of the phage Mu mom promoter. |

|7 |Nitrogen-regulatory site* |Ow et al. (1983) |TTTTGCA |7 |Promoters regulated by the glnG (ntrC) and nifA gene products. Known site only. |

|8 |Lambda-C site* |Zhou et al. (2002) |GGYGTRYG |8 |Transcription antitermination by the bacteriophage lambda N protein is stimulated in vitro by the Escherichia coli NusG protein. Known site only. |

|9 |TATA-Box |Siegele et al. (1989) |TATAAT |6 |Binding site of RNA polymerase, sigma70 subunit, in E. coli. |

|10 |NR(I) |Reitzer LJ & Magasanik B (1986) |GCACGATGGTGC |12 |A regulatory protein that stimulates transcription at the N2-regulated promoter glnAp2 and acts as an enhancer in E. coli. |

|11 |NarL/NarP-CS |Householder et al. (1999) |TACYNMT |7 |The gonococcal FNR and NarP homologs are involved in the regulation of aniA. AniA (formerly Pan1) is the major anaerobically induced outer membrane protein in Neisseria gonorrhoeae. |

|12 |MalT_malPp* |Raibaud et al. (1985) |TCCTCC |6 |Conserved known site in malPp, a positively controlled promoter in Escherichia coli. Known site only. Binding site of MalT on the malPp promoter in E. coli. |
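Several consensus sequences in Table 5 use degenerate IUPAC symbols (for example K in GGAKGA or H in TTHACA). The following is a minimal Python sketch of how such a consensus could be scanned against an upstream sequence. It is not the tool used in this study; the function names and the coordinate convention (negative offsets counted back from the 3' end of the upstream region, with -1 as the last base) are assumptions made for illustration.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate a degenerate consensus such as GGAKGA into a regex string."""
    return "".join(IUPAC[base] for base in consensus.upper())

def find_sites(consensus, upstream):
    """Return (start, matched word) pairs on the given strand.

    Hypothetical convention: start is a negative offset from the 3' end
    of the upstream sequence, so the last base has position -1.
    """
    seq = upstream.upper()
    # The lookahead allows overlapping sites to be reported as well.
    lookahead = re.compile("(?=(%s))" % consensus_to_regex(consensus))
    return [(m.start() - len(seq), m.group(1)) for m in lookahead.finditer(seq)]

# Example with a made-up 17 bp sequence (not M. loti data).
print(find_sites("GGAKGA", "ttGGATGAccGGAGGAa"))
```

Only the strand given is scanned here; searching the reverse strand would require scanning the reverse complement as well.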

Table 11. List of potential upstream regulatory hexanucleotide patterns of M. loti identified in all gene sets through oligonucleotide analysis.

|Hexanucleotide pattern |Sequence ID |Strand |Start position |End position |Matching word |Score |

|Functional gene category: Glutamate family |

|AAACGT |mll0343 |D |-44 |-39 |caaaAAACGTcatc |1.12 |

| |mll0343 |D |-27 |-22 |caaaAAACGTcatc |1.12 |

| |mll0601 |D |-19 |-14 |gtttAAACGTgatg |1.12 |

| |mll1646 |D |-44 |-39 |aataAAACGTgcta |1.12 |

| |mll3030 |R |-348 |-343 |gagaAAACGTaagc |1.12 |

|ATAAAA |mll1560 |R |-24 |-19 |cgcaATAAAAtagc |2.70 |

| |mll1646 |D |-55 |-50 |attgATAAAAaaat |2.70 |

| |mll1646 |D |-47 |-42 |aaaaATAAAAcgtg |2.70 |

| |mll1646 |R |-101 |-96 |cttgATAAAAacat |2.70 |

| |mll3030 |D |-260 |-255 |tgccATAAAAattt |2.70 |

| |mll3030 |R |-324 |-319 |cgacATAAAAtttc |2.70 |

|GGGATA |mll3040 |R |-161 |-156 |ggacGGGATAtc |0.42 |

| |mll3074 |R |-11 |-6 |gaaaGGGATAaacg |0.42 |

| |mll7254 |D |-54 |-49 |atcaGGGATAgcgc |0.42 |

| |mll7254 |R |-141 |-136 |agctGGGATAtcgg |0.42 |

|GGCAGA |mll9226 |D |-79 |-74 |ctcaGGCAGAacgg |0.32 |

| |mlr0039 |D |-38 |-33 |ggaaGGCAGAcgcc |0.32 |

| |mlr0039 |D |-385 |-380 |ttgaGGCAGAtttg |0.32 |

| |mlr0039 |D |-56 |-51 |cgacGGCAGAcgtg |0.32 |

| |mlr1730 |D |-124 |-119 |ggcaGGCAGAgcga |0.32 |

| |mlr1730 |R |-375 |-370 |ctccGGCAGAcgcg |0.32 |

| |mlr3506 |R |-44 |-39 |aaatGGCAGAcgga |0.32 |

| |mlr6209 |D |-354 |-349 |tggaGGCAGAacct |0.32 |

|Functional gene category: Nitrogen fixation |

|AATTCG |mll3694 |D |-358 |-353 |ggcgAATTCGagcg |0.28 |

| |mll3694 |D |-199 |-194 |gaggAATTCGtcct |0.28 |

| |mll3694 |D |-179 |-174 |caagAATTCGcaat |0.28 |

| |mll3694 |D |-111 |-106 |tcgcAATTCGgcgc |0.28 |

| |mll3694 |R |-360 |-355 |ctcgAATTCGccgc |0.28 |

| |mll3694 |R |-309 |-304 |acgcAATTCGacgc |0.28 |

| |mll4698 |R |-297 |-292 |gttcAATTCGcagg |0.28 |

| |mll4698 |R |-268 |-263 |ctacAATTCGtcgg |0.28 |

|CAGGGA |mlr3659 |D |-7 |-2 |caagCAGGGAt |0.10 |

| |mlr5871 |R |-282 |-277 |cgtcCAGGGAgagc |0.10 |

| |mlr5906 |D |-92 |-87 |ccgaCAGGGAggcg |0.10 |

| |mlr5907 |D |-111 |-106 |caggCAGGGAggcc |0.10 |

| |mlr5907 |R |-122 |-117 |ctgcCAGGGAggcc |0.10 |

| |mlr5907 |R |-46 |-41 |agcaCAGGGAcatc |0.10 |

|CGATCG |msl6623 |D |-340 |-335 |gtctCGATCGcacc |1.39 |

| |msl6623 |D |-205 |-200 |atcgCGATCGccat |1.39 |

| |msl6623 |D |-133 |-128 |gcggCGATCGccat |1.39 |

| |msl6623 |D |-8 |-3 |ggccCGATCGtc |1.39 |

| |msr6418 |D |-337 |-332 |gtctCGATCGcgcc |1.39 |

| |msr6418 |D |-196 |-191 |attgCGATCGtcta |1.39 |

| |msr6418 |D |-130 |-125 |gcggCGATCGcgat |1.39 |

|Functional gene category: Nitrogen metabolism |

|GAGCAC |mll1423 |D |-341 |-336 |gggcGAGCACatgc |1.66 |

| |mll1423 |R |-282 |-277 |gcagGAGCACagcg |1.66 |

| |mll1423 |R |-260 |-255 |ggccGAGCACcgac |1.66 |

| |mll1423 |R |-188 |-183 |cctcGAGCACgccg |1.66 |

| |mll4247 |D |-190 |-185 |ttttGAGCACatgg |1.66 |

| |mll4247 |R |-239 |-234 |ctttGAGCACgatc |1.66 |

| |mlr1320 |D |-150 |-145 |gaaaGAGCACcccg |1.66 |

|Functional gene category: Symbiosis |

|CCCCCA |mll1107 |R |-95 |-90 |agtgCCCCCAgtct |0.64 |

| |mll4979 |D |-171 |-166 |gcagCCCCCAcctc |0.64 |

| |mll4979 |D |-111 |-106 |cgcaCCCCCAcccc |0.64 |

| |mll4979 |D |-47 |-42 |cggaCCCCCAcaag |0.64 |

| |mll4979 |R |-128 |-123 |ccgaCCCCCAcccg |0.64 |

|CCCCAC |mll4979 |D |-170 |-165 |cagcCCCCACctcc |0.62 |

| |mll4979 |D |-110 |-105 |gcacCCCCACcccg |0.62 |

| |mll4979 |D |-46 |-41 |ggacCCCCACaagg |0.62 |

| |mll4979 |R |-155 |-150 |acctCCCCACaagg |0.62 |

| |mll4979 |R |-129 |-124 |cgacCCCCACccgg |0.62 |

|ATTACC |mll9683 |R |-291 |-286 |catgATTACCgcga |0.15 |

| |mlr2437 |D |-307 |-302 |cccgATTACCgtga |0.15 |

| |mlr5801 |D |-392 |-387 |atccATTACCcaag |0.15 |

| |mlr5801 |D |-38 |-33 |caacATTACCccac |0.15 |

|AGCTTG |mlr6175 |R |-129 |-124 |atgcAGCTTGcgcc |0.19 |

| |mlr6175 |R |-101 |-96 |ggggAGCTTGtcgc |0.19 |

| |mlr6341 |D |-322 |-317 |cttaAGCTTGtctc |0.19 |

| |mlr6622 |R |-207 |-202 |tgtcAGCTTGctc |0.19 |

| |mlr6622 |R |-178 |-173 |tcgcAGCTTGagct |0.19 |

| |mlr7575 |D |-336 |-331 |gcggAGCTTGcagt |0.19 |

| |mlr7575 |D |-78 |-73 |gcgaAGCTTGaacc |0.19 |

Table 12. Clustering of M. loti genes with the corresponding upstream hexanucleotide patterns identified in each functional category gene set, with significant statistical values.

|Functional category |Genes |Pattern |Observed frequency |Expected frequency |Occ |Sig value |Ms |

|1. Symbiosis |mll1107 mll4979 |CCCCCA |0.00343 |0.000317 |5 |0.64 |2 |

| |mll4979* |CCCCAC |0.00343 |0.000321 |5 |0.62 |1 |

| |mll9683* mlr2437 mlr5801 |ATTACC |0.00210 |0.00017 |4 |0.15 |3 |

| |mlr6175* mlr6341 mlr6622 mlr7575 |AGCTTG |0.003419 |0.000618 |7 |0.19 |4 |

|2. Nitrogen metabolism |mll1423 mll4247 mlr1320 |GAGCAC |0.004149 |0.000435 |7 |1.66 |3 |

|3. Glutamic acid family |mll3030* mll1646* mll0343* mll0601 |AAACGT |0.004039 |0.000295 |5 |1.12 |4 |

| |mll3030* mll1560 mll1646* |ATAAAA |0.00485 |0.000257 |6 |2.70 |3 |

| |mll3040 mll3074 mll7254* |GGGATA |0.004124 |0.000286 |4 |0.42 |3 |

| |mll9226 mlr0039* mlr0339 mlr1730* mlr3506 mlr6209* |GGCAGA |0.003269 |0.000654 |8 |0.32 |6 |

|4. Nitrogen fixation |mll3694* mll4698* |AATTCG |0.003812 |0.000667 |7 |0.28 |2 |

| |mlr3659* mlr5871 mlr5906 mlr5907 |CAGGGA |0.003193 |0.000497 |6 |0.10 |4 |

| |msl6623* msr6418 |CGATCG |0.008480 |0.000812 |6 |1.39 |2 |

Abbreviations: ms = number of matching sequences, i.e. the number of sequences from the family that contain at least one occurrence of the pattern; occ = number of occurrences of the pattern among all upstream regions from the family; sig = significance coefficient or index value. An asterisk (*) marks a gene with a known TF/binding site in its upstream sequence.
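The occ and ms counts defined in the abbreviations can be illustrated with a short Python sketch. This is an assumption-laden illustration, not the program used for the analysis: it counts overlapping matches, scans both strands by also matching the reverse complement, and counts a palindromic window only once. The function names and the toy sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))

def count_pattern(pattern, upstream_by_gene):
    """Count occurrences (occ) and matching sequences (ms) of a pattern.

    upstream_by_gene maps a gene ID to its upstream sequence.
    A window is counted when it equals the pattern (direct strand)
    or its reverse complement (reverse strand).
    """
    pattern = pattern.upper()
    rc = reverse_complement(pattern)
    k = len(pattern)
    occ = 0
    genes = []
    for gene, seq in upstream_by_gene.items():
        seq = seq.upper()
        hits = 0
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # A single test per window avoids double counting palindromes.
            if word == pattern or word == rc:
                hits += 1
        if hits:
            genes.append(gene)
            occ += hits
    return {"occ": occ, "ms": len(genes), "genes": genes}

# Toy input (made-up sequences, not M. loti upstream regions).
toy = {"g1": "AAGAGCACTT", "g2": "CCGTGCTCAA", "g3": "ACGTACGT"}
print(count_pattern("GAGCAC", toy))
```

Whether overlapping occurrences should be counted is a design choice; a stricter count would skip ahead by the pattern length after each match.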

Table 13. Details of M. loti hexanucleotides statistical data resulted in all functional category genes sets.

|S.No. |

|1. |

|5. |

|8. |

|12. |gagcac |Gagcac/gtgctc |0.004149 |0.0004352 |

|S.No. |TF/Site |Pattern sequence |Ms |Occ |Exp |Sig |Consensus |Bound factor or site |N2-fixation |Symbiosis |N2-met |Glutamate |

|1. |CAP/CRP |aattcg |2 |7 |1.20 |0.28 |ACACTTT |CAP/CRP-Lac |mll4698 |-- |-- |-- |

|2. |MalT |ccccac |1 |5 |0.46 |0.62 |GGAKGA |Malt_Cs |-- |mll4979 |-- |-- |

| | |ggcaga |6 |8 |1.57 |0.32 | | |-- |-- |-- |mlr0039 |

| | |aattcg |2 |7 |1.20 |0.28 | | |mll3694 |-- |-- |-- |

| | |attacc |3 |4 |0.32 |0.15 | | |-- |mll9683 |-- |-- |

| | |ataaaa |3 |6 |0.31 |2.70 | | |-- |-- |-- |mll3030 |

| | |aaacgt |4 |5 |0.36 |1.12 | | |-- |-- |-- | |

| | |aaacgt |4 |5 |0.36 |1.12 | | |-- |-- |-- |mll0343 |

|3. |PhoP |ataaaa |3 |6 |0.31 |2.70 |TTHACA |PhoP |-- |-- |-- |mll3030 |

| | |aaacgt |4 |5 |0.36 |1.12 | | |-- |-- |-- | |

| | |cgatcg |2 |6 |0.53 |1.39 | | |msl6623 |-- |-- |-- |

| | |ggcaga |6 |8 |1.57 |0.32 | | |-- |-- |-- |mlr6209 |

| | |caggga |4 |6 |0.92 |0.10 | | |mlr3659 |-- |-- |-- |

|4. |ExsA |ataaaa |3 |6 |0.31 |2.70 |TNAAAANA |ExaA_Cs_(1), ExaA_Cs_(2) |-- |-- |-- |mll1646 |

| | |aaacgt |4 |5 |0.36 |1.12 | | |-- |-- |-- | |

| | |aaacgt |4 |5 |0.36 |1.12 | | |-- |-- |-- |mll0343 |

| | |gggata |3 |4 |0.27 |0.42 | | |-- |-- |-- |mll7254 |

| | |ggcaga |6 |8 |1.57 |0.32 | | |-- |-- |-- |mlr6209 |

|5. |MalT_malPp* |ataaaa |3 |6 |0.31 |2.70 |TCCTCC |MalT_malPp site* |-- |-- |-- |mll3030 |

| | |aaacgt |4 |5 |0.36 |1.12 | | |-- |-- |-- | |

| | |aaacgt |4 |5 |0.36 |1.12 | | |-- |-- |-- |mll0343 |

|6. |MomR/oxyR |agcttg |4 |7 |1.24 |0.19 |ATGCATCRW |MomR/oxyR |-- |mlr6175 |-- |-- |

|7. |Nitrogen_reg* |ggcaga |6 |8 |1.57 |0.32 |TTTTGCA |Nitrogen_reg site* |-- |-- |-- |mlr1730 |

|8. |Lambda* |attacc |3 |4 |0.32 |0.15 |GGYGTRYG |Lambda_C site* |-- |mll9683 |-- |-- |

Abbreviations: ms = number of matching sequences, i.e. the number of sequences from the family that contain at least one occurrence of the pattern; occ = number of occurrences of the pattern among all upstream regions from the family; exp = expected number of occurrences; sig = significance index or coefficient value, calculated as defined in Methodology. An asterisk (*) marks a known site found through the ooTFD database.

-----------------------

Genes of the symbiosis functional category (Table 1):

1. mll1026: rhizobiocin secretion protein; RspE

2. mll1027: rhizobiocin secretion protein; RspD

3. mll1107: outer membrane protein, NodT candidate

4. mll1143: nodulation protein N

5. mll1266: nodulation protein; NoeC

6. mll2768: acetyltransferase, nodulation protein; NodL

7. mll3788: weak similarity to NodH

8. mll4296: ferric leghemoglobin reductase-2 precursor, dihydrolipoamide dehydrogenase

9. mll4680: glycosyltransferase, contains similarity to NolL

10. mll4979: similar to MocC (rhizopine catabolism), also similar to myo-inositol catabolism; IolE

11. mll5320: virulence factor MviN-like protein

12. mll5661: nodulation protein; NoeI

13. mll5922: GDP-D-mannose dehydratase; nodulation protein; NoeL

14. mll6337: nodulation protein; NolX

15. mll6338: nodulation protein; NolW

16. mll6943: aquaporin, nodulin-like intrinsic protein

17. mll7567: nodulation protein NoeK, phosphomannomutase

18. mll9170: virulence factor SrfC homolog

19. mll9171: virulence factor SrfB homolog

20. mll9589: AtsE

21. mll9683: AtsE

22. mlr0024: HesB-like protein

23. mlr1006: rhizopine catabolism protein; ModC

24. mlr2192: O-acetyltransferase, NodL candidate

25. mlr2437: rhizopine catabolism protein; MocC

26. mlr3097: nodulation protein NodN-Rhizobium leguminosarum

27. mlr3249: glycosyltransferase; RedB [Sinorhizobium meliloti] megaplasmid 2

28. mlr4951: NodF

29. mlr4953: NodE

30. mlr5801: phosphomannomutase; NoeK

31. mlr5802: phosphomannose isomerase/GDP-mannose pyrophosphorylase; NoeJ

32. mlr5821: nodulation protein; NodF

33. mlr5822: nodulation protein; NodE

34. mlr5848: nodulation protein; NodZ

35. mlr5849: GDP-mannose 4,6-dehydratase; nodulation protein; NoeL

36. mlr6161: methyltransferase, nodulation protein; NodS

37. mlr6163: N-acetylglucosaminyltransferase, nodulation protein; NodC

38. mlr6164: nodulation ATP-binding protein; NodI

39. mlr6166: nodulation protein; NodJ

40. mlr6171: nodulation protein; NolO

41. mlr6175: chitooligosaccharide deacetylase, nodulation protein; NodB

42. mlr6339: nodulation protein; NolT

43. mlr6341: nodulation protein; NolV

44. mlr6386: glutamine-fructose-6-phosphate transaminase nodulation protein; NodM

45. mlr6622: similar to nodulin 21

46. mlr7028: opine oxidase subunit A

47. mlr7400: bacteroid development protein; BacA

48. mlr7575: nodulation protein NodP, sulfate adenylate transferase, subunit 2

49. mlr7576: nodulation protein NodQ, sulfate adenylate transferase, subunit 1

50. mlr7780: similar to Rhizopine catabolism protein mocD

51. mlr7850: nodulation protein nodG, 3-oxoacyl-(acyl carrier protein) reductase

52. mlr8749: GDP-L-fucose synthetase; nodulation protein; NolK

53. mlr8755: acyltransferase, nodulation protein; NodA

54. mlr8757: acetyltransferase, nodulation protein; NolL

55. mlr8764: nodulation protein; NolU

56. mlr9393: nodulation protein; NoeC

57. msr3202: integration host factor beta chain

Genes of the nitrogen metabolism functional category (Table 2):

1. mll0345: nitrogen regulatory protein P-II

2. mll1423: nitrile hydratase beta subunit

3. mll1425: nitrile hydratase alpha subunit

4. mll1732: naphthalene dioxygenase ferredoxin

5. mll3450: similar to nitrilase, nitrilase 1 like protein

6. mll4247: nitrogen regulatory protein P-II; GlnK

7. mll5100: Ferredoxin [2Fe-2S] I

8. mll6776: ornithine cyclodeaminase

9. mlr1320: nitrate/nitrite regulatory protein

10. mlr1729: ornithine cyclodeaminase

11. mlr2282: ornithine cyclodeaminase

12. mlr2862: nitrite reductase large subunit

13. mlr2863: nitrite reductase small subunit

14. mlr3204: ornithine cyclodeaminase; Ocd2

15. mlr3855: ferredoxin II

16. mlr4999: putative Rieske-like ferredoxin; MocE

17. mlr5869: ferredoxin 2[4Fe-4S] III; FdxB

18. mlr5930: ferredoxin 2[4Fe-4S] III; FdxB

19. mlr7139: ornithine cyclodeaminase

20. mlr7628: putative Rieske-like ferredoxin; MocE

21. msl0793: ferredoxin

22. msl8750: ferredoxin 2[4Fe-4S]; FdxN

23. msr9193: probable ferredoxin

Genes of the glutamate functional category (Table 3):

1. mll0151: histidine ammonia-lyase

2. mll0343: glutamine synthetase I

3. mll0601: proline iminopeptidase

4. mll1160: proline dehydrogenase

5. mll1557: UDP-N-acetylmuramoylalanine-D-glutamate ligase

6. mll1560: UDP-N-acetylmuramoylalanyl-D-glutamate-2,6-diaminopimelate ligase

7. mll1631: N-carbamoyl-beta-alanine amidohydrolase

8. mll1646: glutamate synthase beta subunit

9. mll3029: glutamate synthase, small subunit

10. mll3030: glutamate synthase, large subunit

11. mll3040: N-formylglutamate amidohydrolase

12. mll3074: glutamine synthetase

13. mll3461: glutamate N-acetyltransferase/amino-acid acetyltransferase

14. mll4011: glutamate 5-kinase

15. mll4187: glutamine synthetase

16. mll5148: probable glutamine synthetase

17. mll6498: pyrroline-5-carboxylate reductase

18. mll6521: glutamine synthetase

19. mll7124: histidine ammonia-lyase

20. mll7254: glutamine synthetase

21. mll7307: glutamine synthetase III

22. mll7308: glutamate synthase large subunit

23. mll9226: argininosuccinate lyase

24. mlr0039: glutamate racemase

25. mlr0339: glutamine synthetase II

26. mlr1730: histidine ammonia-lyase

27. mlr3506: argininosuccinate lyase

28. mlr4366: argininosuccinate synthase

29. mlr4826: acetylglutamate kinase (EC 2.7.2.8)

30. mlr6209: histidine decarboxylase

31. mlr6210: glutamine synthetase III

32. mlr6298: gamma-glutamyl kinase

33. mlr7698: glutamate-ammonia-ligase; adenylyltransferase

34. mlr8321: histidine ammonia-lyase; HutH

35. mlr8322: N-formylglutamate amidohydrolase; HutG

Genes of the nitrogen fixation functional category (Table 4):

1. mll1670: NtrR

2. mll1671: NtrP

3. mll3694: transcriptional regulator, similar to FixK-Bradyrhizobium japonicum

4. mll4698: two-component system histidine protein kinase (FixL like)

5. mll5421: aminotransferase; NifS

6. mll5837: Nif-specific regulatory protein; NifA

7. mll5855: nitrogen fixation protein; NifB

8. mll5857: nif-specific regulatory protein; NifA

9. mll5860: nitrogen fixation protein; FixC

10. mll5861: nitrogen fixation protein; FixB

11. mll5862: nitrogen fixation protein; FixA

12. mll5864: nitrogenase stabilizer; NifW

13. mll5865: nitrogenase cofactor synthesis protein; NifS

14. mll5941: nitrogen fixation protein; NifU

15. mll6578: nitrogen fixation regulation protein; FixK

16. mll6606: two-component, nitrogen fixation regulatory protein; FixJ

17. mll6607: two-component, nitrogen fixation sensor protein; FixL

18. mll6624: nitrogen fixation protein; FixI

19. mll6625: nitrogen fixation protein; FixH

20. mll6626: nitrogen fixation protein; FixG

21. mll6628: cytochrome-c oxidase FixP chain

22. mll6629: cytochrome-c oxidase FixO chain

23. mll6630: cytochrome-c oxidase FixN chain

24. mll8252: nitrogen fixation protein gene

25. mlr0015: nitrogenase cofactor synthesis protein; NifS

26. mlr0021: NifS-like aminotransferase

27. mlr0396: nitrogen regulation protein; NifR3

28. mlr0397: nitrogen regulation protein; NirB

29. mlr0398: nitrogen assimilation regulatory protein; NtrC

30. mlr0399: nitrogen regulation protein; NtrY

31. mlr0400: nitrogen assimilation regulatory protein; NtrX

32. mlr2864: nitrate reductase large subunit

33. mlr3659: histidine protein kinase, similar to FixL

34. mlr5785: Nif-specific regulatory protein; NifA

35. mlr5871: nitrogen fixation protein; NifQ

36. mlr5905: nitrogenase iron protein; NifH

37. mlr5906: nitrogenase molybdenum-iron protein alpha chain; NifD

38. mlr5907: nitrogenase molybdenum-iron protein beta chain; NifK

39. mlr5908: nitrogenase molybdenum-cofactor synthesis protein; NifE

40. mlr5909: nitrogenase molybdenum-iron protein beta chain; NifN

41. mlr5911: nitrogenase molybdenum-iron protein; NifX

42. mlr6097: nitrogen assimilation control protein

43. mlr6411: cytochrome-c oxidase FixN chain

44. mlr6412: cytochrome-c oxidase FixO chain

45. mlr6414: cytochrome-c oxidase FixP chain

46. mlr6415: nitrogen fixation protein; FixG

47. mlr6416: nitrogen fixation protein; FixH

48. mlr6417: nitrogen fixation protein; FixI

49. mlr7805: homocitrate synthase NifV candidate

50. msl5859: ferredoxin like protein; FixX

51. msl6623: nitrogen fixation protein; FixS

52. msl6627: nitrogen fixation protein; FixQ

53. msr6413: cytochrome-c oxidase FixQ chain

54. msr6418: nitrogen fixation protein; FixS

$$\mathrm{Sig}_{occ} = -\log_{10}(E\text{-value})$$

$$E\text{-value} = NPO \cdot P(\geq obs)$$

$$P(\geq obs) = \sum_{i=obs}^{T} P(i) = 1 - \sum_{i=0}^{obs-1} P(i)$$

$$P(obs) = \mathrm{bin}(p, T, obs) = \frac{T!}{obs!\,(T-obs)!}\; p^{obs}\,(1-p)^{T-obs}$$

$$Exp_{ms\_corrected} = n\left(1 - (1 - p/a)^{T}\right)$$

$$Exp_{ms} = n\left(1 - (1 - p)^{T}\right)$$

$$q = 1 - (1-p)^{T}$$

$$Exp_{occ} = p \cdot T = p \sum_{j=1}^{S} (L_j + 1 - k)$$
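The occurrence-based formulas above can be traced numerically with a small Python sketch. It is an illustration under stated assumptions, not the software used in the study: the function names are hypothetical, p is taken as the prior probability of the pattern at one position, T as the total number of possible positions, and NPO as the number of patterns tested (its exact value depends on how reverse complements are grouped, so it is passed in as a parameter). The binomial terms are evaluated in log space to avoid overflow for large T.

```python
import math

def expected_occurrences(p, seq_lengths, k=6):
    """Exp_occ = p * T, with T = sum over sequences of (L_j + 1 - k)."""
    T = sum(L + 1 - k for L in seq_lengths)
    return p * T, T

def binomial_pmf(p, T, i):
    """bin(p, T, i), computed in log space; requires 0 < p < 1."""
    log_coeff = math.lgamma(T + 1) - math.lgamma(i + 1) - math.lgamma(T - i + 1)
    return math.exp(log_coeff + i * math.log(p) + (T - i) * math.log1p(-p))

def binomial_tail(p, T, obs):
    """P(>= obs): sum of bin(p, T, i) for i = obs .. T."""
    return sum(binomial_pmf(p, T, i) for i in range(obs, T + 1))

def sig_occ(p, T, obs, npo):
    """Sig_occ = -log10(E-value), with E-value = NPO * P(>= obs)."""
    e_value = npo * binomial_tail(p, T, obs)
    return -math.log10(e_value)

# Toy numbers (not taken from the tables): two 400 bp upstream regions.
exp_occ, T = expected_occurrences(0.001, [400, 400], k=6)
print(T, exp_occ, sig_occ(0.001, T, 5, npo=4096))
```

A positive significance value means the E-value is below 1, i.e. fewer than one pattern would be expected to reach that count by chance under the binomial model.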
