Clustering 16S rRNA for OTU prediction: A similarity-based method

*Corresponding author: mcan@ius.edu.ba
© The Author 2020. Published by ARDA.

Abstract
Next-generation sequencing (NGS)-based 16S rRNA sequencing, which combines PCR amplification with NGS technology, has been used successfully to study the phylogeny and taxonomy of samples from complex environments. The first step for many downstream analyses is clustering 16S rRNA sequences into operational taxonomic units (OTUs). Heuristic clustering, in which one or more seed sequences are selected to represent each cluster, is one of the most widely employed approaches for generating OTUs. In this work we choose five random seeds for each cluster from a gene library, and we present a novel distance measure to cluster bacteria in the sample. Artificially created sets of 16S rRNA genes selected from databases are successfully clustered with more than 98% accuracy, sensitivity, and specificity.


Introduction
Bacteria play an important role in human health and disease [1]. In addition, they have an essential role in various biogeochemical activities. To understand the bacterial world around us, it is very important to characterize the taxonomic composition of the community in an environmental sample [2] [3]. The most widely used biomarkers for describing microbial communities are 16S rRNA (ribosomal RNA) marker genes generated by high-throughput sequencing technology [4]. Advanced sequencing technology can produce millions of 16S rRNA sequences, bypassing the necessity of isolating single organisms for cultivation, and has become a powerful tool for in-depth analysis of bacterial community composition [5], [6].
For rapid processing of 16S sequencing data, the first step is to cluster the sequences into OTUs [7] [8], which form the basis for estimating the species, diversity, composition, and richness of the microbes in the environment [9] [10]. There are two major approaches to binning 16S rRNA sequences: a. taxonomy dependent methods, where each query sequence is compared against a reference taxonomy database and assigned to the organism of the best-matched annotated sequence using sequence searching [11] or classification [12] [13], and b. taxonomy independent methods (also called de novo clustering) [14], where sequences are grouped into OTUs based on pairwise sequence similarities. However, the success of taxonomy dependent methods is limited by the completeness of reference databases [15], since a significant portion of the bacteria in a sample belong to unknown taxa that are not recorded in databases. In contrast, de novo clustering methods divide sequences into OTUs without needing any reference database and have become the preferred choice for researchers [16].
A wide variety of de novo clustering methods have been proposed for binning OTUs in the past decades. They can be categorized further into i) hierarchical clustering, ii) heuristic clustering, iii) model-based, and iv) network-based methods [17]. Hierarchical clustering methods like mothur [17], HPC-CLUST [19], ESPRIT [20], and mcClust [21] require a distance matrix. This matrix is computed from all sequence pairs after pairwise sequence alignment or multiple sequence alignment. Then a hierarchical tree is built, and sequences are assigned to OTUs using a predefined threshold.
On the other hand, network-based methods like M-pick [22] and DMclust [23] first compute all pairwise sequence distances to construct a fully connected graph and then generate OTUs by modularity-based community detection. Therefore, the computational complexity of both hierarchical and network-based methods is O(N^2), where N is the number of sequences [17] [23].
Model-based methods such as CROP [24] and BEBaC [25] mainly apply a statistical model, such as a Bayesian model, or a mathematical framework, such as a Gaussian mixture model, to describe sequence data. They then assign sequences to OTUs based on probability theory. However, they still carry a high computational burden [26]. For this reason, hierarchical, model-based, and network-based clustering methods quickly run into limits on computational time and memory usage when dealing with large-scale sequencing data [17].

Materials and methods
In this research work, we employ a novel taxonomy dependent method, where each query sequence is compared against the reference taxonomy databases Greengenes and SILVA and assigned to the organism of the best-matched sequence. 16S rRNA gene sequences from seven taxonomic classes in the Greengenes and SILVA 16S rRNA libraries are used to create the sample sets to be clustered. From each class at a given taxonomy level, a number of seeds are randomly selected. Using the Longest Common Subsequence Search method, the similarity of the query sequence to each seed sequence is calculated. If at least one of these similarities exceeds a certain threshold, the query is assigned to the cluster of those seeds.
The Longest Common Subsequence Search method helps us avoid computing lengthy pairwise or global sequence alignments.

Longest common subsequence search
To find the level of similarity of two gene sequences using the Longest Common Subsequence Search method, consider the two sequences (a) and (b) in the figure. The longest common subsequence of (a) and (b) is

GTGTAGAGGTGAAATG
We then remove this common subsequence from both sequences and look for the next longest common substring. If no longer one remains, the string TAGAT, for example, may be the second longest common subsequence. Ten iterations of this process were found to be optimal. Finally, we add the lengths of these common substrings and normalize the sum by dividing it by the length of the shorter gene.
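The iterative procedure described above can be sketched in Python as follows. This is a minimal sketch under assumptions that the text does not state: common segments are treated as contiguous substrings, each removed segment is replaced by a placeholder character (`#` in one sequence, `$` in the other) so that no later match can span a removal point, and the function names and the `min_len` cutoff are hypothetical.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b.

    Dynamic programming over suffix-match lengths, O(len(a) * len(b)) time.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def lcss_similarity(a: str, b: str, iterations: int = 10, min_len: int = 1) -> float:
    """Sum the lengths of up to `iterations` common substrings, found and
    removed one at a time, and divide by the length of the shorter input."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    total = 0
    for _ in range(iterations):
        common = longest_common_substring(a, b)
        if len(common) < min_len:
            break  # nothing left to match
        total += len(common)
        # Assumption: mask the matched segment with different placeholders
        # in each sequence so the placeholders can never match each other.
        a = a.replace(common, "#", 1)
        b = b.replace(common, "$", 1)
    return total / shorter


if __name__ == "__main__":
    a = "GTGTAGAGGTGAAATGCC"
    b = "TTGTGTAGAGGTGAAATG"
    print(longest_common_substring(a, b))
    print(round(lcss_similarity(a, b), 3))
```

Simply deleting the matched segment instead of masking it would also follow the description, but it could let the two flanking regions join and create matches that were not present in the original genes.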

Inclass and interclass similarities
The average inclass and interclass similarities are compared using data from the high-quality ribosomal RNA databases Greengenes, SILVA, and RDP. There is a significant difference between inclass and interclass similarities at three important taxon levels. This observation shows that the longest common subsequence similarity measure can be used for both annotation and clustering of unknown samples [27].
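The comparison of inclass and interclass averages can be illustrated with a short Python sketch. It assumes the sequences are already grouped by class label in a dictionary and that a similarity function is supplied by the caller; the function name `in_and_inter_class_means` and the input layout are hypothetical, not taken from the paper.

```python
from itertools import combinations
from statistics import mean


def in_and_inter_class_means(classes, similarity):
    """Average similarity over same-class pairs and over different-class pairs.

    classes: dict mapping a class label to a list of sequences (assumed layout).
    similarity: callable taking two sequences and returning a number.
    Returns (inclass_mean, interclass_mean); an entry is None when no such pair exists.
    """
    labelled = [(label, seq) for label, seqs in classes.items() for seq in seqs]
    in_scores, inter_scores = [], []
    for (label_a, a), (label_b, b) in combinations(labelled, 2):
        score = similarity(a, b)
        if label_a == label_b:
            in_scores.append(score)
        else:
            inter_scores.append(score)
    in_mean = mean(in_scores) if in_scores else None
    inter_mean = mean(inter_scores) if inter_scores else None
    return in_mean, inter_mean
```

A clear gap between the two returned averages is what would justify using a single similarity threshold to separate classes.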

Results
Three 16S rRNA libraries, Greengenes with 198,510 genes, RDP with 801,984 genes, and SILVA with 1,820,420 genes, are used to show the accuracy, sensitivity, and specificity of the LCSS clustering technique.
At each taxonomic level, 50 genes are selected from each of 20 classes. These 1000 genes are then shuffled. From each class, five seeds are randomly selected. Then the Longest Common Subsequence similarities of the seeds to a sample gene (query) are calculated. If any of the five seeds is similar to the query gene beyond a threshold, the query is put in the same cluster as these seeds.
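The seed-based assignment step can be sketched in Python as below. This is an illustrative sketch, not the authors' implementation: the threshold value is left to the caller, `stand_in_similarity` is a simplified placeholder (a single longest common block, computed with the standard library's `difflib`) rather than the full ten-iteration measure, and the rule of choosing the best-scoring cluster when several clusters pass the threshold is an assumption.

```python
import random
from difflib import SequenceMatcher


def stand_in_similarity(a: str, b: str) -> float:
    """Placeholder measure: longest common block over the shorter length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))


def pick_seeds(classes, n_seeds=5, rng=None):
    """Randomly draw up to n_seeds sequences from each class."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    return {
        label: rng.sample(seqs, min(n_seeds, len(seqs)))
        for label, seqs in classes.items()
    }


def assign(query, seeds, threshold, similarity=stand_in_similarity):
    """Return the label of the cluster whose best seed similarity exceeds
    the threshold, or None when no seed passes it.

    Assumption: if several clusters pass, the highest-scoring one wins.
    """
    best_label, best_score = None, threshold
    for label, seed_list in seeds.items():
        score = max((similarity(query, s) for s in seed_list), default=0.0)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because only the seeds of each class are compared with a query, the number of similarity computations grows with the number of queries times the number of seeds rather than with all sequence pairs.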
Using this technique, the 1000 genes are clustered with the accuracy, sensitivity, and specificity given in Table 4 for all taxonomic classes.
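For reference, the three reported measures can be computed from true and predicted class labels as in the following Python sketch. It uses the standard one-vs-rest definitions for a single target class; how the paper aggregates the values across classes is not stated, so the function name and the per-class treatment here are assumptions.

```python
def one_vs_rest_metrics(true_labels, predicted_labels, target):
    """Accuracy, sensitivity, and specificity for one class against all others."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == target and pred == target:
            tp += 1  # member of the class, assigned to it
        elif truth == target:
            fn += 1  # member of the class, assigned elsewhere or left unassigned
        elif pred == target:
            fp += 1  # non-member wrongly assigned to the class
        else:
            tn += 1  # non-member correctly kept out
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }
```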

Conclusion
High-throughput 16S rRNA sequencing has become a powerful and convenient technology for studying microbial diversity and composition in environmental samples. Numerous heuristic clustering methods have been developed to pick OTUs, but most of them select only one sequence as the cluster seed, which leads to overestimation of OTUs and sensitivity to sequencing errors. In this work, we proposed a novel similarity clustering method (namely LCSSM).