Bioinformatic Analysis of Codon Usage and Phylogenetic Relationships in Different Genotypes of the Hepatitis C Virus


Mojtaba Mortazavi 1 , Mohammad Zarenezhad 2 , 3 , Seyed Moayed Alavian 4 , Saeed Gholamzadeh 3 , ** , Abdorrasoul Malekpour 3 , ** , Mohammad Ghorbani 5 , Masoud Torkzadeh Mahani 1 , Safa Lotfi 1 , Ali Fakhrzad 2

1 Department of Biotechnology, Institute of Science and High Technology and Environmental Science, Graduate University of Advanced Technology, Kerman, IR Iran

2 Gastroentrohepatology Research Center, Shiraz University of Medical Sciences, Shiraz, IR Iran

3 Legal Medicine Research Center, Legal Medicine Organization of Iran, Tehran, IR Iran

4 Baqiyatallah Research Center for Gastroenterology and Liver Disease, Baqiyatallah University of Medical Sciences, Tehran, IR Iran

5 Department of Pathology, School of Medicine, Fasa University of Medical Sciences, Fasa, IR Iran


How to Cite: Mortazavi M, Zarenezhad M, Alavian S M, Gholamzadeh S, Malekpour A, et al. Bioinformatic Analysis of Codon Usage and Phylogenetic Relationships in Different Genotypes of the Hepatitis C Virus, Hepat Mon. 2016 ; 16(10):e39196. doi: 10.5812/hepatmon.39196.


Hepatitis Monthly: 16 (10); e39196
Published Online: September 10, 2016
Article Type: Research Article
Received: May 14, 2016
Revised: July 16, 2016
Accepted: August 31, 2016




Background: The hepatitis C virus (HCV) has six major genotypes. The purpose of this study was to phylogenetically investigate the differences between the genotypes of HCV, and to determine the types of amino acid codon usage in the structure of the virus in order to discover new methods for treatment regimes.

Methods: The codon usage of the six genotypes of the HCV nucleotide sequence was investigated through the online application available on the website Gene Infinity. Also, phylogenetic analysis and the evolutionary relationship of HCV genotypes were analyzed with MEGA 7 software.

Results: The six genotypes of HCV were divided into two groups based on their codon usage properties. In the first group, genotypes 1 and 5 (74.02%), and in the second group, genotypes 2 and 6 (72.43%) were shown to have the most similarity in terms of codon usage. Unlike the results with respect to determining the similarity of codon usage, the phylogenetic analysis showed the closest resemblance and correlation between genotypes 1 and 4. The results also showed that HCV has a GC (guanine-cytosine) abundant genome structure and prefers codons with GC for translation.

Conclusions: Genotypes 1 and 4 demonstrated remarkable similarity in terms of genome sequences and proteins, but surprisingly, in terms of the preferred codons for gene expression, they showed the greatest difference. More studies are therefore needed to confirm the results and select the best approach for treatment of these genotypes based on their codon usage properties.


Keywords: Hepatitis C Virus, Codon Usage, Bioinformatic Study, Phylogenetic Analysis

Copyright © 2016, Kowsar Corp. This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License, which permits copying and redistributing the material for noncommercial uses only, provided the original work is properly cited.

1. Background

Several factors can cause hepatitis, including certain drugs, chemicals, and infectious agents (1). Among the infectious agents, several viruses are involved in the pathogenesis of hepatitis, namely hepatitis viruses A, B, C, D, and E (2). Of these, hepatitis B and C are considered more serious and can become chronic (3, 4). Hepatitis C is a viral infection, caused by the hepatitis C virus (HCV), that results in either acute or chronic liver inflammation (5). HCV belongs to the Flaviviridae family and the Hepacivirus genus, and has a single-stranded RNA (ribonucleic acid) genome (6). It is one of the most common causes of liver transplants in the world (7-9). In 70% of cases, the disease becomes chronic; spontaneous recovery may occur in 30% of cases (10). Annually, three to five million people are infected with the virus worldwide, and an estimated 170 million people are currently infected (5). Chronic infection with HCV causes deaths due to decompensated cirrhosis, end-stage liver disease, and hepatocellular carcinoma (11).

HCV has high molecular diversity, with six major genotypes (numbered 1 to 6) and over 70 subtypes (named a, b, c, and so on) (12). Therapeutic programs usually begin with rapid determination of the HCV genotype, because the genotype influences the duration of treatment and the likelihood of a sustained virological response (SVR) (13). In the genetic code, most amino acids are encoded by multiple (two to six) codons, which generally differ only at the third nucleotide of the codon (14, 15). Patterns of codon usage vary among species (16), and this variation has helped to identify some important facts about the virus. Although each codon specifies only one amino acid, a single amino acid may be encoded by more than one codon. Such groups of codons are known as synonymous codons (e.g., leucine has six synonymous codons). In total, 18 of the 20 amino acids can be encoded by more than one codon, mostly due to variation at the third nucleotide position within the codon. Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA (17). The study of codon usage can help clarify the evolution of a particular species (14). Studies have shown that synonymous codons are not used with equal frequency, and that codon usage differs between organisms and even between the genes of a single organism (18).
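The degeneracy described above can be checked directly against the standard genetic code. The following is a minimal sketch in Python; the names `CODON_TABLE` and `synonymous_codons` are illustrative and are not part of the original study.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons():
    """Group the 61 sense codons by the amino acid they encode."""
    groups = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # skip stop codons
            groups[aa].append(codon)
    return dict(groups)


groups = synonymous_codons()
degenerate = [aa for aa, codons in groups.items() if len(codons) > 1]
print(len(groups), "amino acids;", len(degenerate), "have more than one codon")
print("Leucine codons:", sorted(groups["L"]))
```

Only methionine (ATG) and tryptophan (TGG) are left with a single codon, which is why they carry no information about codon preference.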

As HCV exhibits high genetic diversity, this poses a challenge for the improvement of vaccines and pan-genotypic treatment methods (19). Multiple genotypes and subtypes of HCV have been identified via the analysis of nucleotide sequences (20). Characterization of these genetic properties and the possible differences between these genotypes is likely to facilitate and contribute to the development of effective prevention and treatment protocols against HCV infection (21). Previously, we were the first to have studied rare codon clusters (RCCs) and their locations in structures of HCV proteins (22).

2. Objectives

In this project, a bioinformatic study of the different HCV genotypes was conducted to examine the phylogenetic differences between them, as well as the codon usage for each amino acid in the proteins of the virus. It was hoped that the findings could help in choosing more precise and effective treatment regimens.

3. Methods

3.1. HCV Genome Sequences

For the bioinformatic analysis, the nucleotide sequences and features of the six genotypes of HCV were obtained from the following website : (Table 1).

Table 1. Genetic Properties of HCV Genotypes
Locus | Version | Db_Xref | Protein ID | Db_Xref
NC_004102, 9646 bp ss-RNA linear, VRL 17-JUN-2016 | NC_004102.1, GI:22129792 | Taxon:11103, GeneID:951475 | NP_671491.1 | GI:22129793, GeneID:951475
NC_009823, 9711 bp RNA linear, VRL 26-JUL-2011 | NC_009823.1, GI:157781212 | Taxon:40271, GeneID:11027172 | YP_001469630.1 | GI:157781213, GeneID:11027172
NC_009824, 9456 bp RNA linear, VRL 27-JUL-2011 | NC_009824.1, GI:157781216 | Taxon:356114, GeneID:11027185 | YP_001469631.1 | GI:157781217, GeneID:11027185
NC_009825, 9355 bp RNA linear, VRL 26-JUL-2011 | NC_009825.1, GI:157781208 | Taxon:33745, GeneID:11027168 | YP_001469632.1 | GI:157781209, GeneID:11027168
NC_009826, 9343 bp RNA linear, VRL 26-JUL-2011 | NC_009826.1, GI:157781210 | Taxon:33746, GeneID:11027170 | YP_001469633.1 | GI:157781211, GeneID:11027170
NC_009827, 9628 bp RNA linear, VRL 26-JUL-2011 | NC_009827.1, GI:157781214 | Taxon:42182, GeneID:11027174 | YP_001469634.1 | GI:157781215, GeneID:11027174

3.2. Analysis of Codon Usage

In the next step, the frequency, number, and fraction of each of the 61 codons were evaluated for each amino acid within the HCV proteins, and the preferred codons were extracted using the tools provided on the Gene Infinity website (23) (Table 3).
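As an illustration of the three quantities named above, the sketch below counts codons in a coding sequence. It assumes that "number" is the raw count of a codon, "frequency" is the count per thousand codons, and "fraction" is the share of a codon among the synonymous codons of its amino acid. The function `codon_usage` and the toy sequence are hypothetical; this is not the code behind the Gene Infinity tool.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_usage(cds):
    """Return {codon: (amino_acid, number, per_thousand, fraction)}.

    `cds` is an in-frame coding sequence; U is treated as T, and any
    trailing partial codon or codon with ambiguous bases is ignored.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)
    total = sum(counts.values())

    # Total count of each amino acid, needed for the synonymous fraction.
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[CODON_TABLE[codon]] += n

    usage = {}
    for codon, aa in CODON_TABLE.items():
        n = counts[codon]
        per_thousand = 1000 * n / total if total else 0.0
        fraction = n / aa_totals[aa] if aa_totals[aa] else 0.0
        usage[codon] = (aa, n, per_thousand, fraction)
    return usage


# Toy sequence (not HCV data): Met, Ala, Ala, Ala, stop.
usage = codon_usage("ATGGCCGCCGCATGA")
print(usage["GCC"])
print(usage["GCA"])
```

In the toy sequence, GCC accounts for two of the three alanine codons, so it would be reported as the preferred alanine codon.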

Table 2. The Nucleotide Compositional Properties of the Six HCV Genotypes
%G1 + C1 | 57.39 | 55.75 | 56.60 | 56.07 | 56.47 | 55.81
%G1 + A1 | 57.62 | 57.96 | 56.47 | 58.06 | 57.60 | 57.80
%G1 + T1 | 51.94 | 53.02 | 52.86 | 52.78 | 51.96 | 52.10
%A1 + T1 | 42.61 | 44.25 | 43.40 | 43.93 | 43.53 | 44.19
%A1 + C1 | 48.06 | 46.98 | 47.14 | 47.22 | 48.04 | 47.90
%C1 + T1 | 42.38 | 42.04 | 43.53 | 41.94 | 42.40 | 42.20
%G2 + C2 | 50.61 | 50.35 | 50.45 | 49.29 | 49.70 | 50.15
%G2 + A2 | 44.54 | 43.52 | 44.65 | 43.80 | 44.72 | 44.05
%G2 + T2 | 49.62 | 48.60 | 48.59 | 48.35 | 48.64 | 48.79
%A2 + T2 | 49.39 | 49.65 | 49.55 | 50.71 | 50.30 | 49.85
%A2 + C2 | 50.38 | 51.40 | 51.41 | 51.65 | 51.36 | 51.21
%C2 + T2 | 55.46 | 56.48 | 55.35 | 56.20 | 55.28 | 55.95
%G3 + C3 | 68.58 | 66.24 | 59.91 | 63.05 | 64.76 | 61.08
%G3 + A3 | 43.08 | 44.21 | 44.36 | 44.60 | 44.16 | 45.28
%G3 + T3 | 47.86 | 47.25 | 49.55 | 47.09 | 48.74 | 48.56
%A3 + T3 | 31.42 | 33.76 | 40.09 | 36.95 | 35.24 | 38.92
%A3 + C3 | 52.14 | 52.75 | 50.45 | 52.91 | 51.26 | 51.44
%C3 + T3 | 56.92 | 55.79 | 55.64 | 55.40 | 55.84 | 54.72
%G3s + C3s | 67.20 | 64.60 | 58.08 | 61.48 | 63.29 | 59.34

Phylogenetic analysis and the evolutionary relationships of the HCV genotypes were evaluated using MEGA 7 software (24). The deduced amino acid sequences, from the collected samples and from GenBank, were analyzed by constructing a maximum parsimony phylogenetic tree in MEGA 7. The frequencies of the codons used were reported as descriptive statistics. Minitab version 16.0 was used for statistical analysis (24).
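To make the maximum parsimony criterion concrete, the sketch below scores two candidate tree shapes with the Fitch algorithm, which counts the minimum number of character changes a fixed binary tree requires. It is only an illustration of the criterion: it does not reproduce the tree search performed by MEGA 7, and the sequences, tree shapes, and function names are invented for the example.

```python
def fitch(tree):
    """Fitch pass for a single alignment column.

    `tree` is either a one-character state (a leaf) or a 2-tuple of
    subtrees. Returns (possible ancestral states, minimum changes).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_states, left_changes = fitch(tree[0])
    right_states, right_changes = fitch(tree[1])
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra change.
        return shared, left_changes + right_changes
    # The children disagree: one change is needed on this branch pair.
    return left_states | right_states, left_changes + right_changes + 1


def site_tree(tree, seqs, i):
    """Replace each leaf name with that sequence's state at column i."""
    if isinstance(tree, str):
        return seqs[tree][i]
    return (site_tree(tree[0], seqs, i), site_tree(tree[1], seqs, i))


def parsimony_score(tree, seqs):
    """Sum the Fitch change counts over all columns of the alignment."""
    length = len(next(iter(seqs.values())))
    return sum(fitch(site_tree(tree, seqs, i))[1] for i in range(length))


# Toy aligned sequences (not HCV data).
seqs = {"A": "AAG", "B": "AAG", "C": "ATG", "D": "TTG"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, seqs), parsimony_score(tree_2, seqs))
```

Under this criterion the tree that needs fewer changes is preferred, so `tree_1` would be chosen over `tree_2` for the toy data.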

3.3. Compositional Properties Measures

To examine the compositional properties of the six HCV sequences, GC1s,2s,3s, GA1s,2s,3s, GT1s,2s,3s, AT1s,2s,3s, AC1s,2s,3s, and CT1s,2s,3s (the frequencies of the nucleotide pairs G+C, G+A, G+T, A+T, A+C, and C+T at the first, second, and third codon positions) were calculated within each open reading frame (ORF) using the CAIcal web server (25).
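The positional measures can be sketched as follows. This is a simplified illustration, not the CAIcal implementation: it computes the plain percentage of each nucleotide pair at each codon position over all codons of the ORF, and it does not apply any restriction to synonymous positions that the "3s" variant may involve. The function name and the toy sequence are assumptions.

```python
def positional_composition(cds):
    """Return percentages such as {"GC1": ..., "GC2": ..., "GC3": ...}.

    For each codon position (1, 2, 3) and each nucleotide pair, the value
    is the percentage of codons whose base at that position belongs to
    the pair. `cds` is an in-frame coding sequence; U is treated as T.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    pairs = ("GC", "GA", "GT", "AT", "AC", "CT")
    result = {}
    for pos in range(3):
        bases = cds[pos:usable:3]  # every base at this codon position
        n = len(bases)
        for pair in pairs:
            hits = sum(bases.count(b) for b in pair)
            result[f"{pair}{pos + 1}"] = 100 * hits / n if n else 0.0
    return result


# Toy ORF (not HCV data): ATG GCC GCG TAA.
comp = positional_composition("ATGGCCGCGTAA")
print(comp["GC1"], comp["GC2"], comp["GC3"], comp["AT3"])
```

Because every unambiguous base is either G/C or A/T, the GC and AT values at a given position sum to 100, which matches the pattern in the table (for example, 57.39 and 42.61 at the first position of the first sequence).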

4. Results

4.1. Cluster Codon Analysis

The cluster codon analysis showed that the preferred codons of all amino acids ended in C or G. For example, for alanine (Ala), glycine (Gly), and valine (Val), which each have four codons, and for tyrosine (Tyr), which has two, the preferred codons had C or G as the terminal nucleotide. The analysis also showed that the genotypes fell into two groups that shared only 4% similarity: genotypes 1, 5, and 3 in one group, and genotypes 2, 6, and 4 in the other. In the first group, genotypes 1 and 5 had the highest similarity of codon usage (74.02%), and in the second group, genotypes 2 and 6 had the highest similarity (72.43%). The greatest difference in codon usage was detected between genotype 1 from the first group and genotype 4 from the second group, with 4% similarity in terms of preferred codons (Figure 1).

Figure 1. Similarity of Codon Usage Between HCV Genotypes

Phylogenetic analysis of the genotypes showed that the closest resemblance was between genotypes 1 and 4 (Figure 2). The proximity of genotypes 1 and 4 in the tree reflects similarity in their gene and protein sequences, yet the codon usage analysis showed that these two genotypes had the lowest similarity and the greatest distance. The phylogenetic analysis also indicated that genotypes 1 and 2 were separated by the greatest phylogenetic distance (Figure 2).

Figure 2. Molecular Evolution and Phylogenetic Diagram of HCV Genotypes

4.2. Compositional Properties of the Genomes in HCV Genotypes

The compositional properties of the genomes of the six HCV genotypes, calculated with the CAIcal web server, showed that the genotypes have similar contents of GC1s,2s,3s, GA1s,2s,3s, GT1s,2s,3s, AT1s,2s,3s, AC1s,2s,3s, and CT1s,2s,3s (Table 2). The frequency of GC1s,2s,3s was higher than that of the other nucleotide compositions, and the lowest frequency belonged to AT3s. These results show that HCV is a GC-rich virus.

Table 3. The Frequency, Number, and Fraction of Each of the 61 Codons for Each Amino Acid in the Protein Structure of HCV Genotypes
Terminal CodonTGA1.

4.3. Prevalence of Preferred (Used) Codons

Figure 3 shows the prevalence of the preferred (used) codons in the HCV genotypes, that is, which codon of each amino acid is used more than its synonymous alternatives. The most preferred codon for each amino acid was as follows: Ala (GCC), Cys (TGC), Asp (GAC), Glu (GAG), Phe (TTC), Gly (GGC), His (CAC), Ile (ATC), Lys (AAG), Leu (CTC), Asn (AAC), Pro (CCC), Gln (CAG), Arg (AGG), Ser (TCC), Thr (ACC), Val (GTG), Tyr (TAC), and the stop codon (TGA-TAG). The least preferred codon for each amino acid was as follows: Ala (GCA), Cys (TGT), Asp (GAT), Glu (GAA), Phe (TTT), Gly (GGA), His (CAT), Ile (ATT), Lys (AAA), Leu (TTA), Asn (AAT), Pro (CCG), Gln (CAA), Arg (CGA), Ser (AGT), Thr (ACG), Val (GTA), Tyr (TAT), and the stop codon (TAA; not used). Met (ATG) and Trp (TGG) each have a single codon. The cluster codon analysis also showed that, with the exception of Met, Trp, Thr, and Pro, the least used codons of all amino acids ended in A or T.

Figure 3. Frequency of Used Codons in HCV Genotypes
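Given a table of codon counts, the most and least preferred codon of each amino acid can be picked out as in the sketch below. The function `preferred_codons` and the counts are hypothetical and are not the study's data; ties are resolved by table order, which is an arbitrary choice made for the example.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def preferred_codons(counts):
    """Return {amino_acid: (most_used_codon, least_used_codon)}.

    `counts` maps codons to observed counts; missing codons count as 0.
    Amino acids with a single codon (Met, Trp) are skipped.
    """
    by_aa = defaultdict(dict)
    for codon, aa in CODON_TABLE.items():
        by_aa[aa][codon] = counts.get(codon, 0)

    result = {}
    for aa, table in by_aa.items():
        if len(table) < 2:
            continue  # no synonymous alternative to compare against
        most = max(table, key=table.get)
        least = min(table, key=table.get)
        result[aa] = (most, least)
    return result


# Invented counts for the four alanine codons.
toy_counts = {"GCC": 5, "GCG": 3, "GCT": 2, "GCA": 1}
print(preferred_codons(toy_counts)["A"])
```

With the toy counts, alanine is reported as preferring GCC and avoiding GCA, the same pair listed for alanine in the results above.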

5. Discussion

HCV is a leading cause of chronic liver disease (1, 2), with the possibility of leading to chronic hepatitis and eventually hepatocellular carcinoma (HCC) (26). In addition to the clinical and epidemiological significance of HCV, genotyping has significant prognostic value and can be used to help determine the progress of the disease and its treatment protocols (21). The amino acid sequences of proteins are determined by three-nucleotide codons. Living organisms use the standard genetic code, in which 61 codons encode 20 amino acids, so some amino acids have more than one codon. Selective pressure on translation favors the use of some codons over others for effective protein expression (27). Changes in the patterns of codon usage may lead to changes in the response to treatment with nucleotide-like drugs. Genotypes that have the greatest differences in codon usage may show significant differences in the response to, and the required duration of, treatment with the same drug regimen. The reason may lie in how similar or dissimilar the patterns of codon usage are between the genotypes being compared.

In this study, the greatest similarity in codon usage was observed between genotypes 1 and 5, whereas genotypes 1 and 4 differed the most; it might therefore be expected that genotypes 1 and 4 would respond in opposite ways to a given dosage and treatment protocol. Despite the significant differences in codon usage between genotypes 1 and 4, the two genotypes were the closest phylogenetically, indicating greater similarity in their genome and protein sequences. The greatest phylogenetic difference was observed between genotypes 1 and 2, which indicates that these two genotypes differ the most in terms of their genome and protein sequences.

The results of the codon usage analysis showed that some codon usages, such as Gln (CAG, CAA), Ser (AGC), and Trp (TGG), had very similar frequencies in all of the HCV genotypes. This result is very important, as these residues may have a critical role in determining the final structure of the HCV proteins. However, it is essential to confirm this conclusion with more experimental evidence.

As the results of this study showed, the most preferred terminal nucleotides in codon usage for all of the amino acids were C and G, and the least preferred were T and A. This is an important finding because, as previously reported, an additional layer of hidden information lies within the codon sequence, beyond the amino acid sequence (28). Studies of such hidden information can reveal the molecular evolution of organisms and provide insights into the functional categories and histories of the genes in a genome. Codon usage analysis can also contribute to understanding the interaction between RNA viruses and the immune responses of their hosts (29). These findings suggest that the transfer RNAs (tRNAs) reading the preferred codons have C or G at the first anticodon position for all of the amino acids and, consequently, that codon-anticodon interaction during messenger RNA (mRNA) translation would be very strong. As a result, the average codon-anticodon binding energy for HCV would be higher than that for the corresponding human cell components, and translation of the viral mRNA would be stronger (30). Depending on their nucleotide composition, different codons have different affinities for their anticodons, which leads to different translation efficiencies. Codons that contain C and G nucleotides bind their anticodons with more energy. Exact calculation of this energy could help us to better understand the mechanisms of successful HCV replication and pathogenicity.

In this study, we detected a layer of hidden information within the codon sequences of HCV genomes. We report these findings for the first time, and we believe that they are critical for planning new research projects and designing new drugs that influence codon-anticodon interaction. The findings of such bioinformatic studies can be used for further practical research and clinical trials, and can help establish a better understanding of HCV replication and pathogenesis. A similar analysis of the other viral agents of hepatitis could also provide new insights into viral behavior.




References

1. Lauer GM, Walker BD. Hepatitis C virus infection. N Engl J Med. 2001;345(1):41-52.
2. Feinstone SM, Kapikian AZ, Purcell RH, Alter HJ, Holland PV. Transfusion-associated hepatitis not due to viral hepatitis type A or B. N Engl J Med. 1975;292(15):767-70.
3. Vaudin M, Wolstenholme AJ, Tsiquaye KN, Zuckerman AJ, Harrison TJ. The complete nucleotide sequence of the genome of a hepatitis B virus isolated from a naturally infected chimpanzee. J Gen Virol. 1988;69 (Pt 6):1383-9.
4. Simmonds P, Holmes EC, Cha TA, Chan SW, McOmish F, Irvine B, et al. Classification of hepatitis C virus into six major genotypes and a series of subtypes by phylogenetic analysis of the NS-5 region. J Gen Virol. 1993;74 (Pt 11):2391-9.
5. Gower E, Estes C, Blach S, Razavi-Shearer K, Razavi H. Global epidemiology and genotype distribution of the hepatitis C virus infection. J Hepatol. 2014;61(1 Suppl) -57.
6. Chambers TJ, Hahn CS, Galler R, Rice CM. Flavivirus genome organization, expression, and replication. Annu Rev Microbiol. 1990;44:649-88.
7. Esteban R. Epidemiology of hepatitis C virus infection. J Hepatol. 1993;17 Suppl 3 -71.
8. de Oliveria Andrade LJ, D'Oliveira A, Melo RC, De Souza EC, Costa Silva CA, Parana R. Association between hepatitis C and hepatocellular carcinoma. J Glob Infect Dis. 2009;1(1):33-7.
9. Parkin DM, Bray F, Ferlay J, Pisani P. Global cancer statistics, 2002. CA Cancer J Clin. 2005;55(2):74-108.
10. Alberti A, Chemello L, Benvegnu L. Natural history of hepatitis C. J Hepatol. 1999;31 Suppl 1:17-24.
11. Peters MG. End-stage liver disease in HIV disease. Top HIV Med. 2009;17(4):124-8.
12. Norder H, Courouce AM, Magnius LO. Complete genomes, phylogenetic relatedness, and structural proteins of six strains of the hepatitis B virus, four of which represent two new genotypes. Virology. 1994;198(2):489-503.
13. Roque-Afonso AM, Ducoulombier D, Di Liberto G, Kara R, Gigou M, Dussaix E, et al. Compartmentalization of hepatitis C virus genotypes between plasma and peripheral blood mononuclear cells. J Virol. 2005;79(10):6349-57.
14. Sharp PM, Emery LR, Zeng K. Forces that influence the evolution of codon bias. Philos Trans R Soc Lond B Biol Sci. 2010;365(1544):1203-12.
15. Nirenberg MW, Matthaei JH, Jones OW, Martin RG, Barondes SH. Approximation of genetic code via cell-free protein synthesis directed by template RNA. Fed Proc. 1963;22:55-61.
16. Grantham R, Gautier C, Gouy M, Mercier R, Pave A. Codon catalog usage and the genome hypothesis. Nucleic Acids Res. 1980;8(1) -62.
17. Lloyd AT, Sharp PM. Evolution of codon usage patterns: the extent and nature of divergence between Candida albicans and Saccharomyces cerevisiae. Nucleic Acids Res. 1992;20(20):5289-95.
18. Ikemura T. Codon usage and tRNA content in unicellular and multicellular organisms. Mol Biol Evol. 1985;2(1):13-34.
19. Shepard CW, Finelli L, Alter MJ. Global epidemiology of hepatitis C virus infection. Lancet Infect Dis. 2005;5(9):558-67.
20. Smith DB, Bukh J, Kuiken C, Muerhoff AS, Rice CM, Stapleton JT, et al. Expanded classification of hepatitis C virus into 7 genotypes and 67 subtypes: updated criteria and genotype assignment web resource. Hepatology. 2014;59(1):318-27.
21. Zein NN. Clinical significance of hepatitis C virus genotypes. Clin Microbiol Rev. 2000;13(2):223-35.
22. Fattahi M, Malekpour A, Mortazavi M, Safarpour A, Naseri N. The characteristics of rare codon clusters in the genome and proteins of hepatitis C virus; a bioinformatics look. Middle East J Dig Dis. 2014;6(4):214-27.
23. Stothard P. The sequence manipulation suite: JavaScript programs for analyzing and formatting protein and DNA sequences. Biotechniques. 2000;28(6):1102.
24. Minitab I. MINITAB statistical software. Minitab Release. 2000;13.
25. Puigbo P, Bravo IG, Garcia-Vallve S. CAIcal: a combined set of tools to assess codon usage adaptation. Biol Direct. 2008;3:38.
26. Fattovich G, Stroffolini T, Zagni I, Donato F. Hepatocellular carcinoma in cirrhosis: incidence and risk factors. Gastroenterology. 2004;127(5 Suppl 1) -50.
27. Bennetzen JL, Hall BD. Codon selection in yeast. J Biol Chem. 1982;257(6):3026-31.
28. Chartier M, Gaudreault F, Najmanovich R. Large-scale analysis of conserved rare codon clusters suggests an involvement in co-translational molecular recognition events. Bioinformatics. 2012;28(11):1438-45.
29. Belalov IS, Lukashev AN. Causes and implications of codon usage bias in RNA viruses. PLoS One. 2013;8(2).
30. Allner O, Nilsson L. Nucleotide modifications and tRNA anticodon-mRNA codon interactions on the ribosome. RNA. 2011;17(12):2177-88.