Annotation of virulence factors in schistosomes for the development of a SchistoVir database

Efforts to eradicate neglected tropical diseases, such as those caused by parasitic helminths, would benefit from a database of key virulence factors directly implicated in pathogenesis. As a first step towards creating SchistoVir, a database of virulence protein factors in schistosomes, we curated, annotated and aligned the sequences of twenty virulence factors identified from the literature, using several bioinformatics tools and resources including UniProtKB, SchistoDB, VirulentPred, InterProScan, ProtScale, MotifScan, TDRtargets, SignalP, MODBASE, PDB and MUSCLE. Among the protein entries, the most frequently occurring amino acid residues were lysine, serine, leucine, glutamine, glycine and cysteine, in order of frequency. Sequence repeat regions (SRRs) of significant value were identified manually in fifty percent of the proteins, whereas dipeptide repeats (DiPs) and single amino acid repeats (SAARs) were not; nevertheless, seventy-two percent of the protein entries were classified as virulent by the prediction model VirulentPred. Most of the entries (eighty percent) had no target compounds in the database of available chemical compounds at TDRtargets. Fourteen of the twenty entries (seventy percent) had more than 30 consecutive amino acid residues with negative values on ProtScale’s Kyte and Doolittle hydrophobicity plot. Hence, they would be hydrophobic enough to be transmembrane in location or secretory in nature. Only seven (tyrosinase, serine protease 1, Tspan-1, VAL4, cathepsins B and L, and calreticulin) had cleavage sites and signal peptides, and none had a significant signal anchor probability. The annotations and characterization provided by this work, together with the development of the SchistoVir database, will aid further research into schistosome pathogenesis and control.
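
The residue-frequency and hydrophobicity analyses summarized above can be illustrated with a minimal sketch. It is not the ProtScale implementation: the function names (`composition`, `hydropathy_profile`, `longest_negative_run`) are hypothetical, and the window width of 9 is an assumption, since the abstract does not state the settings used. Only the Kyte and Doolittle scale values themselves are standard.

```python
from collections import Counter

# Kyte and Doolittle hydropathy values (positive = more hydrophobic,
# negative = more hydrophilic on this scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Return (residue, count) pairs, most frequent first."""
    return Counter(seq.upper()).most_common()


def hydropathy_profile(seq, window=9):
    """Sliding-window mean of Kyte and Doolittle values.

    window=9 is an assumed setting, not one taken from the study.
    Raises KeyError for letters outside the 20 standard residues.
    """
    scores = [KD[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def longest_negative_run(profile):
    """Length of the longest stretch of consecutive negative values."""
    best = current = 0
    for value in profile:
        current = current + 1 if value < 0 else 0
        best = max(best, current)
    return best
```

Applied to each protein sequence, `longest_negative_run(hydropathy_profile(seq))` gives one way to count consecutive negative positions in a profile; whether this matches the counting used in the study is an assumption.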