I got the sequence of a DNA fragment from an unknown gene and tried to find the full gene sequence as well as its predicted protein sequence from that fragment.
Initially I did a BLASTN search using the DNA fragment sequence as my query. The best hit was fully identical to my query sequence and its description indicated it was a “complete sequence”. Based on this I concluded my DNA fragment was indeed the complete sequence. Next I ran a BLASTX search to find database matches to all six reading frames of the translated nucleotide query in order to find its predicted protein sequence. BLASTX returned many matches from organisms distinct from the organism with the best hit from the earlier BLASTN search. I ignored them and focused on just matches from the organism with the best scoring hit from my earlier BLASTN search. However, there were still multiple, different matches and that left me confused.
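For readers unfamiliar with what BLASTX does to the query before it searches the protein database, here is a minimal Python sketch of a six-frame translation. It assumes the standard codon assignments (bacterial translation table 11 uses the same codon-to-amino-acid mapping and differs only in which start codons are allowed). The function names are illustrative only; BLASTX does this step internally.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G at each position).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate in frame from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    frames = {}
    rc = reverse_complement(seq)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence, not the course fragment.
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Because each frame is searched separately, a single nucleotide query can produce protein hits in different frames and in different parts of the query, which is one reason BLASTX can list several distinct proteins from the same organism.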
[I then turned to UniProt BLAST (which actually does a BLASTX search) to search for the protein sequence, and it returned just one entry, which left me puzzled]. I went back to my NCBI BLASTX results and searched for that particular hit. Lo and behold, I found it there too.
[Did UniProt BLAST(X) return that single entry (unlike NCBI BLASTX) because it searched its database and found that particular entry among other manually curated (it had a golden logo) ones?]
The DNA fragment sequence is given below and was taken from the Bacterial Genomes 1: From Sequence to Function course created by Wellcome Connecting Science and organized by FutureLearn:
PostScript: this OP contains some errors, as noted further down the thread. Statements or questions in square brackets are errors or are based on errors.
The default parameters (like E-values and percent identity) are slightly different, but that is irrelevant. NCBI BLASTX returned multiple entries for a single organism, but UniProt BLASTX returned just one entry even though, like NCBI BLASTX, it converted the query nucleotide sequence into all six possible reading frames and compared them with the protein sequences in the database. I think I know why that was the case, as I said in the OP, but I want to know whether others agree with me or whether there is a different explanation.
@Michael_Okoko, I am not sure how much I should give away here because I don’t know what you are trying to do. However, the sequence you gave overlaps two adjacent protein-coding genes in your organism’s genome. I suspect that the varying BLASTX results you are getting have to do with this: the two databases may have different complements of coding regions, and your sequence will flag the two different proteins from the same genome.
This is the NCBI BLASTX alignment page showing one entry which is similar to the UniProt entry (indicated by the red arrow) and a different entry (there are other different entries but I included just this one).
OK. That helps. I hope you don’t mind if the following is a bit cryptic. I will try to take you through what I did, to an extent.
The first thing I did was, as you did, run BLASTN using the nr database. As you found, the best hit was to a sequence that was noted as complete. However, this only denoted that the entry in the database was complete. If you look at the BLASTN alignments, you will see large coordinate numbers (3649425-3650063). Drawing on my experience (I am, after all, a graybeard…), I reasoned that the database entry you retrieved was a genome entry, probably a complete bacterial chromosome. (This explains the name of the match - … chromosome 1, complete sequence, Length=3985223)
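The coordinate argument above can be checked with a few lines of arithmetic. This is only a sketch in Python; the numbers are the ones quoted from the BLASTN alignment, and `hit_span` is a helper name made up for illustration.

```python
def hit_span(start, end):
    """Length in bases of an alignment given 1-based inclusive coordinates.

    Works for either strand, since the coordinates are sorted first.
    """
    lo, hi = sorted((start, end))
    return hi - lo + 1


# Values quoted from the BLASTN alignment in the post above.
subject_start, subject_end = 3649425, 3650063
subject_length = 3985223  # "chromosome 1, complete sequence"

span = hit_span(subject_start, subject_end)
fraction = span / subject_length

if __name__ == "__main__":
    print(f"aligned region: {span} bp")
    print(f"fraction of subject covered: {fraction:.6f}")
```

The aligned region is only a few hundred bases of a subject that is several megabases long, so "complete sequence" describes the database entry (the chromosome), not the query fragment.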
I also ran BLASTX using the nr protein database and got the following (split into two so you can see the two relevant sets of matches):
You should be able to see that your entry (PStest in the first figure) splits into two sections, which correspond to the two different proteins. I figured from this that your sequence probably matched two different genes in your organism’s genome that happened to lie next to each other.
@Michael_Okoko, I will stop here to give you a chance to figure out some follow-up analyses.
One more thing:
I am not so familiar with the UniProt databases. I am guessing that they draw upon different sets of data than NCBI does. But I honestly cannot explain this.
No matter, I took a different direction once I saw the results I show above.
I went back and re-ran UniProt BLASTX and, to my surprise, it returned many entries, just like NCBI BLASTX. However, the entries were sorted by percent identity to my translated query nt sequence, from highest (first entry) to lowest (last entry).
The translated query nt sequence was fully identical to the Autotransporter adhesin BpaC protein sequence of Burkholderia mallei and nearly fully identical to the homologous protein sequence from Burkholderia pseudomallei (which was also among the entries). I think this suggests that the DNA fragment is bpaC (which is transcribed and translated to BpaC), as shown in the UniProt entry.
To take you through my own thought process: your BLASTN results take you to some specific coordinates in a bacterial genome. If the genome is annotated, you can go right to the genome (track it down and get the Genbank entry at NCBI), follow the genome coordinates to the appropriate location, and then get the complete protein sequences and annotations (if they exist) for both genes. You can then explore them further, using BLASTP or other tools.
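As a rough illustration of the "follow the genome coordinates" step, the Python sketch below checks which annotated features overlap a BLASTN hit interval. The gene names and gene coordinates in `toy_annotation` are invented placeholders, not the real annotation of this genome; in practice they would come from the GenBank record's CDS features. Only the hit interval is taken from the thread.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlapping_features(features, hit_start, hit_end):
    """Return (name, overlap_in_bases) for each feature that overlaps the hit."""
    lo, hi = sorted((hit_start, hit_end))
    results = []
    for feature in features:
        overlap = min(hi, feature.end) - max(lo, feature.start) + 1
        if overlap > 0:
            results.append((feature.name, overlap))
    return results


# Hypothetical annotation: names and coordinates are made up for illustration.
toy_annotation = [
    Feature("geneA (hypothetical)", 3648000, 3649700),
    Feature("geneB (hypothetical)", 3649800, 3651500),
    Feature("geneC (hypothetical)", 3652000, 3653000),
]

# Hit interval quoted from the BLASTN alignment earlier in the thread.
hits = overlapping_features(toy_annotation, 3649425, 3650063)

if __name__ == "__main__":
    for name, overlap in hits:
        print(f"{name}: {overlap} bp of the query region")
```

If two adjacent genes both appear in the result, the fragment spans a gene boundary, which would be consistent with BLASTX reporting two different proteins from the same organism.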
But an NCBI BLASTN search with that query nt sequence does not return a single bacterial genome entry; it returns multiple entries from the genomes of different bacteria. In addition, one of the NCBI BLASTX entries clearly indicates that the translated query nt sequence is partial or incomplete.
UniProt BLAST seems to be better equipped to figure out the full gene sequence and a corresponding predicted protein sequence. Found this on their page: