Finding SARS-CoV-2 variants

Michael_Okoko · January 31, 2021, 5:58pm

It is no longer news that highly transmissible variants of SARS-CoV-2 or hCoV-19 are showing up in different parts of the world. Genome surveillance of hCoV-19 allows researchers and health agencies to keep track of the molecular changes responsible for notable features associated with its notorious variants.

From some papers I have read, it appears NCBI BLAST can be used for genome analysis of the virus to detect variants, but I don’t know how to do this. It would be great if someone showed how NCBI BLAST can be used to detect these variants, since BLAST is in the public domain and quite easy to use (not that I am a pro).

@Mercer, @glipsnort.

swamidass · January 31, 2021, 9:30pm

That’s like asking,

I’ve heard that hammers are used to build houses. Can you give me a quick description of how to build a house with a hammer?

Michael_Okoko · January 31, 2021, 10:04pm

I see. So what exactly is BLAST used for in genome analysis of SARS-CoV-2?

RonSewell · January 31, 2021, 10:18pm

You might like this.

Michael_Okoko · February 1, 2021, 10:58am

Thanks Ron, this is a good learning resource. If I am interpreting @swamidass’s earlier comment right, it seems BLAST is not sufficient to identify the mutations suspected to be responsible for the increased transmissibility of some hCoV-19 variants. However, we can look at nucleotide or amino acid substitutions across pairwise or multiple alignments when BLASTing. Isn’t that enough to computationally identify sequence differences before experimentally testing whether those differences are actually responsible for the observations being made?
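The idea in this post, reading substitutions off a pairwise alignment, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sequences are already aligned (for example, copied from a BLAST pairwise alignment), gaps are written as "-", and the function name `list_substitutions` and the toy sequences are made up for the example. It lists differences; it says nothing about whether a difference matters biologically.

```python
def list_substitutions(ref_aln, query_aln):
    """List substitutions between two already-aligned sequences.

    Positions are 1-based and counted along the reference, so gap
    columns in the reference do not advance the position counter.
    Columns with a gap or an ambiguous 'N' in the query are skipped.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    substitutions = []
    ref_pos = 0
    for r, q in zip(ref_aln.upper(), query_aln.upper()):
        if r != "-":
            ref_pos += 1
        if r == "-" or q == "-" or q == "N":
            continue
        if r != q:
            substitutions.append((ref_pos, r, q))
    return substitutions


# Toy sequences (hypothetical, not real SARS-CoV-2 data).
reference = "ATGGCTAGCTA"
query = "ATGACTAGTTA"
for pos, ref_base, alt_base in list_substitutions(reference, query):
    print(f"{ref_base}{pos}{alt_base}")
```

With one query this works, but it does not scale to thousands of genomes or tell you how common a change is, which is where the multiple-alignment approaches discussed later in the thread come in.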

RonSewell · February 1, 2021, 4:23pm

This is new to me as well, so I should defer further comment to those with further expertise. For better or worse, kids are presenting this sort of thing in science fair projects, so the tools are accessible.

T_aquaticus · February 1, 2021, 4:29pm

Upfront warning . . . I have only recently dipped my toe into the bioinformatics ocean, so don’t take my words as gospel.

There are tons of different software packages out there for working with sequence data. If you feel adventurous, I would suggest creating a Linux boot-up on your computer so you can install UGENE, which is a free bioinformatics software package. I haven’t looked, but I would assume there are tons of tutorials on the web that could teach you how to use UGENE with real data. I would also assume that there is a variant track somewhere online that will allow you to visualize where all the known mutations are within the SARS-CoV-2 genome.

Having a Linux boot on your computer could also allow you to go down the command line rabbit hole of bioinformatics which will scratch every geek itch you may have. There is something oddly satisfying about pasting in a massive command line into a terminal window and getting back a tab delimited file that has all the info you need. Bioinformatics software is a bit intimidating at first, but it is very satisfying when you start to understand it.

As to BLAST, it isn’t widely used for discovering variants. BLAST is used for things like annotating genomes and trying to identify which species a stretch of DNA/RNA or protein came from.

Michael_Okoko · February 1, 2021, 5:06pm

Command lines scare me. I prefer the lights and colors of interactive user interfaces.

If it can be used, do you know how to?

I know this too, but I just want to know whether some variant tracking can be done with BLAST (since it’s the only bioinformatics tool I am quite comfortable with).

davecarlson · February 1, 2021, 5:07pm

I don’t have time this week, but if somebody reminds me, next week I can probably provide a brief visual tutorial of the steps that are typically taken when calling variants from population sequencing data. I do that sort of thing pretty regularly.

Caveats:

I’m not completely certain that the steps I use are the same steps that folks working with COVID19 sequencing data use, though the principles should be similar.
When people refer to COVID19 variants, they don’t seem to be talking specifically about individual mutations so much as novel haplotypes that might confer some effect of note. The specific definition of “variant” in this context is a little unclear to me, and I haven’t really looked into it.

Michael_Okoko · February 1, 2021, 5:22pm

That would be great. I will definitely remind you by tagging you here next week.

Variants in this context would be the ones making the headlines like the B.1.1.7 variant with over twenty mutations that make its spike protein bind ACE2 receptors more readily.

Michael_Okoko · February 1, 2021, 5:31pm

Take a look at this paper, wherein the details of a variant analysis of collected hCoV-19 sequences are documented. I now feel the full weight of @swamidass’s comment on the inadequacy of BLAST for variant analyses.

@davecarlson is your approach similar to what was done in the paper?

davecarlson · February 1, 2021, 6:00pm

Not exactly. Because the genome of SARS-CoV-2 is very small and haploid, it’s (relatively) straightforward to sequence and assemble the whole thing and work with the full-length sequence of each sample you’ve sequenced. In this case, they downloaded 1,040 pre-assembled genome sequences, built a multiple sequence alignment, and then identified variant sites using a minimum allele frequency threshold of 1% (a variant has to be found in 1% or more of the sequences in order to be identified).
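As a rough sketch of the frequency-threshold step described here (not the paper's actual pipeline), the Python below scans each column of a multiple sequence alignment and reports positions where a non-major base reaches the threshold. The function name `variant_sites`, the choice to ignore gaps and ambiguous bases, and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter


def variant_sites(alignment, min_freq=0.01):
    """Find alignment columns where a minor allele reaches min_freq.

    alignment: list of equal-length, already-aligned sequences.
    Returns a list of (1-based column, major base, {minor base: frequency}).
    Gaps and ambiguous characters are ignored when computing frequencies.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for col in range(length):
        counts = Counter(seq[col].upper() for seq in alignment)
        counts = Counter({b: n for b, n in counts.items() if b in "ACGT"})
        total = sum(counts.values())
        if total == 0:
            continue
        ranked = counts.most_common()
        major = ranked[0][0]
        minor = {b: n / total for b, n in ranked[1:] if n / total >= min_freq}
        if minor:
            sites.append((col + 1, major, minor))
    return sites


# Toy alignment (hypothetical): 200 sequences of length 4.
toy_alignment = ["ACGT"] * 197 + ["ACTT"] * 2 + ["GCGT"]
for position, major, minor in variant_sites(toy_alignment):
    print(position, major, minor)
```

In the toy data, the change at column 3 appears in 2 of 200 sequences (1%) and is reported, while the change at column 1 appears in only 1 of 200 (0.5%) and is filtered out, which is the point of the threshold: rare differences that may be sequencing errors are dropped.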

It’s quite a bit more work when dealing with larger genomes. For a variety of reasons (that are slowly becoming less true over time as technology improves), it’s generally not practical to assemble the full genome for every individual you’ve sequenced. Instead you rely on a small number of reference genome assemblies for your species that have already been generated (typically it’s just a single reference). Then you sequence the genomes of your population sample, but you first break up the genome into much smaller, overlapping fragments and then sequence part or all of each fragment. These sequenced bits are called “reads” because they are what is “read” by the sequencing machine.

By sequencing each of the smaller reads, you ideally end up sequencing every position of your genome multiple times, perhaps dozens or even hundreds of times, depending on how much DNA you have and how much money you want to spend!
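The "multiple times" here is usually called depth or coverage. A minimal sketch, assuming each read is described only by a 0-based start position and a length (the function name `per_position_depth` and the toy reads are invented for the example):

```python
def per_position_depth(genome_length, reads):
    """Count how many reads cover each genome position.

    reads: list of (start, length) tuples with 0-based start positions.
    Parts of a read that fall outside the genome are ignored.
    """
    depth = [0] * genome_length
    for start, length in reads:
        for pos in range(max(start, 0), min(start + length, genome_length)):
            depth[pos] += 1
    return depth


# Toy example: a 10-base "genome" and four overlapping 5-base reads.
toy_reads = [(0, 5), (2, 5), (4, 5), (5, 5)]
depth = per_position_depth(10, toy_reads)
print(depth)
print("mean depth:", sum(depth) / len(depth))
```

The mean depth matches the usual back-of-the-envelope estimate of (number of reads × read length) / genome length, here 4 × 5 / 10 = 2, but individual positions vary, with the ends of the genome covered least.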

You can take each of those reads and align them against your reference genome in a manner that is somewhat analogous to a BLAST search, though the details differ. Then you run various software that examines each position in the genome assembly and attempts to build a statistical model of the genotype (homozygous for the reference allele, heterozygous, or homozygous for an alternative allele) at each position based on the alignment of the reads. The more reads that successfully align to a particular position, the greater the ability of the algorithm to reliably identify any variation that might be present at that position. What you’re typically left with is a list of all the positions in the genome that were successfully genotyped in one or more individuals in your sample and what the genotype was.
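To make the genotyping step a little more concrete, here is a deliberately naive Python sketch. It assumes the reads have already been aligned (each read is given as a 0-based start position plus its bases, with no indels), and it replaces the statistical model with simple allele-fraction cutoffs. The function names, the cutoffs (`min_depth`, `het_low`, `het_high`), and the toy data are all assumptions for illustration; real variant callers also model base quality, mapping quality, and sequencing error.

```python
from collections import Counter, defaultdict


def pileup(aligned_reads):
    """Stack aligned reads into per-position base counts.

    aligned_reads: list of (start, sequence) with 0-based starts, no indels.
    """
    columns = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return columns


def call_variants(reference, aligned_reads, min_depth=4,
                  het_low=0.25, het_high=0.75):
    """Naive diploid genotype calls from alt-allele fraction.

    Returns (1-based position, ref base, alt base, genotype) for every
    position with enough depth whose call is not homozygous reference.
    """
    columns = pileup(aligned_reads)
    calls = []
    for pos, ref_base in enumerate(reference):
        counts = columns.get(pos, Counter())
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything at this position
        alt_counts = {b: n for b, n in counts.items() if b != ref_base}
        if not alt_counts:
            continue  # every read matches the reference
        alt_base = max(alt_counts, key=alt_counts.get)
        alt_frac = alt_counts[alt_base] / depth
        if alt_frac < het_low:
            continue  # treated as homozygous reference (likely noise)
        genotype = "het" if alt_frac <= het_high else "hom_alt"
        calls.append((pos + 1, ref_base, alt_base, genotype))
    return calls


# Toy data (hypothetical): half the reads carry a T at position 3.
toy_reference = "ACGTACGT"
toy_reads = [(0, "ACGTA"), (0, "ACTTA"), (1, "CGTAC"), (1, "CTTAC"),
             (2, "GTACG"), (2, "TTACG"), (3, "TACGT"), (3, "TACGT")]
print(call_variants(toy_reference, toy_reads))
```

Note how the first and last positions are skipped because only two reads cover them: that is the "more reads, more reliable" point from the post in miniature. For a haploid genome like SARS-CoV-2 the heterozygous category does not apply in the same way; it is kept here because the post describes the general diploid case.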

All of the above is a pretty gross oversimplification that neglects all kinds of important and pesky details, but I hope you get the gist!

Michael_Okoko · February 1, 2021, 7:05pm

I certainly did. Hoping to get some visuals next week.

davecarlson · February 1, 2021, 7:29pm

Great, please ping me next week as I will almost certainly forget!

By the way:

Command lines scare me. I prefer the lights and colors of interactive user interfaces.

If you are interested in learning bioinformatics, I would strongly recommend increasing your comfort level with the command line. That said, there are online resources that build a graphical interface on top of some widely used command line tools.
For example, I believe you could create a free account with Galaxy and/or Cyverse, and use those resources to run bioinformatics analyses. CIPRES may also be useful for phylogenetics, though they seem to have heavily restricted free usage for non-US users.

Here is a slightly-outdated tutorial for doing variant calling with Cyverse: https://cyverse.atlassian.net/wiki/spaces/TUT/pages/258736284/Discover+Variants+Using+SAM+Tools+Workflow+Tutorial#DiscoverVariantsUsingSAMTools(WorkflowTutorial)-Operation3:IdentifyVariants(app:CallingSNPsINDELswithSAMtoolsBCFtools)

Michael_Okoko · February 6, 2021, 1:13am

Hello @davecarlson.

Reminder on the visual demonstration of finding genomic variants for SARS-CoV-2.

swamidass · February 8, 2021, 8:06pm

6 posts were split to a new topic: A Basic Tutorial on Analyzing the Coronavirus Genome
