Home

Awesome

WWDC 2020: Coronavirus Comparison

Efficiently comparing the 2019 coronavirus genome with a couple of other coronaviruses in Swift. This project was selected as a WWDC 2020 Swift Student Challenge winner!

Youtube demo: here.

Setup

You can run this project by cloning this repository and opening the Xcode playground WWDC2020.playground in Xcode 11.3.

Motivation

The 2019 coronavirus (SARS-CoV-2) pandemic is a major event in this new decade. There are millions of cases world-wide caused by this highly contagious coronavirus disease (COVID-2019). Therefore, understanding this virus is very important. In this WWDC project, I compare the 2019 coronavirus to a couple of other similar viruses by looking at their sequenced RNA genomes.

Overview

First, let us define the "base virus" as a sample of the 2019 coronavirus from a human host in Illinois. This project allows us to visualize the relationship between this base virus and a couple of other samples that were sequenced by researchers. These sequences were available in FASTA format and they were retrieved from the National Center for Biotechnology Information's GenBank database:

The raw FASTA files are available in the WWDC2020.playground/Resources/ directory.

The Challenge

With coronavirus genomes reaching up to almost 30,000 nucleotides in length, aligning them will take a significant amount of time. With the basic dynamic programming edit distance algorithm to find the similarities and differences between two genomes, this takes around 900 million operations and storing 900 million numbers in an array (repeated 6 times for comparing 6 different virus genomes!). This is way too inefficient for a ~3 minute demo. Much of this project is dedicated to push Swift to the limit and calculate the alignments in real time.

Algorithm

The edit distance or "Levenshtein distance" between two sequences is the number of gaps (sometimes called insertions or deletions) and mismatches between those two sequences. The actual edit operations (gaps and mismatches) make up the alignment of those two sequences. The basic Levenstein distance algorithm for calculating the edit distance value is a simple recursion, which can be expressed as

dp[i][j] = dp[i - 1][j - 1], if A[i - 1] == B[j - 1]
           max of {dp[i - 1][j - 1] + 1, dp[i][j - 1] + 1, dp[i - 1][j] + 1}, otherwise (A[i - 1] != B[j - 1])

for two strings A and B of length n and m using hand-wavy pseudocode. Special care must be taken for the first row/column. dp[n][m] is the Levenshtein distance between A and B. Of course, memoization can be used (as the name dp suggests, dynamic programming) to get a time complexity of O(mn). This is not good enough for the ~30,000 nucleotide coronavirus genomes. Therefore, many interesting tricks for my implementation of this algorithm:

Limitations

For a serious analysis please use a well-established tool like BLAST. This project is merely a demostration of the algorithms and a visualization of the differences between coronavirus genomes.