SARS-CoV-2 SRA sequences search

What is it?

This service lets you search for the presence and abundance of any sequence in publicly available raw next-generation sequencing datasets (and genomes) of SARS-CoV-2.


Query sequence | Origin | Remarks
TCAAATTGGATGACAAAGATCCAAATTTCAA | NC_045512v2:29283-29313 | Just a chunk of the SARS-CoV-2 genome, illustrating that most datasets indeed have it at high abundances
AAAAAAAAAAAAAAAAAAAAAAAA | a poly-A tail | Likely to be found in RNA-Seq datasets
CTTTATCAGGATGTTAACTGC | NC_045512v2:23403 | famous variant site (More info)
CTTTATCAGGATGTTAACTGC | NC_045512v2:23405 | NOT a famous variant site (More info)
CTTTATCAGGATGTTAACTGC | NC_045512v2:23406 | also NOT a famous variant site
GAAGGTCTTAATGACAACCTT | NC_045512v2:1605-1607 | famous deletion site (More info)

Technical details

Sequence searches are performed as follows: the query sequence is broken down into all of its overlapping 21-mers, and if any of those 21-mers is absent from a dataset, the whole sequence is reported as absent from that dataset.
Otherwise, it is considered present. For raw sequencing datasets, we report the median abundance across all 21-mers of the query in the dataset; for genomes, we report only the presence or absence of the query.
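
The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service's actual implementation: a dataset is modelled as a plain dictionary mapping each 21-mer to its abundance, the function names are hypothetical, and reverse complements are ignored because this page does not say how they are handled. For an even number of 21-mers, `statistics.median` averages the two middle values, which may differ from what the service reports.

```python
from statistics import median

K = 21  # k-mer length used by the service, as described above


def overlapping_kmers(seq, k=K):
    """Return every overlapping k-mer of seq, in order (may contain repeats)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def query_abundance(query, kmer_counts, k=K):
    """Return the median abundance of the query's k-mers, or None if absent.

    kmer_counts is a toy stand-in for one dataset: a dict {k-mer: abundance}.
    The query counts as absent as soon as a single k-mer is missing.
    """
    kmers = set(overlapping_kmers(query, k))  # de-duplicated set of k-mers
    if not kmers:
        raise ValueError(f"query must be at least {k} nucleotides long")
    abundances = []
    for kmer in kmers:
        count = kmer_counts.get(kmer, 0)
        if count == 0:
            return None  # one missing k-mer makes the whole query absent
        abundances.append(count)
    return median(abundances)


# Toy usage: a made-up dataset that contains every 21-mer of the first sample query.
query = "TCAAATTGGATGACAAAGATCCAAATTTCAA"
toy_dataset = {kmer: 50 for kmer in overlapping_kmers(query)}
print(query_abundance(query, toy_dataset))        # present: a median abundance
print(query_abundance(query + "G", toy_dataset))  # the extra 21-mer is missing: None
```

For a genome rather than a raw sequencing dataset, the same check would simply return present or absent instead of a median.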

Thus, sequence searches are exact in the sense that they allow for no mutations between a query sequence and matching sequences in datasets.
However, this is not the same as running 'grep': a query is essentially treated as an unordered, de-duplicated set of 21-mers. For example, in the long poly-A sample query all the constituent 21-mers are identical, so the query is performed as if it were a single 21-mer, regardless of the original query length.
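
The difference from 'grep' can be made concrete with a small sketch. The helper name `kmer_set` is hypothetical, and the second example uses k=3 and made-up sequences purely to keep the strings short; the service itself uses 21-mers.

```python
def kmer_set(seq, k=21):
    """Return the unordered, de-duplicated set of overlapping k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# 1. Poly-A queries of different lengths collapse to the same single 21-mer.
short_polya = "A" * 24
long_polya = "A" * 100
print(kmer_set(short_polya) == kmer_set(long_polya))  # True
print(len(kmer_set(long_polya)))                      # 1

# 2. A query can be "present" as a k-mer set without occurring as a substring.
#    Toy example with k=3: every 3-mer of the query occurs in the read,
#    but the query as a whole does not.
read = "ACGTACGA"
query = "TACGT"
print(kmer_set(query, k=3) <= kmer_set(read, k=3))  # True: k-mer search matches
print(query in read)                                # False: a grep would not match
```

In practice, with 21-mers and real sequencing data such order-independent matches are expected to be less common than in this toy case, but the service makes no guarantee that the query occurs contiguously in any single read.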


The technology behind this service is REINDEER (pre-print).
