Feature: Help, my favorite gene is missing in the genome!

"Help, the gene I am working with is not present in the genome!"

Users start noticing that some of the genes they are working with cannot be found in the current genome annotation, and they start complaining about it. This is in fact a good thing, because only when problems are reported can we notice them and possibly work on a fix. It is less fortunate, though, that the natural shortcomings of our genome assembly may also reflect on the database hosting it (e.g. LiceBase), even though the database system and the genome sequence it represents are two related but distinct artifacts (or pieces of "software").

It is very common, and not at all surprising, for draft genomes and even well-established genomes to contain gaps, inversions, chimeras, and other mis-assemblies. In a noteworthy recent Technology Review in Nature [1], Vivien Marx portrays the situation for the human genome as follows:

"For example, the human-genome sequence, used as a reference by scientists around the world, has more than 350 gaps, says Deanna Church, a genomicist at the US National Center for Biotechnology Information. An updated reference genome is filling in much of the missing data, but “even with the release of the new assembly, there will still be gaps and regions that aren’t well represented,” she says. “It is definitely a work in progress.”"

So errors, missing genes and wrong sequences are no real surprise, at least for most genomicists, and modern high-throughput sequencing approaches have possibly improved the availability of genome sequence and reduced the burden of gaps, and thereby the frustration. In her review, Marx further cites a statement by Evan Eichler, a molecular biologist at the University of Washington, on how genes located inside poorly sequenced regions have often simply been neglected:

"More than 900 human genes are in regions where there is much repetition. About half of these genes are in areas so poorly understood that they are often excluded from biomedical study, says Eichler."

In addition, automatic annotation and gene prediction procedures are certainly far from perfect and might result in further genes lacking proper gene models.

How we can help

To the best of our knowledge, the current genome assembly and annotation are the best source of genomic data on sea lice that exists. Both the assembly and the annotation have been carried out using state-of-the-art pipelines and have been validated by a large number of different methods using intrinsic and extrinsic data. Unfortunately, we do not have a "magic wand" to make the genome as perfect as it should be, such that it matches all validated sequences. However, neglect is not an option in the many cases where the gene of interest is highly promising.

Instead, we rely on the contribution of users who are willing to share information about crucial gaps in the sequence and annotation. This genome is ours and we all need to work jointly on improving it. There are several steps that can contribute to resolving such problems (note how similar the approach is to reporting and fixing software bugs):

  • The first step is to report a potential gap or error. Currently, we do not even have a complete overview of how many interesting genes are missing. I suspect that it is just a handful, but we need more data to find out.
  • This might be followed by submitting a validated sequence (PCR products, etc.) to LiceBase and also GenBank, which can then be used to search the genome and all other available sequence resources. This provides a reproducible example that helps to find out what is wrong in the first place (is it a gap in the genome, a mis-assembly, or a missing gene model in the annotation?).
  • We can then work jointly on a 'patch' for the next genome update, because it will be possible to provide regular updates to Ensembl for the annotation and genome (personal communication with Ensembl staff).
  • While this is in progress, everyone can use the submitted validated sequence as a reference in their own experiments.
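
The triage question in the second step (gap, mis-assembly, or missing gene model?) can be illustrated with a small sketch. This is not the procedure used by LiceBase or Ensembl; it only assumes that the validated sequence has already been aligned to the assembly by some search tool, and that the hits and annotated gene coordinates are available as simple records. The names Hit, query_coverage and classify, as well as the 90% coverage threshold, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of the validated sequence to the assembly (hypothetical record)."""
    scaffold: str
    q_start: int  # 1-based, inclusive, on the validated (query) sequence
    q_end: int
    s_start: int  # 1-based, inclusive, on the scaffold
    s_end: int


def query_coverage(hits, query_len):
    """Return the fraction of query bases covered by at least one hit."""
    covered = 0
    last_end = 0
    for start, end in sorted((h.q_start, h.q_end) for h in hits):
        start = max(start, last_end + 1)  # skip bases already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / query_len


def classify(hits, query_len, gene_models, min_cov=0.9):
    """Give a rough first guess at why a gene seems to be missing.

    gene_models is a list of (scaffold, start, end) tuples for annotated genes.
    min_cov is an arbitrary illustrative threshold, not a recommended value.
    """
    if not hits or query_coverage(hits, query_len) < min_cov:
        return "possible gap in the assembly"
    if len({h.scaffold for h in hits}) > 1:
        return "possibly split or mis-assembled"
    for h in hits:
        lo, hi = sorted((h.s_start, h.s_end))  # hits may be on the minus strand
        for scaffold, g_start, g_end in gene_models:
            if scaffold == h.scaffold and lo <= g_end and g_start <= hi:
                return "gene model present"
    return "possible missing gene model"


if __name__ == "__main__":
    hits = [Hit("scaffold_A", 1, 1000, 5000, 5999)]
    print(classify(hits, 1000, gene_models=[]))
```

Any such automatic call is only a starting point for a report; the actual cause still needs to be checked by inspecting the alignments.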

Certainly, we need to agree on a valid modus operandi for such updates within the Centre and also with Ensembl. In the meantime, we are providing all necessary reporting, sequence search, and submission tools for potential problems and new sequences to the community. Also, we are willing to provide ample assistance in trying to analyse the nature of missing genes within the annotation.

Please check out the following resources to help identify your gene or report new sequences:

Please contact us: we can provide advanced bioinformatics assistance in finding your genes of interest in the available genomic data on sea lice.

[1] Marx, Vivien. “Next-generation Sequencing: The Genome Jigsaw.” Nature 501, no. 7466 (September 12, 2013): 263–268. http://dx.doi.org/10.1038/501261a. | http://www.nature.com/nature/journal/v501/n7466/full/501261a.html?WT.ec_...