It occurred to me that a cascading stack of these could be used to bifurcate any data with an inherent vector or bias into subgroups/subsequences in an efficient spectrum fashion.
It popped into my mind that one prominent inherent bias in genomic information is the life-span duration of species. It is a simple matter to arrange a collection of peptide coding genomic data along this spectrum of life-span lengths and train a cascade of SVMs to discern where along this spectrum any block of DNA would fall.
To move this towards actual discovery of novel oligopeptides one would then just need to train an LSTM recurrent neural net on samples of DNA sequences from along this spectrum in order to generate novel oligopeptide which share the probabilistic characters of the original spectrum. The resultant outputs when passed through the above trained SVMs would be scored by where they fell in the spectrum. Any peptides sequences which scored past the longest lived organisms would by definition be from the informational sampling space of an organism with super-spectral lifespan, even if the hypothetical organism doesn't actually exist.
Here's a graphic to help demonstrate the method:

No comments:
Post a Comment