Trying to decipher the hidden language of (non-coding) DNA is like aliens landing on Earth knowing nothing about the planet or humans, entering a public library, discovering racks and racks of printed text, and trying to derive meaning from it. There is meaning, of course. But making sense of it is another matter. Maybe AI will provide that starting point.
Do you want acid-bleeding, face-hugging, stomach-bursting aliens? Because this is how you get acid-bleeding, face-hugging, stomach-bursting aliens.
However, this clearly shows that the human brain still beats AI: it is humans who determine what the AI needs in order to function correctly (and it still falls short on many occasions). The same holds true for this endeavor.
Interesting endeavor, though, I must admit.
“...researchers can attempt to decode the intricate information concealed within our genome.”
Good luck with that. The retards can’t even define what a woman is or which restroom they should use.
My DNA was just decoded. It translated into “Be sure to drink your Ovaltine.”
I imagine that this AI model still cannot decipher “noncoding DNA”, which was once classified as “junk” but has now in many cases been found to have important functions.
https://medlineplus.gov/genetics/understanding/basics/noncodingdna/
Real Frankenstein stuff. They’ll conjure unimaginable horrors.
It is Open Access
DNA language model GROVER learns sequence context in the human genome
Available models for the human genome include LOGO, DNABERT and Nucleotide Transformer (NT), which use a Bidirectional Encoder Representations from Transformers (BERT) architecture and apply different strategies of generating the vocabulary. NT uses mainly 6-mers as its vocabulary. DNABERT uses k-mers of 3, 4, 5 and 6 nucleotides for four different models, of which the 6-mer model performs best. The k-mers overlap, and the training is designed for the central nucleotide of a masked sequence not to overlap with any unmasked tokens. Consequently, the model largely learns the token sequence, rather than the larger context. Semi-supervised models include data beyond the genome sequence, such as GeneBERT. HyenaDNA uses implicit convolutions in its architecture. Taking genomes from multiple species increases the amount of training data, as for DNABERT-2.
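To make the overlap point concrete, here is a quick Python sketch (mine, not from the paper) of DNABERT-style overlapping 6-mer tokenization. Neighbouring tokens share five of six bases, which is why a masked token is easy to guess from its unmasked neighbours:

    # Toy illustration: overlapping k-mer tokenization of a DNA sequence.
    def kmer_tokens(seq, k=6):
        """Return the overlapping k-mers of a DNA sequence."""
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    print(kmer_tokens("ACGTACGTACGT"))
    # ['ACGTAC', 'CGTACG', 'GTACGT', ...]
    # Adjacent tokens share k-1 = 5 bases, so masking one token still leaves
    # most of its nucleotides visible in the neighbouring tokens.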
We therefore applied byte-pair encoding (BPE) to the human genome to generate multiple frequency-balanced vocabularies and selected the vocabulary that carries the information content of the human genome in an optimal way. In combination with fine-tuning tasks and the inbuilt transparency of the model architecture, we can now start using the resulting foundation DLM, GROVER (Genome Rules Obtained Via Extracted Representations), to extract its learning and different layers of the genome’s information content.
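For anyone curious what byte-pair encoding actually does here, below is a minimal toy sketch of the merge loop on a short DNA string. It only shows the general BPE idea (repeatedly merge the most frequent adjacent pair into a new token); GROVER's real tokenizer is trained on the whole human genome and the authors select among several vocabulary sizes, none of which this reproduces.

    # Minimal byte-pair-encoding sketch on a toy DNA string.
    from collections import Counter

    def bpe_merges(seq, n_merges):
        tokens = list(seq)                      # start from single nucleotides A, C, G, T
        for _ in range(n_merges):
            pairs = Counter(zip(tokens, tokens[1:]))
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]   # most frequent adjacent pair
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                    merged.append(a + b)        # merge the pair into one longer token
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            tokens = merged
        return tokens

    print(bpe_merges("ACGTACGTAACCGGTT", n_merges=4))

Each merge adds one entry to the vocabulary, so the number of merges controls how "frequency-balanced" the resulting token set is, which is the knob the paper tunes when selecting its vocabulary.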
https://www.nature.com/articles/s42256-024-00872-0
Try it here https://huggingface.co/PoetschLab/GROVER
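If you'd rather poke at it from Python than the Hub page, something along these lines should work, assuming the checkpoint loads with the standard transformers AutoClasses (check the model card and the tutorial below for the authors' recommended usage):

    # Hedged sketch: load GROVER from the Hugging Face Hub and run one sequence.
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
    model = AutoModelForMaskedLM.from_pretrained("PoetschLab/GROVER")

    inputs = tokenizer("ACGTACGTGGCCATTA", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.logits.shape)   # (batch, number of BPE tokens, vocabulary size)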
Tutorial GROVER - DNA Language Model https://zenodo.org/records/13135894