Non-coding RNA analysis

Background

A non-coding RNA (ncRNA) is an RNA molecule that is not translated into a protein. The number of ncRNAs present in genomes is unknown, but recent studies suggest that thousands exist. Most of the identified ncRNAs have not had their function validated, and many of them may be non-functional. However, some types of ncRNAs, such as microRNAs, are known to play critical roles in many physiological processes, and their dysregulation is closely related to the development and progression of various human diseases, including cancer. In silico processing can contribute to a better understanding of the functions of these molecules.

Research rationale

The idea is to use computational techniques to enhance our knowledge about ncRNAs. We focus our research on the prediction of miRNA-disease associations and the study of triplex topology.

Prediction of miRNA-disease associations

MicroRNAs (miRNAs) are a class of 19–24 nucleotide single-stranded non-coding RNAs that regulate gene expression at the post-transcriptional level by binding to the 3′ untranslated regions (UTRs) of target mRNAs through base pairing. A single miRNA may regulate from a dozen to thousands of genes, and one target gene may in turn be regulated by hundreds of miRNAs. These miRNA-mRNA interactions play critical roles in many physiological processes, such as development, apoptosis, differentiation and metabolism. Additionally, miRNA dysregulation is closely related to the development and progression of various human diseases, including cancer. Relatively little data has been gathered on this topic so far. Mathematical modeling helps to address this limitation because it allows us to focus research and development effort on the most promising new miRNA-disease associations.

We developed a method, called MiRAI [1], that uses distributional semantics to reveal new information attached to miRNAs and diseases. Our basic approach represents distributional information on miRNAs and diseases in a high-dimensional vector space and defines the associations between miRNAs and diseases in terms of vector similarities. The distributional hypothesis is the basis of statistical natural language processing; it states that the meaning of words can be determined by the context in which the words are used. In our case study, miRNAs play the role of words, and the data associated with miRNAs play the role of context. Following this analogy, the goal is to use the available information on the miRNAs as a whole (context) to infer new knowledge about individual miRNAs (words).

A vector space model is an algebraic model for representing objects as vectors. Each component of a vector holds a value (a weight) that quantifies the importance of one feature of the modeled object. Text studies use techniques such as Latent Semantic Analysis (LSA) to process vectors from a vector space model. For textual data, there are many ways to calculate the weights. Variations of the popular term frequency–inverse document frequency (TF-IDF) weighting scheme, which multiplies a Term Frequency measure by an Inverse Document Frequency measure, are frequently used.

In our scenario, the vectors that represent the objects to be analyzed (miRNAs) are not homogeneous. In addition to associations between miRNAs and words extracted from plain text documents, we must also represent the relationships between the miRNAs and diseases, targets, families or neighboring miRNAs. Links between miRNAs and diseases, targets or families are binary: an association either exists or does not. Neighboring relationships between miRNAs are expressed as integers that represent a distance along a chromosome in base pairs. For textual data, for which LSA is well suited, we applied the TF-IDF weighting scheme. For the other data types, it was necessary to determine a weighting scheme that could represent the importance of the link between each miRNA and the associated data.
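The textual part of this representation can be illustrated with a minimal sketch. It is not the MiRAI implementation: the miRNA names and context words are made up, the TF-IDF variant (relative term frequency times the natural log of the inverse document frequency) is one common choice among many, and the binary and distance-based features are left out because the text does not specify their weighting.

```python
import math
from collections import Counter


def tf_idf_vectors(contexts):
    """Build sparse TF-IDF vectors from {object_name: [context words]}.

    Assumed variant: weight = (count / total words) * ln(N / document frequency).
    """
    n_objects = len(contexts)
    # Document frequency: in how many objects' contexts each word occurs.
    doc_freq = Counter()
    for words in contexts.values():
        doc_freq.update(set(words))

    vectors = {}
    for name, words in contexts.items():
        counts = Counter(words)
        total = len(words)
        vectors[name] = {
            word: (count / total) * math.log(n_objects / doc_freq[word])
            for word, count in counts.items()
        } if total else {}
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy contexts; real contexts would come from the literature.
contexts = {
    "mirna_a": ["apoptosis", "tumor", "tumor", "proliferation"],
    "mirna_b": ["tumor", "proliferation", "metastasis"],
    "mirna_c": ["cardiac", "hypertrophy", "fibrosis"],
}
vectors = tf_idf_vectors(contexts)
sim_ab = cosine(vectors["mirna_a"], vectors["mirna_b"])
sim_ac = cosine(vectors["mirna_a"], vectors["mirna_c"])
print(f"a~b: {sim_ab:.3f}  a~c: {sim_ac:.3f}")
```

In this toy data, the two miRNAs that share context words end up closer in the vector space than the pair with disjoint contexts, which is the behavior the distributional hypothesis relies on.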

Our work made it possible to associate new miRNAs with diseases. For diseases on which miRNA research has been less extensive, we suggest lists of putative candidates to target as a priority in future laboratory experiments. Some unexpected results pointed to putative false associations stored in databases. During cancer development, chronic activation of the inflammatory response influences the pattern of circulating miRNAs, which adds noise to the observed associations. Our method could help with this difficulty by highlighting the components that belong to this noise.

Study of triplex topology

Double-stranded DNA can form triple-helical structures by accommodating a third nucleotide strand. This ‘triplex-forming oligonucleotide’ can be a short RNA sequence that binds the major groove of the targeted duplex only when the duplex presents a run of purine or pyrimidine bases on one of the two DNA strands. Many nuclear proteins are known to bind triplexes, but their biological functions are still understudied.
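The sequence constraint described above can be sketched as a simple scan for uninterrupted purine (A/G) or pyrimidine (C/T) runs. This is only an illustration: the function name, the minimum run length and the example sequence are assumptions, and real triplex prediction would also have to account for the pairing rules and for tolerated mismatches. Because a purine run on one strand is a pyrimidine run on the complementary strand, scanning a single strand for both kinds of run covers both strands.

```python
import re


def find_tracts(sequence, min_length=12):
    """Return (start, end, kind) for each uninterrupted purine or pyrimidine run.

    Coordinates are 0-based and end-exclusive. min_length is an illustrative
    threshold, not a value taken from the study.
    """
    seq = sequence.upper()
    pattern = re.compile(r"[AG]{%d,}|[CT]{%d,}" % (min_length, min_length))
    tracts = []
    for match in pattern.finditer(seq):
        kind = "purine" if match.group()[0] in "AG" else "pyrimidine"
        tracts.append((match.start(), match.end(), kind))
    return tracts


# Made-up sequence: two 12-nt runs separated by mixed bases.
example = "ACGT" + "AGGAAGGAGAAG" + "CATG" + "CTTCCTCTTCCT" + "GACT"
for start, end, kind in find_tracts(example, min_length=10):
    print(start, end, kind, example[start:end])
```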

We identified the sequences capable of acting as ‘triplex-forming oligonucleotides’ in both the lncRNA and mRNA collections of Drosophila melanogaster. These motifs were matched against the Drosophila genome in order to identify putative sites of triplex formation in intergenic regions, promoters and introns/exons. Gene Ontology analysis was used to test for GO term enrichment among the genes associated with these sequence lists. Significantly enriched gene networks among the lists are linked to potential perfect triplexes according to the Hoogsteen pairing rules. Potential triplex structures embedded in Drosophila genome networks appear to localize selectively in gene networks involved in embryogenesis and larval development. Within these networks, the triplex structures seem to target neurogenesis at high frequency. Triplex mapping at full-genome scale suggests a massive epigenetic regulatory process acting on gene networks.
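The text does not say which statistical test was used for the GO term enrichment. A common choice for this kind of analysis is the one-sided hypergeometric test, sketched below under that assumption; the function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    sample:     genes in the list of interest
    hits:       genes in the list that carry the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 background genes, 5 annotated with the term,
# 5 genes near putative triplex sites, 3 of which carry the term.
p = enrichment_p_value(population=20, annotated=5, sample=5, hits=3)
print(f"p = {p:.4f}")
```

When many GO terms are tested, the resulting p-values would normally be corrected for multiple testing before any term is called enriched.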

References

1. Pasquier C, Sanhes J, Flouvat F, Selmaoui-Folcher N: Frequent pattern mining in attributed trees: algorithms and applications. Knowledge and Information Systems 2016, 46:491–514.