Course BIF7002 (Bioinformatics Seminar)

Winter 2022

Practical information

Instructor: Vladimir Makarenkov.

Location and schedule: Tuesdays from 5:30 p.m. to 8:30 p.m. on Zoom (see your UQAM emails for the Zoom link).

Course web page: http://www.info2.uqam.ca/~makarenkov_v/BIF7002/BIF7002.html.

Email: makarenkov.vladimir(at)uqam.ca

Course format

This course is based on lectures given by researchers in disciplines related to bioinformatics: computer science, mathematics, biology, and biochemistry. It consists of the lectures (see the calendar below) followed by a session of student presentations.

Evaluation

The evaluation has three components: an attendance and participation mark (20%), a report on one lecture (40%), and an oral presentation (40%).

The attendance and participation mark is based on regular attendance and on contributions to the sessions (relevant questions or discussion, etc.). It counts for 20% of the final grade.

After each lecture, a designated team of students prepares a report of about ten pages, due no later than three weeks after the lecture. The report, which must also be submitted as a web page, is evaluated on the following criteria: quality of the writing, command of the scientific aspects of the problem, and original contribution (in particular, deeper treatment of the questions raised during the lectures, presentation and critique of experimental results, etc.). It counts for 40% of the final grade.

During the last session of the term, each team of students gives an oral presentation of its report, about twenty minutes long, which counts for 40% of the final grade. The main aspects considered in grading are the pedagogical and scientific quality of the presentation.

Examples of reports are available on the following web page:

http://www.info2.uqam.ca/~makarenkov_v/BIF7002/BIF7002_exemples_rapports.html.

Calendar

Tuesday, January 11

Presentation of the BIF7002 course, selection of lectures.

Tuesday, January 18

1) Presentation on DESS internships by the internship officer

2) Vladimir Makarenkov (Professor, Département d'informatique, UQAM)

Title: High-throughput screening: effective detection and removal of systematic bias

Abstract: High-throughput screening (HTS) is a modern technology for discovering new drugs. The screening procedure must be largely automated to be practical (more than 100,000 chemical compounds are often analyzed per day). Measurement quality is essential to the search for promising compounds (i.e., hits), which are potential candidates to become new drugs. During measurement, several biases, random or systematic, can occur. They may be caused by handling errors, faulty sensors, aging of the compounds, etc. The methods we have proposed, called Background correction and Well correction, aim to correct systematic bias in order to reduce its impact on experimental measurements. We created the HTS Corrector software, which implements these methods and presents the results numerically and graphically to better visualize the effects of systematic bias. The proposed methods were tested on real and simulated data to demonstrate their effectiveness.
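As a rough illustration of what correcting a per-well systematic bias can look like, here is a minimal Python sketch. It is not the algorithm implemented in HTS Corrector: it assumes a simplified scheme in which each well position is standardized (z-score) across all plates of an assay, so that a position that reads consistently high or low is brought back to a common scale. The function name `zscore_per_well` and the toy plates are hypothetical.

```python
from statistics import mean, pstdev


def zscore_per_well(plates):
    """Standardize each well position across plates (simplified illustration).

    plates: list of plates; each plate is a list of rows of float measurements.
    All plates are assumed to share the same layout.
    """
    n_rows, n_cols = len(plates[0]), len(plates[0][0])
    corrected = [[[0.0] * n_cols for _ in range(n_rows)] for _ in plates]
    for r in range(n_rows):
        for c in range(n_cols):
            # Measurements taken at the same well position on every plate.
            values = [plate[r][c] for plate in plates]
            mu, sigma = mean(values), pstdev(values)
            for k, v in enumerate(values):
                # A constant well carries no signal; map it to 0.
                corrected[k][r][c] = (v - mu) / sigma if sigma > 0 else 0.0
    return corrected


# Toy assay: three 2x2 plates; well (0, 0) reads about 10 units too high everywhere.
plates = [
    [[11.0, 1.0], [2.0, 1.5]],
    [[12.0, 2.0], [1.0, 2.5]],
    [[13.0, 3.0], [3.0, 0.5]],
]
corrected = zscore_per_well(plates)
```

After this step, wells (0, 0) and (0, 1) of the toy data end up on the same scale, because the constant offset at position (0, 0) has been removed. A real correction would also need to avoid erasing genuine hits, which this sketch does not attempt.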

Presentation by Vladimir Makarenkov + the whiteboard

Article by Malo et al. 2006 (Nature Biotechnology)

Article by Caraus et al. 2015 (Briefings in Bioinformatics)

The HTS Corrector software (Makarenkov et al. 2006, Bioinformatics)

Tuesday, January 25

Jocelyn Bédard (Département d'informatique, UQAM)

Title: Investigation of the effect of copy number variants on the expression of genes and the IQ (DESS project in Prof. Sebastien Jacquemont’s group at the CHSJ research center)

Abstract: Copy number variants (CNVs) have previously been found to be linked to many human diseases and disabilities (e.g., autism, Down syndrome). They have also been shown to have a negative effect on carriers’ intellectual capacity, observed as a reduced IQ. Since CNVs can be expected to have various effects on gene expression, we aim to determine whether CNVs affect IQ at least in part through an effect on gene expression. For this purpose, we used the CARTaGENE cohort, in which cognitive, genotypic and transcriptomic data are available for several individuals. The cognitive data were used as an indicator of IQ, the genotypic data allowed us to identify the CNVs present, and the transcriptomic data were used to assess gene expression levels. My tasks focused mainly on the analysis of the cognitive data and the transcriptomic data. The latter required filtering, normalization and correction for batch effects before we could use the data in our study. I will present the results of several statistical methods used to assess any correlation between gene expression and IQ, as well as the effect of CNVs on gene expression. In general, some important genes were upregulated or downregulated strongly enough to draw significant conclusions.

Presentation by Jocelyn Bédard

Report by Lei Cao

Tuesday, February 1

Denis Tverskoi (National Institute for Mathematical and Biological Synthesis, University of Tennessee)

Title: The evolution of germ-soma specialization under different genetic and environmental effects

Abstract: Division of labor exists at different levels of biological organization - from cell colonies to human societies. One of the simplest examples of the division of labor in multicellular organisms is germ-soma specialization, which plays a key role in the evolution of organismal complexity. Here we formulate and study a general mathematical model exploring the emergence of germ-soma specialization in colonies of cells. We consider a finite population of colonies competing for resources. Colonies are of the same size and are composed of asexually reproducing haploid cells. Each cell can contribute to the activity and fecundity of the colony; these contributions are traded off. We assume that all cells within a colony are genetically identical but that gene expression is affected by variation in the microenvironment experienced by individual cells. Through analytical theory and evolutionary agent-based modeling, we show that the shape of the trade-off relation between somatic and reproductive functions, the type and extent of variation in the within-colony microenvironment, and, in some cases, the number of genes involved are important predictors of the extent of germ-soma specialization. Specifically, increasing convexity of the trade-off relation, the number of different environmental gradients acting within a colony, and the number of genes (in the case of random microenvironmental effects) promote the emergence of germ-soma specialization. Overall, our results contribute towards a better understanding of the role of genetic, environmental, and microenvironmental factors in the evolution of germ-soma specialization.

Presentation by Denis Tverskoi

Tuesday, February 8

Nadia Tahiri (Professor, Département d’Informatique, Université de Sherbrooke)

Title: Quantitative structure-activity relationship (QSAR) modeling of the placental transfer of environmental contaminants

Abstract: The growing diversity of potentially fetotoxic compounds in the environment is a public health concern. The objective of this work was to contribute to the development of rapid and efficient methods for assessing prenatal exposure to them. Quantitative structure-activity relationship (QSAR) modeling emerged as a method of choice for developing a predictive model of the placental transfer of contaminants. Fetal-to-maternal blood concentration ratios for 105 contaminants were compiled from the literature, and 214 molecular descriptors were generated. Ten predictive models were developed using the Molecular Operating Environment (MOE) software and the Python and R programming languages. The training and test data sets were used, respectively, to build and validate the models. The Applicability Domain v1.0 tool was used to determine the applicability domain (AD). The models built with partial least squares regression in MOE and with SuperLearner in R showed the best precision and predictivity, with internal coefficients of determination (R2) of 0.88 and 0.82, cross-validated R2 of 0.72 and 0.57, and external R2 of 0.73 and 0.74, respectively. The coverage of all molecules in the test set by the applicability domain demonstrated the reliability and relevance of the model predictions. The results show that the developed models can help quantify fetal exposure to toxic environmental compounds from maternal blood concentrations.
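For readers unfamiliar with the R2 values quoted above, the following minimal Python sketch shows the usual definition of the coefficient of determination, R2 = 1 - SS_res / SS_tot. The abstract does not state which exact variant was used for the cross-validated and external values (some QSAR studies use the training-set mean in the denominator, for example), so this is an assumption; the function name and the toy numbers are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left after applying the model.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the observed mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy example with made-up values (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)
```

Computed on the training set, this gives an internal R2; computed on held-out molecules, it gives an external R2, which is the more demanding test of predictivity.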

Presentation by Nadia Tahiri

Nadia Tahiri’s article

Tuesday, February 15

Stéphane Samson and Vladimir Makarenkov (Département d'informatique, UQAM)

Title: Analysis of recombination and horizontal gene transfer in SARS-CoV-2

Abstract: The current SARS-CoV-2 pandemic is among the most dangerous infectious diseases to have emerged in recent history. It has previously been hypothesized that the human coronavirus strains of the SARS epidemics passed from bats to humans through intermediate hosts such as civets (SARS-CoV) and camels (MERS-CoV). Recent studies suggest that the SARS-CoV-2 genome is very similar to the coronavirus of certain bats for most of its genes, and to certain coronavirus strains from Malayan pangolins for the receptor-binding (RB) domain of the spike protein. In this talk, we will present the results of an analysis detecting recombination events and horizontal gene transfers in the 11 main genes of SARS-CoV-2, in order to better understand the mechanisms behind its emergence in humans and the role of these potential intermediate hosts. We will also present our new software, SimPlot++, which measures and visualizes the genetic similarity between the species (or groups of species) under study.

Presentation by Stéphane Samson

Article by Makarenkov, Mazoure, Rabusseau and Legendre (BMC Ecology and Evolution, 2021) on the evolution of SARS-CoV-2 genes

Link to the SimPlot++ software

Report by Dihia Baloul, Marina Marinelli and Audrey-Ann Sicard

Tuesday, February 22

Jeremy Charlier (AI specialist, Banque Nationale du Canada)

Title: Novel Encoding of sgRNA-DNA Sequences for Effective Off-Target Prediction in Gene Editing with Deep Learning

Abstract: Off-target predictions are crucial in gene editing research to improve existing prediction methods. Recently, significant progress has been achieved in the prediction of off-target mutations, particularly with CRISPR-Cas9 data, thanks to the use of deep learning. CRISPR-Cas9 is a precise gene editing technique allowing manipulations of DNA fragments. The encoding of sgRNA-DNA sequences for deep neural networks is a complex process, which significantly impacts the prediction accuracy. In this context, we propose a novel encoding of sgRNA-DNA sequences that is capable of aggregating the involved sequence data without any loss of information. In our experiments, we compare our novel encoding with the state-of-the-art sgRNA-DNA encoding. We demonstrate the superior accuracy of our approach in our simulations involving Feedforward Neural Networks (FFN) and Convolutional Neural Networks (CNN). We highlight the universality of our results by building several FFNs and CNNs with various layer depths and performing predictions on two popular public gene editing data sets, the CRISPOR data set and the GUIDE-seq data set. In all our experiments, the new encoding led to more accurate off-target prediction results, providing an improvement in the area under the ROC curve (AUC) of up to 35%.
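To make the idea of encoding "without any loss of information" concrete, here is a small Python sketch contrasting two ways of turning an sgRNA-DNA pair into a matrix. Neither function is claimed to be the encoding presented in the talk or in the article; they are hypothetical illustrations, assuming one-hot encoding over the alphabet A, C, G, T (with the sgRNA written in DNA letters). Merging the two one-hot vectors with a logical OR cannot tell which strand carried which base, whereas stacking them keeps both.

```python
BASES = "ACGT"


def one_hot(seq):
    """One-hot encode a sequence: one 4-element row per nucleotide."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


def or_merge(sgrna, dna):
    """Merge the two encodings position by position with a logical OR (4 channels)."""
    return [
        [x | y for x, y in zip(a, b)]
        for a, b in zip(one_hot(sgrna), one_hot(dna))
    ]


def stack(sgrna, dna):
    """Concatenate the two encodings position by position (8 channels)."""
    return [a + b for a, b in zip(one_hot(sgrna), one_hot(dna))]


# Two different mismatched pairs: (A vs G, G vs A) and (G vs A, A vs G).
same_after_or = or_merge("AG", "GA") == or_merge("GA", "AG")
same_after_stack = stack("AG", "GA") == stack("GA", "AG")
```

Here `same_after_or` is true and `same_after_stack` is false: the OR merge maps the two distinct pairs to the same matrix, while the stacked form distinguishes them at the cost of twice as many channels.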

Presentation by Jeremy Charlier

Article by Charlier, Nadon and Makarenkov (Bioinformatics journal, 2021) on the use of deep learning methods in genome editing

Report by Wasmi Algasim, Salwa Haidar and Zeinab Sherkatghanad

Tuesday, March 1

Reading week!

Tuesday, March 8

Alpha Boubacar Diallo (Senior Director of Bioinformatics, Pacific Biosciences)

Title: Second and Third generation sequencing applications, challenges and beyond

Abstract: During this presentation, we will focus on short-read and long-read sequencing data generation and analysis. We will discuss some of the main challenges and issues and show how we can overcome them over the next few years.

Presentation by Alpha Boubacar Diallo

Report by Florent Guilloteau and Patrice Naud

Tuesday, March 15

Vladimir Reinharz (Professor, Département d’Informatique, UQAM)

Title: Graphs for the detection of complex structural motifs in RNA

Abstract: RNA molecules fulfill a large number of fundamental tasks in every living organism. To achieve this vast array of functions, from transmitting information to acting as biological sensors, they rely on complex three-dimensional structures. A low-level representation that only considers canonical base pairs, called the secondary structure, has mathematical properties that make it suitable for study under a Boltzmann ensemble framework. Dynamic programming algorithms have proven particularly well suited to understanding the link between secondary structure and sequence in that framework. Yet this is not enough to fully grasp the fine networks of interactions, critical to function, that are not captured by the secondary structure. The Leontis-Westhof annotations of non-canonical interactions classify all interactions beyond those in the secondary structure.

This ontology allows RNA molecules to be represented in much more detail, as directed graphs with labelled edges. The discovery of conserved sub-structures can be transposed to the problem of maximal edge sub-isomorphisms. While this problem is classically NP-hard, we can take advantage of structural properties to restrict the ensemble of admissible graphs. In this talk, I will present the algorithms we developed for that case and interesting results that were obtained [1]. In particular, I will highlight the hierarchical organization of sub-structures and how they are spread over vastly different functions. I will then speculate briefly on the role of chemical modifications and on future work.

Reference:

[1] Mining for recurrent long-range interactions in RNA structures reveals embedded hierarchies in network families; Reinharz, Soule, Westhof, Waldispuhl and Denise; NAR, 2018

Presentation by Vladimir Reinharz

Tuesday, March 22

Mohamed Amine Remita (Département d’Informatique, UQAM)

Title: An Evolutionary-based Variational Generative Models for Biological Sequences

Abstract: Generative frameworks designed for genomics data are emerging as powerful approaches to study complex phenomena in biology, including protein functions and structures, single-cell RNA-seq analyses and phylogenetic-based studies. However, when studying processes derived from molecular evolution, most deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences, as is done within the Bayesian phylogenetic inference framework. Here we propose a method for a variational Bayesian generative model that jointly approximates the true posterior of local biological evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as the generalized time reversible model. The architecture of our method consists of a set of deep variational encoders that infer the parameters of evolutionary-latent-variable distributions and allow sampling, and a generative model that computes probability transition matrices from sampled latent variables and generates a distribution of sequence alignments from reconstructed ancestral states. We train the model via a low-variance variational objective function and a site-wise stochastic gradient ascent algorithm. Experimentally, we show the effectiveness and efficiency of the method on synthetic sequence alignments simulated with several evolutionary schemas and on real aligned viral DNA sequences.

Presentation by Mohamed Amine Remita

Tuesday, March 29

Bogdan Mazoure (Artificial intelligence researcher, MILA laboratory, Google Brain and McGill University)

Title: Introduction to Markov decision processes and applications

Abstract: Since their invention, Markov chains have played a fundamental role in stochastic process analysis and in statistical modeling applications such as weather, stock markets, and, more recently, even text generation. A less widely known class of models extends Markov chains with an additional component, called “actions”, which allows Markov decision processes to solve complex sequential decision-making problems such as game playing. This lecture will first cover fundamental concepts of Markov chains, then show how they can be easily generalized by an MDP (Markov Decision Process) framework. Finally, applications of MDPs in the field of bioinformatics will be presented.
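The step from Markov chains to MDPs can be sketched in a few lines of Python. This is an illustrative sketch, not material from the lecture: it assumes a toy two-state weather chain, then a toy two-state, two-action MDP solved with standard value iteration. All names, transition probabilities, rewards and the discount factor `gamma` are hypothetical.

```python
def step_distribution(dist, P):
    """One Markov chain step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Value iteration for a finite MDP.

    P[a][s][t]: probability of moving from state s to t under action a.
    R[a][s]: expected immediate reward for taking action a in state s.
    Returns the state values and a greedy policy (one action index per state).
    """
    n_states, n_actions = len(P[0]), len(P)
    V = [0.0] * n_states
    while True:
        # Action values: immediate reward plus discounted expected future value.
        Q = [
            [
                R[a][s] + gamma * sum(P[a][s][t] * V[t] for t in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        new_V = [max(q) for q in Q]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V, [q.index(max(q)) for q in Q]
        V = new_V


# Markov chain: states 0 = sunny, 1 = rainy; no actions, only transitions.
P_weather = [[0.9, 0.1], [0.5, 0.5]]
tomorrow = step_distribution([1.0, 0.0], P_weather)

# MDP: action 0 = stay, action 1 = switch state; landing in state 1 pays 1.
P_mdp = [
    [[1.0, 0.0], [0.0, 1.0]],  # stay
    [[0.0, 1.0], [1.0, 0.0]],  # switch
]
R_mdp = [
    [0.0, 1.0],  # stay: reward only if already in state 1
    [1.0, 0.0],  # switch: reward only when leaving state 0 for state 1
]
values, policy = value_iteration(P_mdp, R_mdp)
```

In the chain, the next-day distribution is fully determined by the transition matrix. In the MDP, the agent picks which transition matrix applies at each step, and value iteration finds the choice that maximizes the discounted sum of rewards (here: switch from state 0, stay in state 1).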