Modeling Mood and Emotional Patterns from Speech in Bipolar Disorder
Lecture Presentation
Video Transcription
Hello, and welcome. I'm Amy Cohen, Associate Director for SMI Advisor and a clinical psychologist. I am pleased that you are joining us for today's SMI Advisor webinar, Modeling Mood and Emotional Patterns from Speech in Bipolar Disorder. SMI Advisor, also known as the Clinical Support System for Serious Mental Illness, is an APA and SAMHSA initiative devoted to helping clinicians implement evidence-based care for those living with serious mental illness. Working with experts from across the SMI clinician community, our interdisciplinary effort has been designed to help you get the answers you need to care for your patients. And now I'd like to introduce the faculty for today's webinar, Dr. Melvin McInnis. Dr. McInnis, the Thomas B. and Nancy Upjohn Woodworth Professor of Bipolar Disorder and Depression, is the Director of the Heinz C. Prechter Bipolar Research Program, leading a team of over 30 faculty and staff on several projects focused on bipolar disorder. These projects include collaborative programs using induced pluripotent stem cells to model bipolar disorder, the use of mobile technology to monitor and predict mood state changes in the illness, and assessments of the cognitive capacity of individuals with bipolar disorder. Additionally, Dr. McInnis is the Associate Director for Research at the University of Michigan Depression Center. Dr. McInnis, thank you for leading today's webinar.

Thank you very much. It's truly an honor to present this webinar. The work I'm presenting today was done in collaboration with Emily Mower Provost of the Computer Science and Engineering Department here at the University of Michigan. Here are our disclosures, and here are our learning objectives.

First, I want to talk a little bit about how I understand bipolar disorder. I'm delighted that the DSM-5 now has energy as a fundamental component, a sort of screening element for bipolar mania, and in fact I think of bipolar disorder as having energy as its fundamental feature. What you're seeing on the screen is how we generally understand bipolar disorder as a dynamic state. When we're teaching our residents and students, we describe mania as an elevated, high-energy state and depression as a low-energy state, and we often draw a sinusoidal curve to describe the situation. But when we look at patients, talk with patients, and monitor patients carefully over a longer period of time, we all appreciate that manic and depressive symptoms fluctuate as a series of dynamic states. In red here you see the scores on the Hamilton Depression Rating Scale, and in blue the scores on the Young Mania Rating Scale. Each of the ticks at the bottom represents a weekly assessment by a clinician, and you can see individuals going up and down over the course of the year.

So a number of questions emerge. Can we describe these patterns? Can we develop a formula, or somehow put a computational measure to it? And most importantly, can we anticipate change? Many of us, in the course of our training and our clinical work, talk with family members, and it's not uncommon to hear a family member say: I could just hear it in their voice; there was something going on a couple of weeks ago; I couldn't quite put my finger on it, but there was something going on.
That led us to develop a hypothesis about speech. Our hypothesis is simply this: are there features in the acoustics of speech that could serve as a proxy measure for our internal emotional, mood, and affective state? Speech is modulated by our internal neurophysiological states, and we use speech daily in our assessment of individuals: we describe the nature of the speech, and the content of the speech is of course important.

One of the things that I've learned about is the concept of question zero. Question zero asks: why are you doing this project, and what do you hope to accomplish from it? The purpose of the research I'm going to tell you about is to identify biologically relevant markers of bipolar disorder in speech. Why are we doing this? We want to develop strategies and models to describe individuals with bipolar disorder over time, and hopefully to develop predictive models that could be prognostic. Could we predict whether someone is developing a manic episode or going into a depressed state? The relevance should be crystal clear: we use speech daily to monitor our patients, whether on the units or in our outpatient services; speech is fundamental to our interactions with our patients.

As it turns out, speech has been studied as a source of early warning signs for quite some time; people have been writing about it for 30 or 40 years, and many illnesses have been examined, including depression, autism (there is a bit of literature on autism), and neurodegenerative disorders. When I looked in the literature and on the web to see when people started using mobile devices, I found an example from Sabrina, 1954: there is Humphrey Bogart using a mobile phone that didn't actually work, but it was portrayed very nicely in the movie.

Now, the limitation of the work that has been done is that it has largely been confined to laboratory or otherwise controlled settings and to fixed texts, with people saying similar things over and over again. The strategy we have employed, and the program we are developing, is called Predicting Individual Outcomes for Rapid Intervention, or PRIORI. It uses an application that records outgoing speech in a secure manner and uploads it to the web; the engineers apply feature extraction, again very securely, and our goal is to identify points in time for intervention, or a warning pattern. We now have over 80,000 calls in our database and roughly four to five thousand hours of speech.

The PRIORI project is based in a longitudinal study of bipolar disorder in which we follow individuals over time; we have followed individuals for up to approximately one year, and we are developing mood recognition systems. The data I'm going to talk about are based on bipolar I and bipolar II individuals and healthy controls. Here again is our schematic of what happens classically when we think about depression and mania: if you think about those particular times, can we develop a signal?
We're here in Michigan, so we think in automotive terms: the emoji there are like the check light that comes on in one's dashboard. If we can identify that signal, can we mitigate the severity of the depression and the mania? That emphasizes the goals we have for this work.

Our data set includes two types of calls. We have personal calls, which are made over the course of the day as the individual goes about his or her business. We also have assessment calls, which are made with a research clinician; it's a clinical interaction over the phone in which we use the Young Mania Rating Scale and the Hamilton Depression Rating Scale to rate the individual's mood, particularly over the past week. The assessment calls are anchored on a weekly basis, and the personal calls are grouped according to the assessment call.

Now comes the fun part. I'm going to present the work in three stages, and at the end of the talk we'll get into what is really a fourth stage. We began with rhythm several years ago. It is well known that individuals with depression exhibit speech that is slow, so the question was: can we extract features of rhythm to categorize the speech as depressed or manic? Rhythm is something like "a sad soul sitting on a sofa singing a sorrowful song" — an exaggerated example. What the computational models do is segment and sub-segment the speech, extract the rhythm features of each sub-segment, and then apply call-level statistics across the segments. Approximately 217 statistics are extracted, and they are subjected to an analysis called support vector machines, which essentially finds the plane that separates sets of data, to determine whether we can distinguish mood states, either mania or depression, from the euthymic state.

The result was that yes, we were able to do this, with an area under the receiver operating characteristic curve of about 0.7. The area under the curve reflects how well the classifier separates true positives from false positives: 0.5 corresponds to chance, and 0.7 means there is an increased likelihood of identifying true positives relative to false positives. Below you see the reference for this work. We have published with the engineers; the computer science and engineering community likes to publish in conference proceedings, so there are a number of these four-page publications in the literature. They are easily findable on Google Scholar, though often not in the PubMed literature that medical professionals are aware of.

We plateaued for a while at this level of performance, around 0.7. The next phase of the work used a strategy referred to as identity vectors, or i-vectors. I-vectors were initially developed for speaker identification tasks, and the technology is used by intelligence agencies in their surveillance strategies to identify who is speaking in a room.
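To make the rhythm stage concrete, here is a minimal Python sketch of the general recipe described above: call-level statistics are fed to a support vector machine, and performance is scored as the area under the ROC curve. The features and labels below are random placeholders standing in for real rhythm statistics and clinician ratings, not the PRIORI pipeline itself.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 217))        # one row per call, ~217 call-level statistics
y = rng.integers(0, 2, size=200)       # 1 = manic/depressed week, 0 = euthymic (placeholder)

# Support vector machine: finds the separating plane between the two classes.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))

# Cross-validated scores for each call, then the area under the ROC curve;
# 0.5 is chance, higher values mean true positives outrank false positives.
scores = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
print("AUC:", round(roc_auc_score(y, scores), 3))
```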
There are many applications of i-vectors, including language recognition, dialect and accent recognition, speaker diarization, and so on. What the approach does, in essence, is take an audio signal and distill it into a compact numerical representation. Many of you will be familiar with a similar idea from a program called Shazam: you turn it on, it listens to some music, and it tells you that Stairway to Heaven, or whatever the song may be, is playing on the radio.

We rapidly get into a series of highly computational strategies, and I'm showing you this really to emphasize the complexity of the approach used in the i-vector-based project. If you look at the center of the screen, you see a column with the personal calls from the PRIORI data set and the assessment calls. The personal calls are used to take the speech over the course of the week and train a universal background model, the UBM. So it's not just how I sound in the context of an assessment call, if I were the individual calling in, talking with a clinical researcher and undergoing a Hamilton or Young Mania rating; rather, the personal calls over the previous week provide the background, and the i-vector extraction captures how I sound relative to those personal calls. Combined with the feature extraction from the assessment calls, this generates an output. We were enthusiastic that there was an improvement in the area under the curve, from 0.7 to 0.78, but it wasn't necessarily a home run, so we wanted to identify methods to improve our approach.

Now, in psychiatry, emotions are important. What is the relationship between emotions and mood? Mood prediction is incredibly challenging: mood is not directly observable; it is something the individual experiences. Emotions are also things individuals experience, but moods operate on a longer timescale than emotions. The model we are working with here considers timescale in the context of mood disorders, self-reported emotion and expressions, and the brevity of speech content. I want to emphasize that the timeline there is simply the model we are working around, and I think most of us know and appreciate the well-established saying that all models are wrong, but some are useful. We are looking to develop a useful model around which to identify emotions and to use those metrics to help us understand mood. Put as a really simple question: can a metric of emotion simplify mood prediction? In many respects, we can consider emotion dysregulation one of the primary symptoms in bipolar disorder. This slide just emphasizes emotion: it is the expression of a soccer player after a victory in a game, and you can see the emotion being expressed there.

Now I want to take you through a rather complex series of experiments: our emotion annotation pipeline. In going over this in lab meetings and working with the engineers, we recalled that we had initially informed our participants that we would not be listening to the personal calls. That was a major part of the initial phase of the project; we were very optimistic that we could simply use the assessment calls, analyze the personal calls computationally, and come up with some conclusions.
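The i-vector front end rests on a universal background model (UBM): a large Gaussian mixture trained on pooled personal-call speech that describes how speech "in general" sounds, against which an individual call is summarized. A full i-vector extractor (the total-variability model) is beyond a short example, but the UBM step and the per-call statistics it yields can be sketched roughly as below, using scikit-learn's GaussianMixture as a stand-in for specialized speech tooling and random arrays in place of real frame-level acoustic features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Placeholder frame-level features (e.g., MFCC-like vectors) pooled from many
# personal calls: rows are short analysis frames, columns are acoustic dimensions.
background_frames = rng.normal(size=(5000, 20))

# Universal background model: a Gaussian mixture over the pooled frames.
ubm = GaussianMixture(n_components=64, covariance_type="diag", random_state=0)
ubm.fit(background_frames)

# For one assessment call, accumulate its statistics against the UBM.
# An i-vector extractor would compress these zeroth- and first-order
# statistics into a single low-dimensional vector describing the call.
call_frames = rng.normal(size=(300, 20))
posteriors = ubm.predict_proba(call_frames)     # (n_frames, n_components)
zeroth_order = posteriors.sum(axis=0)           # occupancy of each component
first_order = posteriors.T @ call_frames        # weighted sums of the features
print(zeroth_order.shape, first_order.shape)    # (64,), (64, 20)
```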
As I showed you earlier, the area under the curve was 0.7, and despite two or three years of really knuckling down and working on it, we were unable to get our results any better than that. So we had to go back to the drawing board and start looking at how we would annotate our data. Our IRB told us we could potentially do that, provided individuals gave written informed consent for us to listen to their personal calls and annotate them, and we were very pleased that we had a group of individuals who agreed to that.

We selected 12 individuals based on their having established mood episodes, defined as YMRS and HAM-D scores greater than 10; you can debate the validity or reliability of such criteria, but those are the criteria we used. We had almost a thousand hours of speech from these 12 individuals to work with. Through a series of segmentation, pre-processing, and exclusion steps, we narrowed down the number of segments available for analysis. The first pass brought us down to about 170,000 segments. We then narrowed the sample set further by excluding segments that were very brief — somebody calling and getting no answer, for example — as well as segments of speech longer than 30 seconds. The rationale is that we wanted very short segments for the annotators, because the design called for them to listen to the sound rather than the content; in a segment of any great length, there would be too much content to maintain an appreciation of the sound alone. We then identified 1,200 personal call segments for each subject and randomly selected 10 segments from each assessment call, which winnowed it down to about 17,000 segments. In the final pass we excluded segments that had too much background noise or that contained identifiable information, since we did not want annotators to know who an individual was. So you can appreciate that there is a considerable amount of work behind generating this data set, which ended up being about 25 hours and nearly 14,000 segments, 11,400 of which were personal calls.

So how do we define emotion? What is emotion? The problem with emotion is that there are many words and many categories: anger, happiness, shame, disgust, and all kinds of other descriptors. We define emotion in just two dimensions. The first is activation, which might be considered energy — and going back to my earlier point about energy being a fundamental element of bipolar disorder, that fits nicely. The second is valence, which is simply a positive-negative dimension. Activation would be energized speech: "Gee, that's a really great idea!" has energy in it, whereas "Oh, I don't think I want to do that" is low activation. Valence captures the positive or negative quality: "Wow, that's great!" I'll show you the scale in the next slide, but we had 11 annotators between the ages of 21 and 34. All were native speakers of English, and I should say that the participants in the project were also all U.S.-born speakers of English as their mother tongue. The annotators were trained.
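The winnowing described above is essentially a filtering-and-sampling exercise over segment metadata. A rough sketch of that kind of logic follows; the duration bounds, field names, and records are illustrative placeholders, not the study's actual pre-processing code.

```python
import random

# Hypothetical segment records produced by an automatic segmenter.
segments = [
    {"subject": "S01", "call_type": "personal", "duration_s": 4.2, "noisy": False, "identifiable": False},
    {"subject": "S01", "call_type": "assessment", "duration_s": 7.8, "noisy": False, "identifiable": False},
    {"subject": "S02", "call_type": "personal", "duration_s": 41.0, "noisy": False, "identifiable": False},
    # ... many more in the real data set
]

MIN_S, MAX_S = 3.0, 30.0  # drop very brief segments and anything longer than 30 seconds

def usable(seg):
    """Short enough to mask content, long enough to rate, and free of
    heavy background noise or identifying information."""
    return MIN_S <= seg["duration_s"] <= MAX_S and not seg["noisy"] and not seg["identifiable"]

kept = [s for s in segments if usable(s)]

# Sample a fixed number of personal-call segments per subject.
by_subject = {}
for seg in kept:
    if seg["call_type"] == "personal":
        by_subject.setdefault(seg["subject"], []).append(seg)

sampled = {subj: random.sample(segs, min(1200, len(segs))) for subj, segs in by_subject.items()}
print({subj: len(segs) for subj, segs in sampled.items()})
```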
We had several training sessions, and the annotators were to rely only on the acoustic characteristics of the speech, not the content; that is why we ended up with roughly six-second segments. We are looking at the subject specificity of emotion expression, with emotion again defined as valence and activation. The photo on this slide shows the wonderful individuals who were hired and paid — primarily graduate students here at the University of Michigan — simply to spend time listening to and rating the speech. Our computer scientists developed a program to feed the information to the annotators.

On the slides here you see the self-assessment manikins. The raters would receive a speech segment, listen to it, and rate the valence and the activation according to these scales. It is a nine-point scale: on the far right is nine, and on the far left is one. So they were simply listening to the speech and then rating the valence and the activation. On the upper left you see a two-dimensional plot showing a distribution of activation plotted against valence. On the lower right, the individuals are excited and negative, which might align with an angry emotion; on the upper right you have excitability with positive valence, so someone who is very, very happy — as with the soccer player on the earlier slide, expressing his emotion after winning the game. On the bottom line you see the manuscript that describes this; it is the work of Soheil Khorram, a postdoctoral fellow on our team, who was responsible for this project.

So how is this assessed? Now we get into the machine learning models that many of us are excited about, because as we go forward in research and in the computational modeling of complex data sets such as this one, machine learning allows for the integration of a series of data points in what is referred to as a neural network. Why is it called a neural network? Because it functions somewhat like the brain: going back to our neuroanatomy, you remember how nerve cells are connected to one another, how one cell can receive input from many other cells and send outputs to other cells. That is essentially the functionality of a neural network. The eGeMAPS shown there — the extended Geneva Minimalistic Acoustic Parameter Set — is a standardized series of acoustic parameters that can be applied to data sets. The feed-forward neural network is shown on the right; you can see the connections that are made, with multiple layers and multiple nodes. The experiments we have done are based on networks of approximately four, eight, and twelve layers, with a hundred nodes per layer.
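As a rough illustration of the feed-forward approach just described — several fully connected layers of about a hundred nodes mapping a fixed acoustic parameter set to ratings of valence and activation — here is a minimal PyTorch sketch. The 88-dimensional input mirrors the size of the eGeMAPS functional set; the features, targets, and training loop are placeholders.

```python
import torch
import torch.nn as nn

def make_mlp(n_layers: int = 4, width: int = 100, n_inputs: int = 88) -> nn.Sequential:
    """Feed-forward network: n_layers hidden layers of `width` nodes each,
    ending in two outputs, the predicted valence and activation."""
    layers, in_dim = [], n_inputs
    for _ in range(n_layers):
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 2))
    return nn.Sequential(*layers)

model = make_mlp(n_layers=4)                     # 4-, 8-, or 12-layer variants
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder batch: 32 segments x 88 acoustic functionals; targets are
# annotator ratings of (valence, activation) on the 1-9 scale.
features = torch.randn(32, 88)
ratings = torch.rand(32, 2) * 8 + 1

for _ in range(10):                              # toy training loop
    optimizer.zero_grad()
    loss = loss_fn(model(features), ratings)
    loss.backward()
    optimizer.step()
print("final training loss:", round(loss.item(), 3))
```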
The second approach is called a convolutional neural network. You can think of it in any number of dimensions: one layer influences the next, and the result is an amalgam of the input across a series of layers, based on the nodes within each layer. These data are based on mel-frequency filter bank features, which are derived from the power spectrum of the sound by transforming it into a series of dimensions that form the input to the convolutional neural networks. On the lower right you can see the features that are extracted — the mel filter banks extracted from the sound. The input features go into convolutional kernels that feed into the layers, and the outcomes are then assessed through a series of computations.

We got rather excited by these findings, because there was a significant correlation between the emotion measures and the model outputs from both the eGeMAPS features and the mel filter banks as processed by the convolutional neural networks. The take-home point is that some systems work better than others, but at least they are consistent: while the convolutional neural networks seemed to work better than the feed-forward networks, both were significant. Over the years I've had scenarios where I analyzed data with two different systems and one worked well while the other didn't, so I was reassured that here both worked, with one working better.

Activation was more accurately recognized than valence, and that was of interest to us: why wasn't valence as well correlated? It turns out that activation rests largely on the energy in the sound — "Gosh, John, wow!" — whereas valence draws substantially on the actual words themselves — "Well, that's so nice." So linking mood and emotion is working: activation and valence are significantly correlated with mood severity. Emotion, considered simply as valence and activation, varies in line with mood severity, and we found that most encouraging.

The next step takes us into phase four, which is really new data. The findings are available as a preprint — I believe it is Gideon et al. — and the paper has been written and submitted; in the current era it is most reassuring that we can find venues to get the data out there. So what do we see in real life? The top half of this screen shows the kind of information you would extract from a medical record. We have research data on all of our participants, as shown in the graph below, but we also have access to medical record information from which we can extract additional data: social worker notes and comments, lab values such as a lithium level of 0.5, and so on. There is a whole plethora of data that we can access.
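The mel filter bank features mentioned above come from the short-time power spectrum of the audio warped onto the mel frequency scale; a convolutional network then slides small kernels over that time-frequency map. A condensed sketch, assuming librosa and PyTorch are available and using a synthetic tone in place of a real call segment:

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

# Synthetic one-second waveform standing in for a short call segment.
sr = 8000                                        # telephone-band sample rate
t = np.linspace(0, 1, sr, endpoint=False)
wave = (0.1 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)

# Log mel filter bank features: the power spectrogram projected onto 40 mel bands.
mel = librosa.feature.melspectrogram(y=wave, sr=sr, n_fft=512, hop_length=128, n_mels=40)
log_mel = librosa.power_to_db(mel)               # shape: (40 bands, n_frames)

# A small convolutional front end over the (band, frame) plane, pooled over
# time and frequency, with a linear head predicting (valence, activation).
x = torch.tensor(log_mel).unsqueeze(0).unsqueeze(0)   # (batch, channel, bands, frames)
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 2),
)
print(cnn(x).shape)                              # torch.Size([1, 2])
```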
As many of you know, there are strategies out there for using electronic health record data in a research capacity, and there are several ways of doing it. We are doing it in a rather old-fashioned way, by going through the data: I had a medical student spend several months reading through the information on all of our study participants. We then started looking at the graphs — what is going on over time for an individual — and the clinicians would review this, try to rank whether someone was in a mixed or manic episode or whether they were stable, and devise a way to code the various states the individuals were in.

Then comes the question zero that I talked about earlier: when do we intervene? Can we identify a period of time in which we could potentially intervene, and how can we test this hypothesis? How can we look at our data and ask whether we could have picked something up? We have a rather clever programmer in our group, and in our discussions he developed a program that allows a clinician observer to follow along, in something like real time, what a clinician might have seen in the clinic. There are several windows. The top one shows the mania and depression ratings, with scores from the Hamilton and the Young Mania rating scales. The lower window presents to the clinical researcher the clinical notes and lab results from the selected week. The lower third is a series of buttons to click, indicating whether you would or would not flag this individual, plus another button that pushes the data forward so that you are presented with the following week. It is a rather clever way of testing, in what might be considered an analogy to real time, whether a clinician would have done something.

So this is a question of when to intervene. The concept of necessary clinical adjustments has also been introduced: when would an adjustment have happened? When would somebody have made a change in the outpatient treatment? When would they have done anything? The result is that, yes, we were able to predict the need to intervene in about half of the cases we looked at. The top screen there is a rating of mood, with zero being euthymia, and the clinician simply clicks, based on the forward feed of the data and the information received, when a clinician might suggest that this would be a good time to intervene. They were simply following the data over the course of time, looking at the feed and saying yes or no — this is when I would intervene. On the top figure you can see three triangles indicating that the clinician felt strongly that there was a need for intervention at those times.
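At its core, the forward-feed review tool described above is a loop that presents one week of ratings, notes, and labs at a time and records whether the reviewer would flag that week for intervention. A minimal command-line sketch of that idea, with hypothetical records and field names rather than the actual program:

```python
# Hypothetical weekly records assembled from assessment calls and chart review.
weeks = [
    {"week": 1, "ymrs": 3, "hamd": 4, "notes": "Stable, sleeping well.", "labs": "Li 0.7"},
    {"week": 2, "ymrs": 9, "hamd": 6, "notes": "More irritable, less sleep.", "labs": "Li 0.5"},
    {"week": 3, "ymrs": 14, "hamd": 7, "notes": "Pressured speech noted by family.", "labs": "Li 0.4"},
]

flags = []
for record in weeks:  # the data are fed forward one week at a time
    print(f"Week {record['week']}: YMRS={record['ymrs']}, HAM-D={record['hamd']}")
    print(f"  Notes: {record['notes']}  Labs: {record['labs']}")
    answer = input("Flag this week for intervention? [y/n] ").strip().lower()
    flags.append({"week": record["week"], "flagged": answer == "y"})

print("Weeks flagged:", [f["week"] for f in flags if f["flagged"]])
```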
Blindly, we then had the computer science and engineering folks perform a computational evaluation of the acoustics, using the methods I showed you earlier — these convolutional neural networks — to look for anomalies in the acoustic patterns that would suggest something is different, that this might be a time when one might intervene. Sadly, the two are not aligned perfectly, but you can see from the x-axis that the times identified overlap in this particular individual. I'm showing you this to demonstrate that there are early indicators, and that we can identify times when one should intervene.

Our conclusion is that we have a very unique emotional data set, with a large number of emotion-annotated speech segments; it is what the engineers call an "in the wild" telephonic data set annotated for mood and emotion. The mood symptoms correlate with emotion — with activation and valence — and we have an ongoing process of annotation going forward. I should also say something about the challenges and difficulties we face. Context is important, for example: how do we assess context? That is something we are actively working on at the moment. But we are convinced, enthusiastic, and energized. I'm a clinician, a psychiatrist, working with the computer science and engineering folks; I am enthused by their energy and enthusiasm, the atmosphere is infectious, and it is just phenomenal to be in this space. I love it.

So what are the potential uses? As I've discussed, there is a potentially predictive, or prognostic — whichever word we want to use — application to the illness course and the anticipation of the need for interventions. We have another study going on in suicide that is showing promising results as well. There is digital phenotyping: could we develop some kind of metric, analogous to the hemoglobin A1c in diabetes, that would give a measure of how things have been over a past period of time — the ability to measure the instability of an individual's mood, with valence and activation as proxy measures in the speech parameters we are looking at?

As we conclude, what can be gained from this presentation? I think all of us in the clinical field appreciate that speech is fundamental to humanity, fundamental to how we interact with each other, and fundamental to how we evaluate our patients when they come to our clinics. We ask them how they're doing, we listen to their speech, we listen to the form of their speech — whether it's fast or slow — and then we pay attention to the content. As I hinted at and shared with you, activation, which corresponds roughly to the form, has a stronger correlation with mood than valence, which is more content-driven. Emotions are often short-lived bursts that may be expressed in speech, and they are not always present at the time of our evaluations. We all have patients, for example, who come into our clinics, we evaluate them, and they hold it together for that short period of time.
But one of the signs we see is that if there is a family member in the waiting room, they are there for a purpose, and they will tell you: listen, my family member had outbursts over the past week that really made me concerned. Many categories of emotion occur in daily life, and the categories are difficult to characterize. We have to have objective ways of evaluating emotion and relating it to mood, and it is our thesis that it is simply more efficient to consider activation and valence, place them on a two-dimensional grid, and give an assessment from that. So: activation — energy, form; valence — positive or negative, content.

I believe that was my last slide, and here is my acknowledgement slide. We are very fortunate to have the support of so many entities and individuals behind this work. I particularly want to acknowledge the Heinz C. Prechter Bipolar Research Fund and the Richard Tam Foundation, but we also could not do this without our participant collaborators, who are very enthusiastic and very willing and contribute so much of their time and energy to our work. And of course there is a host of federal and other institutions, including NAMI, the National Alliance on Mental Illness, which has been so kind as to invite us to present at their conferences, along with all the other institutions that support our work. With that, I thank you very much and look forward to your questions.

Thank you so much for an interesting presentation, Dr. McInnis. There was a lot of data in there. It's fascinating.
Video Summary
The video is a presentation by Dr. Melvin McInnis on modeling mood and emotional patterns from speech in bipolar disorder. Dr. McInnis is the Thomas B. and Nancy Upjohn Woodworth Professor of Bipolar Disorder and Depression and the Director of the Heinz C. Prechter Bipolar Research Program. The presentation is part of the SMI Advisor webinar series, an initiative devoted to helping clinicians implement evidence-based care for those living with serious mental illness.

Dr. McInnis discusses the use of speech as a proxy measure for internal emotional mood and affective states in individuals with bipolar disorder. He explains research that extracts acoustic features from speech and uses machine learning algorithms to categorize the speech as manic or depressive. He also discusses the correlation between emotional measures (activation and valence) and mood severity, as well as the potential use of speech analysis for predicting and intervening in mood episodes. Dr. McInnis presents findings from studies using convolutional neural networks to analyze the acoustic patterns of speech and identify anomalies that could suggest the need for intervention.

The presentation highlights the unique emotional dataset created for this research and acknowledges the support of various institutions and participant collaborators. Dr. McInnis concludes by stating the potential uses of this research, including prediction of and early intervention in mood episodes, digital phenotyping, and measuring mood instability.
Keywords
Dr. Melvin McInnis
Modeling mood
Emotional patterns
Speech analysis
Bipolar disorder
Machine learning algorithms
Convolutional neural networks
Digital phenotyping
Funding for SMI Adviser was made possible by Grant No. SM080818 from SAMHSA of the U.S. Department of Health and Human Services (HHS). The contents are those of the author(s) and do not necessarily represent the official views of, nor an endorsement by, SAMHSA/HHS or the U.S. Government.