In this week's Deep Learning Paper Review, we look at the following paper: DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation.
What's Exciting about this Paper
This week's paper leverages a Relational Graph Convolutional Network for automatic emotion recognition in conversation (ERC). It addresses a limitation of current approaches to conversations with multiple speakers. As shown in Fig. 1, the “Okay” from speaker 1 is a response to speaker 2’s complaints; an RNN-based approach would classify it as neutral because it ignores this speaker-level dependency. The paper is exciting because it offers a new solution: treating the conversation as a directed graph over the utterances and modeling it with relational graph transformations.
The Paper's Key Findings
In recent years, Graph Neural Networks have proven effective at modeling dependencies between nodes in a graph. DialogueGCN captures both temporal and speaker dependencies by defining edge relations and applying a relational graph transformation. Empirically, it outperforms prior approaches at recognizing emotions in multi-party conversations with long time-frame inputs.
Our Takeaways
The key takeaway is to encode these enriched dependencies into the utterances’ representations. The proposed model, DialogueGCN, uses RNNs (bidirectional GRUs in the paper) to embed the sequential utterances; the sequence-encoded representations are then used to construct a relational graph and passed through a stack of GCN layers.
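Below is a minimal PyTorch sketch of this sequential encoding stage, assuming each utterance has already been turned into a fixed-size feature vector; all dimensions here are hypothetical choices, not the paper's settings.

```python
import torch
import torch.nn as nn

utterance_dim = 100  # per-utterance feature size (assumed)
hidden_dim = 64      # GRU hidden size (assumed)

# One conversation: 10 utterances, each already encoded as a feature vector.
# nn.GRU expects (seq_len, batch, feature) by default.
utterances = torch.randn(10, 1, utterance_dim)

# A bidirectional GRU contextualizes each utterance with its sequential neighbors.
context_encoder = nn.GRU(utterance_dim, hidden_dim, bidirectional=True)
context_states, _ = context_encoder(utterances)  # shape: (10, 1, 2 * hidden_dim)
```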
The relational graph consists of {nodes, edge weights, edge relations}. The nodes are the utterance representations, and each is connected to the others within a context window (e.g. from the past 15 utterances to the future 15 utterances). The edge weights are learnable weights in the graph transformation. Most importantly, the edge relations are defined based on (1) the nodes’ relative position to one another (e.g. the 1st utterance occurs in the past relative to the 3rd utterance); (2) the nodes’ speaker identities (e.g. an utterance from speaker 1 has a “speaker 1 towards speaker 2” relation to an utterance from speaker 2). For M speakers this yields 2M² distinct relation types.
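Here is a small sketch of how such a graph could be built; the relation-id encoding and the direction convention below are one possible scheme for illustration, not the paper's exact construction.

```python
# Hypothetical helper: build the relational graph edges for one conversation.
# speakers[i] is the speaker id of utterance i; `window` is the context window size.
def build_edges(speakers, window=15):
    num_speakers = len(set(speakers))
    edges = []  # (source, target, relation_id) triples
    for i in range(len(speakers)):
        lo = max(0, i - window)
        hi = min(len(speakers), i + window + 1)
        for j in range(lo, hi):
            # The relation depends on the ordered speaker pair and the temporal
            # direction, giving 2 * num_speakers**2 relation types in total.
            direction = 0 if j <= i else 1  # 0: j is past (or self), 1: j is future
            relation = (speakers[j] * num_speakers + speakers[i]) * 2 + direction
            edges.append((j, i, relation))  # edge j -> i with its relation id
    return edges

edges = build_edges([0, 1, 0, 1, 1])  # toy two-speaker conversation (assumed ids)
```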
Through the relational graph transformation, each node aggregates neighborhood information from the nodes sharing the same relation, which encodes both the speaker dependency and the temporal dependency into the node features. In the end, the GCN outputs are concatenated with the RNN-encoded outputs and passed to a fully connected classifier to obtain the emotion labels.
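A minimal sketch of this graph stage, using PyTorch Geometric's stock RGCNConv in place of the paper's exact layers (the paper additionally weights edges with learned attention scores and follows the relational layer with a plain graph convolution, which this sketch does not reproduce); all sizes and the toy edges are hypothetical.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import RGCNConv  # assumes PyTorch Geometric is installed

seq_dim, graph_dim, num_classes = 128, 64, 6
num_relations = 8  # 2 * M**2 for M = 2 speakers

conv1 = RGCNConv(seq_dim, graph_dim, num_relations)
conv2 = RGCNConv(graph_dim, graph_dim, num_relations)
classifier = nn.Linear(seq_dim + graph_dim, num_classes)

x = torch.randn(5, seq_dim)                        # GRU-encoded utterance nodes
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])  # (source, target) pairs
edge_type = torch.tensor([0, 3, 5])                # relation id per edge

# Each relational layer aggregates neighbors per relation type with its own weights.
h = torch.relu(conv1(x, edge_index, edge_type))
h = torch.relu(conv2(h, edge_index, edge_type))

# Concatenate sequential and graph features, then classify each utterance.
logits = classifier(torch.cat([x, h], dim=-1))     # shape: (5, num_classes)
```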