Neuron Interaction Based Representation Composition for Neural Machine Translation

arXiv: v1 [cs.CL] 22 Nov 2019

Jian Li (1,2), Xing Wang (3), Baosong Yang (4), Shuming Shi (3), Michael R. Lyu (1,2), Zhaopeng Tu (3)
1 Department of Computer Science and Engineering, The Chinese University of Hong Kong
2 Shenzhen Research Institute, The Chinese University of Hong Kong
3 Tencent AI Lab
4 University of Macau

Abstract

Recent NLP studies reveal that substantial linguistic information can be attributed to single neurons, i.e., individual dimensions of the representation vectors. We hypothesize that modeling strong interactions among neurons helps to better capture complex information by composing the linguistic properties embedded in individual neurons. Starting from this intuition, we propose a novel approach to compose representations learned by different components in neural machine translation (e.g., multi-layer networks or multi-head attention), based on modeling strong interactions among neurons in the representation vectors. Specifically, we leverage bilinear pooling to model pairwise multiplicative interactions among individual neurons, and a low-rank approximation to make the model computationally feasible. We further propose extended bilinear pooling to incorporate first-order representations. Experiments on WMT14 English⇒German and English⇒French translation tasks show that our model consistently improves performance over the SOTA TRANSFORMER baseline. Further analyses demonstrate that our approach indeed captures more syntactic and semantic information as expected.

Introduction

Deep neural networks (DNNs) have advanced the state of the art in various natural language processing (NLP) tasks, such as machine translation (Vaswani et al. 2017), semantic role labeling (Strubell et al. 2018), and language representations (Devlin et al. 2019).
The strength of DNNs lies in their ability to capture different linguistic properties of the input with different layers (Shi, Padhi, and Knight 2016; Raganato and Tiedemann 2018), and composing (i.e., aggregating) these layer representations can further improve performance by providing more comprehensive linguistic information about the input (Peters et al. 2018; Dou et al. 2018).

Recent NLP studies show that single neurons in neural models, which are defined as individual dimensions of the representation vectors, carry distinct linguistic information (Bau et al. 2019). A follow-up work further reveals that simple properties such as coordinating conjunction (e.g., "but/and") or determiner (e.g., "the") can be attributed to individual neurons, while complex linguistic phenomena such as syntax (e.g., part-of-speech tag) and semantics (e.g., semantic entity type) are distributed across neurons (Dalvi et al. 2019). These observations are consistent with recent findings in neuroscience, which show that task-relevant information can be decoded from a group of neurons interacting with each other (Morcos and Harvey 2016). One question naturally arises: can we better capture complex linguistic phenomena by composing/grouping the linguistic properties embedded in individual neurons?

Corresponding author: Zhaopeng Tu. Work was partially done when Jian Li and Baosong Yang were interning at Tencent AI Lab.
Copyright © 2020, Association for the Advancement of Artificial Intelligence. All rights reserved.

The starting point of our approach is an observation in neuroscience: stronger neuron interactions, i.e., directly exchanging signals between neurons, enable more information processing in the nervous system (Koch, Poggio, and Torre 1983).
We believe that simulating the neuron interactions in the nervous system would be an appealing alternative for representation composition, which can potentially better learn the compositionality of natural language with subtle operations at a smaller granularity. Concretely, we employ bilinear pooling (Lin, RoyChowdhury, and Maji 2015), which executes pairwise multiplicative interactions among individual representation elements, to achieve strong neuron interactions. We also introduce a low-rank approximation to make the original bilinear models computationally feasible (Kim et al. 2017). Furthermore, as bilinear pooling only encodes multiplicative second-order features, we propose extended bilinear pooling to incorporate first-order representations, which can capture more comprehensive information about the input sentences.

We validate the proposed neuron-interaction based (NI-based) representation composition on top of multi-layer multi-head self-attention networks (MLMHSANs). The reason is two-fold. First, MLMHSANs are critical components of various SOTA DNN models, such as TRANSFORMER (Vaswani et al. 2017), BERT (Devlin et al. 2019), and LISA (Strubell et al. 2018). Second, MLMHSANs involve compositions of both multi-layer representations and multi-head representations, which lets us investigate the universality of NI-based composition. Specifically,

- First, we conduct experiments on the machine translation task, a benchmark to evaluate the performance of neural models. Experimental results on the widely-used WMT14 English⇒German and English⇒French data show that the NI-based composition consistently improves performance over TRANSFORMER across language pairs. Compared with existing representation composition strategies (Peters et al. 2018; Dou et al. 2018), our approach shows its superiority in efficacy and efficiency.
- Second, we carry out linguistic analysis (Conneau et al. 2018) on the learned representations from the NMT encoder, and find that NI-based composition indeed captures more syntactic and semantic information as expected.

These results provide support for our hypothesis that modeling strong neuron interactions helps to better capture complex linguistic information via advanced composition functions, which is essential for downstream NLP tasks.

This paper is an early step in exploring neuron interactions for representation composition in NLP tasks, which we hope will be a long and fruitful journey. We make the following contributions:

- Our study demonstrates the necessity of modeling neuron interactions for representation composition in deep NLP tasks. We employ bilinear pooling to simulate the strong neuron interactions.
- We propose extended bilinear pooling to incorporate first-order representations, which produces a more comprehensive representation.
- Experimental results show that representation composition benefits the widely-employed MLMHSANs by aggregating information learned by multi-layer and/or multi-head attention components.

Background

Multi-Layer Multi-Head Self-Attention

In the past two years, MLMHSAN-based models have established SOTA performance across different NLP tasks. The main strength of MLMHSANs lies in the powerful representation learning capacity provided by the multi-layer and multi-head architectures.

MLMHSANs perform a series of nonlinear transformations from the input sequences to the final output sequences. Specifically, MLMHSANs are composed of a stack of $L$ identical layers (multi-layer), each of which is calculated as

$$H^l = \text{SELF-ATT}(H^{l-1}) + H^{l-1}, \tag{1}$$

where a residual connection is employed around each of two layers (He et al. 2016). $\text{SELF-ATT}(\cdot)$ is a self-attention model, which captures dependencies among hidden states in $H^{l-1}$:

$$\text{SELF-ATT}(H^{l-1}) = \text{ATT}(Q^l, K^{l-1})\, V^{l-1}, \tag{2}$$

where $\{Q^l, K^{l-1}, V^{l-1}\}$ are the query, key and value vectors that are transformed from the lower layer $H^{l-1}$, respectively.
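Equations 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration only: the text above does not spell out $\text{ATT}(\cdot)$, so the sketch assumes the scaled dot-product softmax attention of Vaswani et al. (2017), and the projection names `W_q`, `W_k`, `W_v` and the toy sizes are hypothetical. Layer normalization and feed-forward sublayers are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_att_layer(H_prev, W_q, W_k, W_v):
    """One layer of Equation 1: H^l = SELF-ATT(H^{l-1}) + H^{l-1}."""
    # Query, key and value are all transformed from the lower layer H^{l-1}.
    Q, K, V = H_prev @ W_q, H_prev @ W_k, H_prev @ W_v
    d = Q.shape[-1]
    # ATT(Q, K): assumed here to be scaled dot-product attention, shape (n, n).
    att = softmax(Q @ K.T / np.sqrt(d))
    # Equation 2 (att @ V) plus the residual connection of Equation 1.
    return att @ V + H_prev

rng = np.random.default_rng(0)
n, d, L = 5, 8, 3                    # toy sizes: n tokens, hidden size d, L layers
H = rng.normal(size=(n, d))
layers = [H]                         # keep every H^l; these are what gets composed later
for _ in range(L):
    W_q, W_k, W_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    H = self_att_layer(H, W_q, W_k, W_v)
    layers.append(H)
```

Keeping the list `layers` around is the point of the sketch: multi-layer composition operates on all of the $H^l$, not only on the top one.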
Instead of performing a single attention function, Vaswani et al. (2017) found it beneficial to capture different context features with multiple individual attention functions (multi-head). Concretely, the multi-head attention model first transforms $\{Q, K, V\}$ into $H$ subspaces with different, learnable linear projections: [1]

$$Q_h, K_h, V_h = Q W_h^Q,\; K W_h^K,\; V W_h^V, \tag{3}$$

where $\{Q_h, K_h, V_h\}$ are respectively the query, key, and value representations of the $h$-th head, and $\{W_h^Q, W_h^K, W_h^V\}$ denote the parameter matrices associated with the $h$-th head. $H$ self-attention functions (Equation 2) are applied in parallel to produce the output states $\{O_1, \ldots, O_H\}$. Finally, the $H$ outputs are concatenated and linearly transformed to produce a final representation:

$$H = [O_1, \ldots, O_H]\, W^O, \tag{4}$$

where $W^O \in \mathbb{R}^{d \times d}$ is a trainable matrix.

Representation Composition

Composing (i.e., aggregating) representations learned by different layers or attention heads has been shown to be beneficial for MLMHSANs (Dou et al. 2018; Ahmed, Keskar, and Socher 2018). Without loss of generality, from here on we refer to $\{r_1, \ldots, r_N\} \in \mathbb{R}^d$ as the representations to compose, where $r_i$ can be a layer representation ($H^l$, Equation 1) or a head representation ($O_h$, Equation 4). The composition is expressed as

$$H = \text{COMPOSE}(r_1, \ldots, r_N), \tag{5}$$

where $\text{COMPOSE}(\cdot)$ can be an arbitrary function, such as linear combination [2] (Peters et al. 2018; Ahmed, Keskar, and Socher 2018) or hierarchical aggregation (Dou et al. 2018). Although effective to some extent, these approaches do not model neuron interactions among the representation vectors, which we believe is valuable for representation composition in deep NLP models.

Approach

Motivation

Different types of neurons in the nervous system carry distinct signals (Cohen et al. 2012). Similarly, neurons in deep NLP models, i.e., individual dimensions of representation vectors, carry distinct linguistic information (Bau et al. 2019; Dalvi et al. 2019).
Studies in neuroscience reveal that stronger neuron interactions bring more information processing capability (Koch, Poggio, and Torre 1983), which we believe also applies to deep NLP models. In this work, we explore the strong neuron interactions provided by bilinear pooling for representation composition.

Bilinear pooling (Lin, RoyChowdhury, and Maji 2015) is a recently proposed feature fusion approach in the vision field. Instead of linearly combining all representations, bilinear pooling executes pairwise multiplicative interactions among individual representations, to model full neuron interactions as shown in Figure 1(a). Note that there are many possible ways to implement the neuron interactions. The aim of this paper is not to explore this whole space but simply to show that one fairly straightforward implementation works well on a strong benchmark.

[1] Here we skip the layer index for simplification.
[2] The linear composition of multi-head representations (Equation 4) can be rewritten in the format of a weighted sum: $O = \sum_{h=1}^{H} O_h W_h^O$ with $W_h^O \in \mathbb{R}^{\frac{d}{H} \times d}$.

Figure 1: Illustration of (a) bilinear pooling that models fully neuron-wise multiplicative interaction, and (b) extended bilinear pooling that captures both second- and first-order neuron interactions.

Bilinear Pooling for Neuron Interaction

Bilinear Pooling

Bilinear pooling (Tenenbaum and Freeman 2000) is defined as an outer product of two representation vectors followed by a linear projection. As illustrated in Figure 1(a), all elements of the two vectors have direct multiplicative interactions with each other. However, in the scenario of multi-layer and multi-head composition, we generally have more than two representation vectors to compose (i.e., $L$ layers and $H$ attention heads). To utilize the full second-order (i.e., multiplicative) interactions in bilinear pooling, we concatenate all the representation vectors and feed the concatenated vector twice to the bilinear pooling. Concretely, we have:

$$R = \lfloor \hat{R} \hat{R}^\top \rfloor\, W^B, \tag{6}$$

$$\hat{R} = [r_1, \ldots, r_N], \tag{7}$$

where $\hat{R} \hat{R}^\top \in \mathbb{R}^{Nd \times Nd}$ is the outer product of the concatenated representation $\hat{R}$, and $\lfloor \cdot \rfloor$ denotes serializing the matrix into a vector with dimensionality $(Nd)^2$. In this way, all elements in the partial representations are able to interact with each other in a multiplicative way.

However, the size of the parameter matrix $W^B \in \mathbb{R}^{(Nd)^2 \times d}$ and the computing cost increase cubically with dimensionality $d$, which becomes problematic when training or decoding on a GPU with limited memory. [3] There have been a few attempts to reduce the computational complexity of the original bilinear pooling. Gao et al. (2016) propose compact bilinear pooling to reduce the quadratic expansion of dimensionality for image classification. Kim et al. (2017) and Kong and Fowlkes (2017) propose low-rank bilinear pooling for visual question answering and image classification respectively, which further reduces the parameters to be learned and achieves comparable effectiveness with full bilinear pooling. In this work, we focus on the low-rank approximation for its efficiency, and generalize from the original model for deep representations.

[3] For example, a regular TRANSFORMER model requires a huge amount of 36 billion ($(Nd)^2 \times d$) parameters for $d = 1000$ and $N = 6$.

Low-Rank Approximation

In the full bilinear models, each output element $R_i \in \mathbb{R}^1$ can be expressed as

$$R_i = \sum_{j=1}^{Nd} \sum_{k=1}^{Nd} w^B_{jk,i}\, \hat{R}_j \hat{R}_k = \hat{R}^\top W_i^B \hat{R}, \tag{8}$$

where $W_i^B \in \mathbb{R}^{Nd \times Nd}$ is a weight matrix to produce output element $R_i$. The low-rank approximation enforces the rank of $W_i^B$ to be low, $r \le Nd$ (Pirsiavash, Ramanan, and Fowlkes 2009), so that it can be factorized as $U_i V_i^\top$ with $U_i \in \mathbb{R}^{Nd \times r}$ and $V_i \in \mathbb{R}^{Nd \times r}$.
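As a numerical sanity check of Equations 6 to 8 and of the parameter count in footnote 3, the following NumPy sketch builds full bilinear pooling on toy sizes. The sizes, the random weights, and the helper name `full_bilinear_params` are illustrative assumptions, not part of the paper; row-major flattening is assumed for the serialization operator.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 3, 4                                   # toy sizes: N representations of dimension d
reps = [rng.normal(size=d) for _ in range(N)]

R_hat = np.concatenate(reps)                  # Equation 7: concatenation, shape (N*d,)
outer = np.outer(R_hat, R_hat)                # every pairwise product R_j * R_k, shape (N*d, N*d)
W_B = rng.normal(size=((N * d) ** 2, d))      # projection matrix of Equation 6
R = outer.reshape(-1) @ W_B                   # serialize (row-major), then project to d dims

# Equation 8: output element i is the quadratic form R_hat^T W_i^B R_hat,
# where W_i^B is column i of W_B reshaped back to (N*d, N*d).
W_0 = W_B[:, 0].reshape(N * d, N * d)
R_0 = R_hat @ W_0 @ R_hat

def full_bilinear_params(N, d):
    """Number of entries in W^B, i.e. (Nd)^2 * d."""
    return (N * d) ** 2 * d
```

With the sizes from footnote 3, `full_bilinear_params(6, 1000)` evaluates to 36,000,000,000, which is why the full model is impractical and a factorized form of $W_i^B$ is attractive.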
Accordingly, Equation 8 can be rewritten as

$$R_i = \hat{R}^\top U_i V_i^\top \hat{R} = (\hat{R}^\top U_i \odot \hat{R}^\top V_i)\, \mathbb{1}_r, \tag{9}$$

where $\mathbb{1}_r$ is an $r$-dimensional vector of ones and $\odot$ represents the element-wise product. By replacing $\mathbb{1}_r$ with $P \in \mathbb{R}^{r \times d}$, and redefining $U \in \mathbb{R}^{Nd \times r}$ and $V \in \mathbb{R}^{Nd \times r}$, the low-rank approximation can be defined as

$$R = (\hat{R}^\top U \odot \hat{R}^\top V)\, P. \tag{10}$$

In this way, the computation complexity is reduced from $O(d^3)$ to $O(d^2)$, and the parameter matrices $U$, $V$, and $P$ now fit in GPU memory.

Extended Bilinear Pooling with First-Order Representation

Previous work in information theory has proven that second-order and first-order representations encode different types of information (Goudreau et al. 1994), which we believe also holds for NLP tasks. As bilinear pooling only encodes second-order (i.e., multiplicative) interactions among individual neurons, we propose the extended bilinear pooling to inherit the advantages of first-order representations and form a more comprehensive representation. Specifically, we append 1s to the representation vectors. As illustrated in Figure 1(b), we append 1 to each of the two $\hat{R}$ vectors, so that their outer product produces both second-order and first-order interactions among the elements.

According to Equation 10, the final representation is revised as:

$$R_f = \left( \begin{bmatrix} \hat{R} \\ 1 \end{bmatrix}^\top U \odot \begin{bmatrix} \hat{R} \\ 1 \end{bmatrix}^\top V \right) P, \tag{11}$$

where $\hat{R}$ is the concatenated representation as in Equation 7. As a result, the final representation $R_f$ preserves both multiplicative bilinear features (as in Equation 10) and first-order linear features (as in Equation 4).
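Equations 9 to 11 can be illustrated with a short NumPy sketch. It is a toy check under stated assumptions rather than the authors' implementation: the sizes and random weights are arbitrary, the function names are hypothetical, and $U$ and $V$ are assumed to gain one extra row (shape $(Nd+1) \times r$) when the constant 1 is appended, which the text implies but does not state.

```python
import numpy as np

def low_rank_bilinear(R_hat, U, V, P):
    """Equation 10: R = (R_hat^T U * R_hat^T V) P, with * the element-wise product."""
    return ((R_hat @ U) * (R_hat @ V)) @ P

def extended_bilinear(R_hat, U, V, P):
    """Equation 11: append a constant 1 to R_hat, then pool as in Equation 10."""
    R_ext = np.append(R_hat, 1.0)
    return low_rank_bilinear(R_ext, U, V, P)

def low_rank_params(N, d, r):
    """Entries in U and V (N*d x r each) plus P (r x d), as defined for Equation 10."""
    return 2 * N * d * r + r * d

rng = np.random.default_rng(0)
N, d, r = 3, 4, 5
R_hat = rng.normal(size=N * d)

# Equation 9: the quadratic form with W_i = U_i V_i^T equals a sum of element-wise products.
U_i = rng.normal(size=(N * d, r))
V_i = rng.normal(size=(N * d, r))
quadratic = R_hat @ (U_i @ V_i.T) @ R_hat
factorized = ((R_hat @ U_i) * (R_hat @ V_i)).sum()

# Equation 11 with the assumed (N*d + 1, r) shapes for U and V.
U = rng.normal(size=(N * d + 1, r))
V = rng.normal(size=(N * d + 1, r))
P = rng.normal(size=(r, d))
R_f = extended_bilinear(R_hat, U, V, P)

# Expanding the product shows which terms the appended 1 contributes.
a, b = R_hat @ U[:-1], R_hat @ V[:-1]   # projections of the original neurons
u, v = U[-1], V[-1]                     # rows that multiply the appended 1
second_order = (a * b) @ P              # the plain bilinear features of Equation 10
first_order = (a * v + u * b) @ P       # terms linear in R_hat
constant = (u * v) @ P                  # a bias-like term
```

The decomposition `second_order + first_order + constant` reproduces `R_f`, which makes concrete why appending 1s yields both second-order and first-order features. As a rough size comparison, `low_rank_params(6, 1000, 512)` is 6,656,000, using the footnote 3 sizes together with an illustrative rank of 512.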

#   Model                                       # Para.   Train   Decode
1   TRANSFORMER-BASE                            88.0M
    Existing representation composition
2   + Multi-Layer: Linear Combination           +3.1M
3   + Multi-Layer: Hierarchical Aggregation     +23.1M
4   + Multi-Head: Hierarchical Aggregation      +13.6M
5   + Both (3+4)                                +36.7M
    This work: neuron-interaction based representation composition
6   + Multi-Layer: NI-based Composition         +16.8M
7   + Multi-Head: NI-based Composition          +14.1M
8   + Both (6+7)                                +30.9M

Table 1: Translation performance on the WMT14 English⇒German translation task. "# Para." denotes the number of parameters, and "Train" and "Decode" respectively denote the training speed (steps/second) and decoding speed (sentences/second). We compare our model with linear combination (Peters et al. 2018) and hierarchical aggregation (Dou et al. 2018).

[4] The original result in (Dou et al. 2018) is 28.63, which is case-insensitive. As we report case-sensitive scores, we have requested Dou et al. to get this result.

Applying to TRANSFORMER

TRANSFORMER (Vaswani et al. 2017) consists of an encoder and a decoder, each of which is a stack of 6 layers, where we can apply multi-layer composition (excluding the embedding layer) to produce the final representations of the encoder and decoder. Besides, each layer has one (in the encoder) or two (in the decoder) multi-head attention components with $H$ heads, to which we can apply multi-head composition to substitute Equation 4. The two sorts of representation composition can be used individually, while combining them is expected to further improve the performance.

Experiments

Setup

Dataset We conduct experiments on the WMT2014 English⇒German (En⇒De) and English⇒French (En⇒Fr) translation tasks. The En⇒De dataset consists of about 4.56 million sentence pairs. We use newstest2013 as the development set and newstest2014 as the test set. The En⇒Fr dataset consists of million sentence pairs. We use the concatenation of newstest2012 and newstest2013 as the development set and newstest2014 as the test set. We employ BPE (Sennrich, Haddow, and Birch 2016) with 32K merge operations for both language pairs. We adopt the case-sensitive 4-gram NIST BLEU score (Papineni et al. 2002) as our evaluation metric and bootstrap resampling (Koehn 2004) for the statistical significance test.

Models We evaluate the proposed approaches on the advanced TRANSFORMER model (Vaswani et al. 2017), and implement them on top of the open-source toolkit THUMT (Zhang et al. 2017). We follow Vaswani et al. (2017) to set the configurations and have reproduced their reported results on the En⇒De task. The parameters of the proposed models are initialized from the pre-trained TRANSFORMER model. We have tested both Base and Big models, which differ in hidden size (512 vs. 1024) and number of attention heads (8 vs. 16). Concerning the low-rank parameter (Equation 9), we set the low-rank dimensionality $r$ to 512 and 1024 in the Base and Big models respectively. All models are trained on eight NVIDIA P40 GPUs, each of which is allocated a batch size of 4096 tokens.

In consideration of computation cost, we study model variations with the Base model on the En⇒De task, and evaluate overall performance with the Big model on both En⇒De and En⇒Fr tasks.

Comparison to Existing Approaches

In this section, we evaluate the impact of different representation composition strategies on the En⇒De translation task with TRANSFORMER-BASE, as listed in Table 1.

Existing Representation Composition (Rows 1-5) The conventional TRANSFORMER model adopts multi-head composition with linear combination but only uses the top-layer representation as its default setting. Accordingly, we keep the linear multi-head composition (Row 1) unchanged, and choose two representative multi-layer composition strategies (Rows 2 and 3): the widely-used linear combination (Peters et al. 2018) and the effective hierarchical aggregation (Dou et al. 2018).
The hierarchical aggregation merges states of different layers through a CNN-like tree structure with the filter size being two, to hierarchically preserve and combine feature channels.

As seen, linearly combining all layers (Row 2) achieves improvement over TRANSFORMER-BASE with almost the same training and decoding speeds. Hierarchical aggregation for multi-layer composition (Row 3) yields larger improvement in terms of BLEU score, but at the cost of a considerable speed decrease. To make a fair comparison, we also implement hierarchical aggregation for multi-head composition (Rows 4 and 5), which consistently improves performance at the cost of introducing more parameters and slower speeds.

The Proposed Approach (Rows 6-8) Firstly, we apply our NI-based composition, i.e., extended bilinear pooling, for multi-layer composition with the default linear multi-head

                            EN⇒DE                        EN⇒FR
Architecture                # Para.   Train   BLEU       # Para.   Train   BLEU
Existing NMT systems: (Vaswani et al. 2017)
TRANSFORMER-BASE            65M       n/a     27.3       n/a       n/a     38.1
TRANSFORMER-BIG             213M      n/a     28.4       n/a       n/a     41.8
Our NMT systems
TRANSFORMER-BASE            88M
+ NI-Based Composition      118M
TRANSFORMER-BIG             264M
+ NI-Based Composition      387M

Table 2: Comparing with existing NMT systems on WMT14 English⇒German ("EN⇒DE") and English⇒French ("EN⇒FR") translation tasks. : significantly better than the baseline (p < 0.01) using boo
