All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.

Similar Documents

Description

Structured Composition of Semantic Vectors Stephen Wu Mayo Clinic William Schuler The Ohio State University Abstract Distributed models of semantics assume

Transcript

Structured Composition of Semantic Vectors Stephen Wu Mayo Clinic William Schuler The Ohio State University Abstract Distributed models of semantics assume that word meanings can be discovered from the company they keep. Many such approaches learn semantics from large corpora, with each document considered to be unstructured bags of words, ignoring syntax and compositionality within a document. In contrast, this paper proposes a structured vectorial semantic framework, in which semantic vectors are defined and composed in syntactic context. As such, syntax and semantics are fully interactive; composition of semantic vectors necessarily produces a hypothetical syntactic parse. Evaluations show that using relationally-clustered headwords as a semantic space in this framework improves on a syntax-only model in perplexity and parsing accuracy. 1 Introduction Distributed semantic representations like Latent Semantic Analysis (Deerwester et al., 1990), probabilistic LSA (Hofmann, 2001), Latent Dirichlet Allocation (Blei et al., 2003), or relational clustering (Taskar et al., 2001) have garnered widespread interest because of their ability to quantitatively capture gist semantic content. Two modeling assumptions underlie most of these models. First, the typical assumption is that words in the same document are an unstructured bag of words. This means that word order and syntactic structure are ignored in the resulting vectorial representations of meaning, and the only relevant relationship between words is the same-document relationship. Second, these semantic models are not compositional in and of themselves. They require some external process to aggregate the meaning representations of words to form phrasal or sentential meaning; at best, they can jointly represent whole strings of words without the internal relationships. This paper introduces structured vectorial semantics (SVS) as a principled response to these weaknesses of vector space models. In this framework, the syntax semantics interface is fully interactive: semantic vectors exist in syntactic context, and any composition of semantic vectors necessarily produces a hypothetical syntactic parse. Since semantic information is used in syntactic disambiguation (MacDonald et al., 1994), we would expect practical improvements in parsing accuracy by accounting for the interactive interpretation process. Others have incorporated syntactic information with vector-space semantics, challenging the bagof-words assumption. Syntax and semantics may be jointly generated with Bayesian methods (Griffiths et al., 2005); syntactic structure may be coupled to the basis elements of a semantic space (Padó and Lapata, 2007); clustered semantics may be used as a pre-processing step (Koo et al., 2008); or, semantics may be learned in some defined syntactic context (Lin, 1998). These techniques are interactive, but their semantic models are not syntactically compositional (Frege, 1892). SVS is a generative model of sentences that uses a variant of the last strategy to incorporate syntax at preterminal tree nodes, but is inherently compositional. Mitchell and Lapata (2008) provide a general framework for semantic vector composition: p=f(u,v,r,k) (1) 295 where u and v are the vectors to be composed, R is syntactic context, K is a semantic knowledge base, and p is a resulting composed vector (or tensor). In this initial work of theirs, they leave out any notion of syntactic context, focusing on additive and multiplicative vector composition (with some variations): Add: p[i]=u[i]+v[i] Mult: p[i]=u[i] v[i] (2) Since the structured vectorial semantics proposed here may be viewed within this framework, our discussion will begin from their definition in Section 2.1. Erk and Padó s (2008) model also fits inside Mitchell and Lapata s framework, and like SVS, it includes syntactic context. Their semantic vectors use syntactic information as relations between multiple vectors in arriving at a final meaning representation. The emphasis, however, is on selectional preferences of individual words; resulting representations are similar to word-sense disambiguation output, and do not construct phrase-level meaning from word meaning. Mitchell and Lapata s more recent work (2009) combines syntactic parses with distributional semantics; but the underlying compositional model requires (as other existing models would) an interpolation of the vector composition results with a separate parser. It is thus not fully interactive. Though the proposed structured vectorial semantics may be defined within Equation 1, the end output necessarily includes not only a semantic vector, but a full parse hypothesis. This slightly shifts the focus from the semantically-centered Equation 1 to an accounting of meaning that is necessarily interactive (between syntax and semantics); vector composition and parsing are then twin lenses by which the process may be viewed. Thus, unlike previous models, a unique phrasal vectorial semantic representation is composed during decoding. Due to the novelty of phrasal vector semantics and lack of existing evaluative measures, we have chosen to report results on the well-understood dual problem of parsing. The structured vectorial semantic framework subsumes variants of several common parsing algorithms, two of which will be discussed: lexicalized parsing (Charniak, 1996; Collins, 1997, etc.) and relational clustering (akin to latent annotations (Matsuzaki et al., 2005; Petrov et al., 2006; Gesmundo et al., 2009)). Because previous work has shown that linguistically-motivated syntactic state-splitting already improves parses (Klein and Manning, 2003), syntactic states are split as thoroughly as possible into subcategorization classes (e.g., transitive and intransitive verbs). This pessimistically isolates the contribution of semantics on parsing accuracy it will only show parsing gains where semantic information does not overlap with distributional syntactic information. Evaluations show that interactively considering semantic information with syntax has the predicted positive impact on parsing accuracy over syntax alone; it also lowers per-word perplexity. The remainder of this paper is organized as follows: Section 2 describes SVS as both vector composition and parsing; Section 3 shows how relational-clustering SVS subsumes PCFG-LAs; and Section 4 evaluates modeling assumptions and empirical performance. 2 Structured Vectorial Semantics 2.1 Vector Composition We begin with some notation. This paper will use boldfaced uppercase letters to indicate matrices (e.g.,l), boldfaced lowercase letters to indicate vectors (e.g., e), and no boldface to indicate any singlevalued variable (e.g. i). Indices of vectors and matrices will be associated with semantic concepts (e.g., i 1,i 2,...); variables over those indices are single-value (scalar) variables (e.g., i); the contents of vectors and matrices can be accessed by index (e.g., e[i 1 ] for a constant, e[i] for a variable). We will also define an operation d( ), which lists the elements of a column vector on the diagonal of a diagonal matrix, i.e., d(e)[i, i]=e[i]. Often, these variables will technically be functions with arguments written in parentheses, producing vectors or matrices (e.g.,l(l) produces a matrix based on the value of l). As Mitchell and Lapata (2008) did, let us temporarily suspend discussion on what semantics populate our vectors for now. We can rewrite their equation (Equation 1) in SVS notation by following several conventions. All semantic vectors have a fixed dimensionality and are denoted e; source vectors and the 296 a) e ǫ (l MOD )S e 00 (l MOD )DT the e 0 (l MOD )NP e 01 (l ID )NN engineers pulled off... b) e 00= headwords iu it ip truth a T 0=[.1.1.8] e 01 = headwords iu it ip truth L 0 00(l MOD ) = M(l MOD NP l MOD DT l ID NN) = parent 0 iu it ip parent 0 iu it ip child 00 i p i t i u parent 0 i p i t i u Figure 1: a) Syntax and semantics on a tree during decoding. Semantic vectors e are subscripted with the node s address. Relations l and syntactic categories c are constants for the example. b) Example vectors and matrices needed for the composition of a vector at address 0 (Section 2.2.1). target vector are differentiated by subscript; instead of context variables R andk we will usem and L: e γ =f(e α,e β,m,l) (3) Syntactic context is in the form of grammar rules M that are aware of semantic concepts; semantic knowledge is in the form of labeled dependency relationships between semantic concepts, L. Both of these are present and explicitly modeled as matrices in SVS s canonical form of vector composition: e γ =M d(l γ α e α ) d(l γ β e β ) 1 (4) Here, M is a diagonal matrix that encapsulates probabilistic syntactic information, where the syntactic probabilities depend on the semantic concept being considered. The L matrices are linear transformations that capture how semantically relevant source vectors are to the resulting vector (e.g., L γ α defines the the relevance of e α to e γ ), with the intuition that two 1D vectors are under consideration and require a 2D matrix to relate them. 1 is a vector of ones this takes a diagonal matrix and returns a column vector corresponding to the diagonal elements. Of note in this definition of f( ) is the presence of matrices that operate on distributed semantic vectors. While it is widely understood that matrices can represent transformations, relatively few have used matrices to represent the distributed, dynamic nature of meaning composition (see Rudolph and Giesbrecht (2010) for a counterexample). 2.2 Syntax Semantics Interface This section aims to more thoroughly define the way in which the syntax and semantics interact during structured vectorial semantic composition. SVS will specify this interface such that the composition of semantic vectors is probabilistically consistent and subsumes parsing under various frameworks. Parsing has at times added semantic annotations that unwittingly carry some semantic value: headwords (Collins, 1997) are one-word concepts that subsume the words below them; latent annotations (Matsuzaki et al., 2005) are clustered concepts that touch on both syntactic and semantic information at a node. Of course, other annotations (Ge and Mooney, 2005) carry more explicit forms of semantics. In this light, semantic concepts (vector indices i) and relation labels (matrix arguments l) may also be seen as annotations on grammar trees. Let us introduce notation to make the connection with parsing and syntax explicit. This paper will denote syntactic categories ascand string yields asx. The location of these variables in phrase structure will be identified using subscripts that describe the path from the root to the constituent. 1 Paths consist of left and/or right branches (indicated by 0 s and 1 s, respectively, as in Figure 1a). Variables α, β, and ι stand for whole paths; γ is the path of a composed vector; and ǫ is the empty path at the root. The yield x γ is the observed (sub)string that eventually results from the progeny of c γ. Multiple trees τ γ can be constructed at γ by stringing together grammar rules that are consistent with observed text. 1 For simplicity, trees are assumed to be compiled into strictly binary-branching form. 297 2.2.1 Lexicalized Parsing To illustrate the definitions and operations presented in this section, we start with the concrete semantic space of headwords (i.e., bilexical parsing) before moving on to a formal definition. Our example here corresponds to the best parse of the first two words in Figure 1a. In this example domain, assume that the semantic space of concept headwords is{i pulled,i the,i unk }, abbreviated as{i p,i t,i u } where the last concept is a constant for infrequently-observed words. This semantic space becomes the indices of semantic vectors; complete vectorseat each node of Figure 1a are shown in Figure 1b. The tree in Figure 1a contains complete concept vectors e at each node, with corresponding indices i. Values in these vectors (see Figure 1b) are probabilities, indicating the likelihood that a particular concept summarizes the meaning below a node. For example, consider e 00 : i t produces the yield below address 00 ( the ) with probability 1, and i u may also produce the with probability 0.1. Not shown on the tree are the matrices in Figure 1b. In the parametrized matrix M(l MOD NP l MOD DT l ID NN), each diagonal element corresponds to the hypothesized grammar rule s probability, given a headword. Similarly, the matrixl 0 00 (l MOD ) is parametrized by the semantic context l MOD here, l MOD represents a generalized modifier semantic role. For the semantic concept i p at address 0, the left-child modifier (address 00) could be semantic concepti t with probability 0.2, or concept i u with probability 0.8. Finally, by adding an identity matrix for L 0 01 (l ID ) (a head semantic role) to the quantities in Figure 1b, we would have all the components to construct the vector at address 0: truth e 0 = d d = M L e L e e 0 Since the vector was constructed in syntactic and semantic context, the tree structure shown (including semantic relationshipsl) is implied by the context Probabilities in vectors and matrices Formally defining the probabilities in Figure 1, SVS populates vectors and matrices by means of 5 probability models (models are denoted by θ), along with the process of composition: Syntactic model M(lc γ lc α lc β )[i γ,i γ ]=P θm (lci γ lc α lc β ) Semantic model L γ ι (l ι )[i γ,i ι ]=P θl (i ι i γ,l ι ) Preterminal model e γ [i γ ]=P θp-vit(g) (x γ lci γ ), for pretermγ (5) Root const. model a T ǫ[i ǫ ]=P πgǫ (lci ǫ ) Any const. model a T γ[i γ ]=P πg (lci γ ) These probabilities are encapsulated into vectors and matrices using a convention: column indices of vectors or matrices represent conditioned semantic variables, row indices represent modeled variables. As an example, from Figure 1b, elements of L 0 00 (l MOD ) represent the probability P θl (i 00 i 0,l 00 ). Thus, the conditioned variable i 00 is shown in the figure as column indices, and the modeled i 0 as row indices. This convention applies to the M matrix as well. Recall that M is a diagonal matrix its rows and columns model the same variable. Thus, we could rewrite P θm (lci γ lc α lc β ) as P θm (lci γ lc α lc β,i γ ) to make a consistent probabilistic interpretation. We have intentionally left out the probabilistic definition of normal (non-preterminal) nonterminals P θvit(g), and the rationale fora T vectors. These are both best understood in the dual problem of parsing Vector Composition for Parsing The vector composition of Equation 4 can be rewritten with all arguments and syntactic information as: e γ =M(lc γ lc α lc β ) d(l γ α (l α ) e α ) d(l γ β (l β ) e β ) 1 (4 ) 298 iu it ip a compact representation that masks the underlying consistent probability operations. This section will expand the vector composition equation to show its equivalence to standard statistical parsing methods. Let us say that e γ [i γ ]=P(x γ lci γ ), the probability of giving a particular yield given the present distributed semantics. Recall that in matrix multiplication, there is a summation over the inner dimensions of the multiplied objects; replacing matrices and vectors with their probabilistic interpretations and summing in the appropriate places, each element of e γ is then: e γ [i γ ]=P θm (lci γ lc α lc β ) i α P θl (i α i γ,l α ) P θvit(g) (x α lci α ) i β P θl (i β i γ,l β ) P θvit(g) (x β lci β ) (6) This can be loosely considered the multiplication of the syntax (θ M term), left-child semantics (first sum), and right-child semantics (second sum). The only summations are between L and e, since all other multiplications are between diagonal matrices (similar to pointwise multiplication). We can simplify this probability expression by grouping θ M and θ L into a grammar rule P θg (lci γ lci α lci β ) def = P θm (lci γ lc α lc β ) P θl (i α i γ,l α ) P θl (i β i γ,l β ), since they deal with everything except the yield of the two child nodes. The summations are then pushed to the front: e γ [i γ ]= i α,i β P θg (lci γ lci α lci β ) P θvit(g) (x α lci α ) P θvit(g) (x β lci β ) (7) Thus, we have a standard chart-parsing probabilityp(x γ lci γ ) with distributed semantic concepts in each vector element. The use of grammar rules necessarily builds a hypothetical subtree τ γ. In a typical CKY algorithm, the tree corresponding to the highest probability would be chosen; however, we have not defined how to make this choice for vectorial semantics. We will choose the best tree with probability 1.0, so we define a deterministic Viterbi probability over candidate vectors (not concepts) and context variables: P θvit(g) (x γ lce γ ) def = e γ =argmax lce ι (a T ιe ι P πg (lca T ι) P θvit(g) (x lce ι )) (8) where is an indicator function such that φ =1 if φ is true, 0 otherwise. Intuitively, the process is as follows: we construct the vector e ι at a node, according to Eqn. 4 ; we then weight this vector against prior knowledge about the context a T ι ; the best vector in context will be chosen (the argmax). Also, the vector at a node comes with assumptions of what structure produced it. Thus, the last two terms in the parentheses are deterministic models ensuring that the best subtreeτ ι is indeed the one generated. Determining the root constituent of the Viterbi tree is the same process as choosing any other Viterbi constituent, except that prior contextual knowledge gets its own probability model in a T ǫ. As before, the most likely tree ˆτ ǫ is the tree that maximizes the probability at the root, and can be constructed recursively from the best child trees. Importantly, ˆτ ǫ has an associated, sentential semantic vector which may be construed as the composed semantic information for the whole parsed sentence. Similar phrasal semantic vectors can be obtained anywhere on the parse chart. These equations complete the linear algebraic definition of structured vectorial semantics. 3 SVS with Relational Clusters 3.1 Inducing Relational Clusters Unlike many vector space models that are based on the frequencies of terms in documents, we may consider frequencies of terms that occur in similar semantic relations (e.g., head l ID or modifier l MOD ). Reducing the dimensionality of terms in a term context matrix will result in relationally-clustered concepts. From a parsing perspective, this amounts to latent annotations (Matsuzaki et al., 2005) in l-context. 299 Let us re-notate the headword-lexicalized version of SVS (the example in Section 2.2.1) using h for headword semantics, and reserve i for relationally-clustered concepts. Treebank trees can be deterministically annotated with headwords h and relations l by using head rules (Magerman, 1995). The 5 SVS models θ M, θ L, θ P-Vit(G), π Gǫ, and π G can thus be obtained by counting instances and normalizing. Empirical probabilities of this kind are denoted with a tilde, whereas estimated models have a hat. Concepts i in a distributed semantic representation, however, cannot be found from annotated trees (see example concepts in Figure 2). Therefore, we use Expectation Maximization (EM) in a variant of the inside-outside algorithm (Baker, 1979) to learn distributed-concept behavior. In the M-step, the datainformed result of the E-step is used to update the estimates ofθ M,θ L, andθ H (whereθ H is a generlization ofθ P-Vit(G) to any nonterminal). These updated estimates are then plugged back in to the next E-step. The two steps continually alternate until convergence or a maximum number of iterations. E-step: M-step: ˆP(i γ,i α,i β lc γ,lc α,lc β )=ˆP θout (lci γ,lch ǫ lch γ ) ˆP θins (lch γ lci γ ) ˆP(lch ǫ ) E(lci γ,lci α,lci β )= ˆP(i γ,i α,i β lc γ,lc α,lc β ) P(lc γ,lc α,lc β ) ˆP θm (lci γ lc α,lc β )= i α,i β E(lci γ,lci α,lci β ) lciα,lci β E(lci γ,lci α,lci β ) ˆP θl (i α i γ ;l α )= lc γ,c α

Related Search

We Need Your Support

Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks