Spring 2012

# 2012-04-18

## Parsing

• Two views of syntactic structure:
• Phrase structure and dependency
Classical Parsing:
• Context-free grammar (CFG) and lexicon written
• This wasn't very good: the grammar failed to assign parses to many real sentences, and attempts to improve coverage (like adding subcategorization) didn't fix it.
• So now we have annotated data, like the Penn TreeBank
• It's time consuming to write these at first, but then they save time and give us more information.
• We do it once, and then don't need to do it again.
• We can use treebanks for evaluation. ...
Catalan numbers: the number of possible binary parse trees for a sentence grows with the Catalan numbers
• {1, 1, 2, 5, 14, 42, 132, 429, ...}
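A quick way to check that sequence: the nth Catalan number has the closed form C(2n, n) / (n + 1). A minimal Python sketch (not from the notes):

```python
from math import comb

def catalan(n):
    # nth Catalan number: C(2n, n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(8)])
# [1, 1, 2, 5, 14, 42, 132, 429]
```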
The problem of choosing the correct parse is AI-hard. But words are good predictors of attachment.
Grammars are defined by their terminals, preterminal categories, nonterminals, start symbol, lexicon, and rules. ie G = (T,C,N,S,L,R)
For probabilistic CFGs:
• defined by G = (T,N,S,R,P)
• Terminals, Nonterminals, Start symbol, Rules, Probability function.
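As a concrete illustration of the probability function P, here is a toy PCFG as a Python dict; for each nonterminal, the probabilities of its rules sum to 1. All the rules and numbers below are invented for illustration, not from the notes:

```python
# Toy PCFG: each left-hand side's rule probabilities sum to 1.
# Rules and probabilities are made up for illustration.
pcfg = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "NN"), 0.7), (("NP", "PP"), 0.3)],
    "VP": [(("VB", "NP"), 0.6), (("VP", "PP"), 0.4)],
    "PP": [(("IN", "NP"), 1.0)],
}

# Sanity check: P is a proper probability distribution per nonterminal.
for lhs, rules in pcfg.items():
    assert abs(sum(p for _, p in rules) - 1.0) < 1e-9
```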
Parsing is tractable: it can be done in cubic time if we get sentences in binary phrase-structure form, ie every tree is binary. (X̄ form probably comes in handy here; J/M don't say anything about it.)
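Any n-ary rule can be turned into binary rules by introducing intermediate symbols, so requiring binary trees loses nothing. A minimal left-binarization sketch (the `@`-symbol naming scheme is invented here):

```python
def binarize(lhs, rhs):
    # Left-binarize the rule lhs -> rhs (len(rhs) >= 2) by folding
    # in one symbol at a time; intermediate symbols get invented
    # names like "@NP_1".
    rules = []
    left = rhs[0]
    for i, sym in enumerate(rhs[1:-1], start=1):
        new = "@%s_%d" % (lhs, i)
        rules.append((new, (left, sym)))
        left = new
    rules.append((lhs, (left, rhs[-1])))
    return rules

print(binarize("NP", ("DT", "JJ", "NN")))
# [('@NP_1', ('DT', 'JJ')), ('NP', ('@NP_1', 'NN'))]
```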

## Parsers

CKY: Cocke-Kasami-Younger
• Viterbi algo, essentially.
Pseudocode (from wikipedia: CYK_algorithm):

```
let the input be a string S consisting of n characters: a1 ... an.
let the grammar contain r nonterminal symbols R1 ... Rr.
  # This grammar contains the subset Rs which is the set of start symbols.
let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
for each i = 1 to n
  for each unit production Rj -> ai
    set P[i,1,j] = true
for each i = 2 to n        -- Length of span
  for each j = 1 to n-i+1  -- Start of span
    for each k = 1 to i-1  -- Partition of span
      for each production RA -> RB RC
        if P[j,k,B] and P[j+k,i-k,C] then
          set P[j,i,A] = true
if any of P[1,n,x] is true
  (x is iterated over the set s, where s are all the indices for Rs)
  then S is member of language
  else S is not member of language
```
Again binarization is crucial for O(n^3) runtime.
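The pseudocode above translates fairly directly into a runnable recognizer. A sketch, assuming the grammar is already in Chomsky normal form (the toy grammar in the docstring is invented):

```python
def cky_recognize(words, unary, binary, starts):
    """CKY recognizer for a grammar in Chomsky normal form.
    unary: {terminal: set of nonterminals} (unit productions A -> w),
    binary: list of (A, B, C) for rules A -> B C,
    starts: set of start symbols.
    Returns True iff words is in the language."""
    n = len(words)
    # chart[i][l] = set of nonterminals deriving the span of
    # length l+1 starting at position i
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][0] = set(unary.get(w, ()))
    for length in range(2, n + 1):          # length of span
        for i in range(n - length + 1):     # start of span
            for k in range(1, length):      # partition of span
                for A, B, C in binary:
                    if B in chart[i][k - 1] and C in chart[i + k][length - k - 1]:
                        chart[i][length - 1].add(A)
    return bool(chart[0][n - 1] & starts)
```

The triple loop over span length, span start, and split point is where the O(n^3) comes from; the binarized rules keep the innermost work constant per rule.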
Steps:
1. Initialize an empty n×n chart [for n = len(sentence_to_be_parsed)]
2. Put a priori probabilities in the chart for all possible tags of each word, including "non-terminal" tags (S, NP, VP, etc.)
3. Calculate the probabilities of tags for larger spans, bottom-up.
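The steps above can be sketched as a Viterbi-style probabilistic CKY, which fills the chart with the best (maximum) probability for each nonterminal over each span. The lexicon and rules used in testing are invented toy examples:

```python
def pcky(words, lexicon, rules):
    """Viterbi CKY over a binarized PCFG.
    lexicon: {word: {tag: prob}} (a priori tag probabilities),
    rules: list of (A, B, C, prob) for rules A -> B C.
    Returns best, where best[i][l] maps each nonterminal to the max
    probability of deriving the span of length l+1 starting at i."""
    n = len(words)
    best = [[{} for _ in range(n)] for _ in range(n)]
    # Step 2: a priori probabilities for each word's possible tags
    for i, w in enumerate(words):
        best[i][0] = dict(lexicon.get(w, {}))
    # Step 3: fill longer spans bottom-up, keeping the max probability
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            cell = best[i][length - 1]
            for k in range(1, length):
                left = best[i][k - 1]
                right = best[i + k][length - k - 1]
                for A, B, C, p in rules:
                    if B in left and C in right:
                        prob = p * left[B] * right[C]
                        if prob > cell.get(A, 0.0):
                            cell[A] = prob
    return best
```

Replacing max with sum would instead give the total probability of the sentence under the grammar (the inside probability).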
http://cs.marlboro.edu/courses/spring2012/jims_tutorials/elias/2012-04-18