2012-04-18
Parsing
- Two views of syntactic structure:
- Phrase structure and dependency
Classical Parsing:
- Context-free grammar (CFG) and lexicon written
- This wasn't very good. It didn't give enough parses to sentences, and attempts to make it better (like adding categorization) didn't.
- So now we have annotated data, like the Penn TreeBank
- It's time consuming to write these at first, but then they save time and give us more information.
- We do it once, and then don't need to do it again.
- We can use the banks for evaluation. ...
Catalan numbers
- {1, 1, 2, 5, 14, 42, 132, 429, ...}
The problem of choosing the correct parse is AI-hard. But words are good predictors of attachment.
Grammars are defined by their grammar, symbols and lexicon. ie G = (T,C,N,S,L,R)
For probabilistic CFGs:
- defined by G = (T,N,S,R,P)
- Terminals, Nonterminals, Start symbol, Rules, Probability function.
-
Parsers are good. It can be done in cubic time if we get sentences in binary phrase-structure form. ie every tree is binary. (X̄ form probably comes in handy here, J/M don't say anything about it.)
Parsers
CKY: Cocke-Kasami-Younger
- Viterbi algo, essentially.
let the input be a string S consisting of n characters: a1 ... an.
let the grammar contain r nonterminal symbols R1 ... Rr.
# This grammar contains the subset Rs which is the set of start symbols.
let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
for each i = 1 to n
for each unit production Rj -> ai
set P[i,1,j] = true
for each i = 2 to n -- Length of span
for each j = 1 to n-i+1 -- Start of span
for each k = 1 to i-1 -- Partition of span
for each production RA -> RB RC
if P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true
if any of P[1,n,x] is true (x is iterated over the set s, where s are all the indices for Rs) then
S is member of language
else
S is not member of language
Again binarization is crucial for O(n^3) runtime.
Steps:
- Initialize empty w*w array [for w len(sentence_to_be_parsed)]
- Put a priori probabilities for all possible tags of word w and "non-terminal" tags (S, NP, VP, etc.)
- Calc. probs of tags.