Compiling CFG Parsers in Prolog


COMPILED CFG RECOGNIZERS FOR PROLOG
======================================================================
Bob Carpenter
Carnegie Mellon University
31 October 1995

This document describes what I believe to be optimal methods of
compiling several popular CFG recognition algorithms directly to
Prolog:

  - Shift-Reduce
  - Left-Corner
  - Top-Down
  - Dynamic Chart (Bottom-Up, Right-to-Left)

Less drastic compilations would have similar performance and perhaps
be easier to use.  

This exercise was party carried out in response to the rather
misleading results reported by Covington in his book _Natural Language
Processing for Prolog Programmers_ (1994, Prentice Hall).  Covington's
reported timings suffer from (1) a failure to exploite indexing, (2)
passing redundant structures, and (3) small sentence lengths in the
inputs.

In the remainder of this document, I describe the target of the
compilation based on a given grammar.  There are restrictions on the
format of the grammar, but these could be alleviated by grammar
transformations of various kinds, or by slight extensions of the
compilation strategies.  

Each example comes with a top-level recognition predicate

    rec(Ws,C)

which holds if the string of words Ws can be recognized as being of
category C.

Timing statistics are presented after the descriptions of the
recognizers themselves.  These statistics are derived by backtracking
through all paths, not by stopping after the first solution.  The
non-memoizing recognizers are clearly exponential, whereas the chart
recognizer is growing at a less than cubic clip.  

Unfortunately, the timing was not done with a reasonable sized
grammar.  There was no lexical ambiguity, and only a large enough
grammar to cover the structures generated and a few more.  I would be
interested in testing other CFGs, especially ones of a larger scale
that were designed to cover realistic data.  But in these cases, the
chart recognizer should have an even greater advantage.  Of course,
that depends on whether the ambiguity induced is top-down (C0 --> Cs
for lots of different Cs), or bottom-up (C0 --> C1, Cs for lots of
different C0 & Cs).  

Of course, results will also differ if all *parses* are to be
generated.  In this case, the advantage of the chart recognizer will
be reduced to some extent because of the lack of filtering from
existing edges in the chart.  But there will still be a gain from not
having to redo subanalyses.

Results will also differ for strings with less ambiguity than the ones
tested.  In these cases, the non-memoizing parsers should dominate,
assuming there are not too many blind paths to follow.

Slightly extended versions of these recognizers could be adapted to
full DCG grammar format.  



SHIFT-REDUCE RECOGNIZER
======================================================================
Restrictions: no empty categories or looping unary rules

rec(Ws,C):-
  srrec(Ws,[],C).

srrec([W|Ws],Cs,C):-          % shift
  lexrec(W,Ws,Cs,C).

C0 --> C1,...,CN
-----------------
srrec(Ws,[CN,...,C1|Cs],C):-  % reduce
  srrec(Ws,[C0|Cs],C).

C0 --> [W]
---------
lexrec(W,Ws,Cs,C):-
  srrec(Ws,[C0|Cs],C).


LEFT-CORNER RECOGNIZER
======================================================================
Grammar Restrictions: no empty left corners

rec(Ws,s):-
  s(Ws,[]).

C a category
------------
C(C,Ws,Ws).              % connection

C([W|Ws],WsOut):-        % bottom-up expansion
  lex(W,C,Ws,WsOut).     % needless!! -- quicken w/o empties

C --> []
--------
C(Ws,Ws).                % satisfy empty category

C0 --> C1,...,CN
-----------------
C1(C,WsIn,WsOut):-      % expand rule from left-corner
  C2(WsIn,WsMid1),      % only if C1 is possible left-corner of C
  ...,
  CN(WsMid1,WsMidN+1),
  C0(C,WsMidN+1,WsOut).

C0 --> [W]
----------             
lex(W,C,WsIn,WsOut):-   % lexical expansion
  C0(C,WsIn,WsOut).


TOP-DOWN RECOGNIZER
======================================================================
Grammar Restrictions: no left-recursive rules
    np --> det, n
    n --> n, pp 
   vp --> vp, pp
    s --> np, vp
 converted to:
    np --> det, n, ppstar
    ppstar --> []
    ppstar --> pp, ppstar
    s --> np, vp, ppstar

rec(Ws,s):-
  s(Ws,[]).

C --> []
--------
C(Ws,Ws).

C0 --> C1,...,CN
-----------------
C0(WsIn,WsOut):-
  C1(WsIn,WsMid0),
  ...
  CN(WsMidN,WsOut).

C a category
------------
C([W|Ws],Ws):-
  lexC(W).

C0 --> [W]
----------             
lexC0(W).


BOTTOM-UP LEFT-TO-RIGHT CHART PARSER
======================================================================
Grammar Restrictions: no empty categories
Notes:  optimized by keeping dynamic edges dynamic rather than in chart.

rec(Ws,s):-
  cleanchart,                     % remove old edges
  reversecount(Ws,WsRev,N),       % reverse words and count length
  build_chart(WsRev,N),           % construct full chart
  edges(0,N).                     % find s edge spanning chart


build_chart([],_).                % complete chart
build_chart([W|Ws],R):-          
  L is R-1,
  ( lex(W,L,R)                    % add lexical entries for word
  ; build_chart(Ws,L)             % build rest of chart
  ).


C0,...,CN all categories
------------------------
cleanchart:- 
  retractall(edgeC0(_,_)),
  ...,
  retractall(edgeCN(_,_)).

C --> [W]
---------
lex(W,Left,Right):-
  \+ edgeC(Left,Right),
  assertz(edgeC(Left,Right)),
  ruleC(Left,Right).

C0 --> C1,...,CN
----------------
ruleC1(Left,Right):-
  edgeC2(Right,Mid2),
  ...,
  edgeCN(MidN-1,Right).


RECOGNITION TIMING
======================================================================

Language:  Compiled (native) SICStus Prolog Version 3.0
Machine:   SPARCstation 5, 85MHz, 32MB RAM

Grammar:  fairly minimal to cover sentence

String of form: PN TV Det N (P Det N)^k

Run Time in Milliseconds (1/1000 second)

Input                    PARSER
|n|      BUChart     LC       SR  SimpDCG  LexDCG
4           4        0         2       0
7           6        0         6             
10                            52       0
13         18        0       402
16         26        7      3184       0   
19                         15870         
22         49       80                10   
25                 295
28         79     1113                90
31                4270               305      220
34        122    16330              1100      795
37        150    62520              3810     2805
40                                 13585     9975
43        211                      48875    36780
49        290
55        394
64        590
73        753
82       1172
94       1722

Note: SimpDCG uses DCG compiler directly -- takes a hit from 'C'/3.
Directory:
Home | Schedule | Projects | Publications | CV | Personal