语法分析简单总结

Posted on Aug 27, 2021

语法分析是词法分析的后一步，主要的目的就是通过词法分析器生成的 tokens 来生成抽象语法树（Abstract Syntax Tree，AST）。主要通过上下文无关文法（Context Free Grammar，CFG）来描述语言的语法，使用相应的算法更具 CFG 来解析 tokens 形成 AST。语法分析应该是编译原理中非常偏向数学的一个方面，并且已经有了很成熟的解决方案，这句话的意思就是语法分析和我学习的方向和目的关系不大。不过这个东西感觉还挺有意思，同时虽然关系不大，但是了解一下还是有必要的，这里简单总结一下，对于算法方面不准备详细写，之后有时间再补上吧（之后学校的编译原理课应该会在 LL，LR 这种地方花很多时间，到时候再来弄也无妨，现在就不浪费生命了）。在慕中语法分析占据了比较大的篇幅，我不觉得自己都听懂了，也有很多不记得了，所以下面写的东西里面可能有很多谬误，仅供参考了哈哈哈。

上下文无关文法

一个很炫酷的名词，就和确定有限状态机一样，听起来相当牛逼，其实不是特别复杂的东西。形式定义可以参考维基百科，不过这个定义我觉得一下子能看懂的人不多，实际上 CFG 就是一种语法规则，使用产生式、终结符、非终结符来定义一种语法的推导和归约过程。终结符的意思就是此符号无法再推导（也就是替换为）别的符号，非终结符则可以根据特定的规则推导出别的符号，产生式就是这个特定的规则。

比如，定义 $S,T$ 两种非终结符，$a,b$ 两种终结符，产生式（博客的 mathjax 好像有点问题，换行换不出来，我就用多个空格代替了。）：

$$ S \rightarrow Tb \ \ \ \ \ T \rightarrow Ta\ |\ \epsilon

在产生式的箭头左边的是单个非终结符，右边的则是该非终结符可以推导出的符号。产生式中的 $|$ 是一个简写，上面的第二个产生式和

$$ T \rightarrow Ta\ \ \ \
T \rightarrow \epsilon

是等价的。

上面的文法定义了可以由任意个（包括 0 个）$a$ 开头，以 $b$ 结尾的语言。特别的，这里的 $S$ 是开始符号，此 CFG 描述的语言最后都能够归约成 $S$。这样我们就获得了一个四元组 $<V,\Sigma,R,S>$，$V$ 是非终结符的集合，$\Sigma$ 是终结符的集合，$R$ 是产生式的集合，$S$ 是开始符号，这样就是一个上下文无关文法了。

CFG 中的上下文无关听起来很高深，其实这个无关指的仅仅是每个非终结符能够推导出来的符号和该终结符前后的符号无关。这是由产生式箭头左边只有一个非终结符决定的。比如

$$ aSa \rightarrow T

就不是上下文无关文法，因为 $S$ 必须要根据上下文才能推导出 $T$。

由于 CFG 可以处理嵌套的语法，使它相比起正则语言更适合做语法分析。比如 (((a + b) * c) ^ 2) 这样一个语句，用正则难以描述，因为括号的数量和出现的位置是不确定的，但是用 CFG 就可以容易地表达出来

$$ S \rightarrow (S)\ |\ T\ \ \ \
…

就解决了嵌套语法。CFG 的表达能力足够强大，能够表达绝大多数的编程语言，同时它也比较容易实现，所以在语法分析阶段往往使用 CFG。

二义性

对于一个 CFG 而言，很容易出现二义性，比如对于一个简单的减法语法

$$ S \rightarrow S - S\ |\ int

当我们用此语法来生成

$$ a - b - c

的解析树时，就可以生成两种解析树

graph TB
S1((S))
S2((S))
A((a))
B((b))
PLUS1(("-"))
C((c))
PLUS2(("-"))

a((a))
pLUS1(("-"))
s1((S))
s2((S))
b((b))
pLUS2(("-"))
c((c))

S1-->S2
S1-->C
S1-->PLUS1
S2-->A
S2-->PLUS2
S2-->B

s1-->a
s1-->pLUS1
s1-->s2
s2-->b
s2-->pLUS2
s2-->c

当 $a = 1, b = 2, c = 3$ 时，使用左边的解析树最后产生的结果将是 $-4$，使用右边则会是 $2$。如果一个表达式在某语法下可以产生多个最左解析树或最右解析树，那么这个语法就是有二义性的。

消歧

用于描述编程语言语法规则的较为直观的 CFG 往往都是有二义性的，不过这些 CFG 往往可以通过重写语法的方式修改为无二义性文法。有一些语言无法被重写为无二义性文法，那么该语言描述的编程语言就是有问题的，语法的二义性必然会造成生成和理解上的问题，所以这些语言我们不必关心。

重写语法

$$ S \rightarrow S - S\ |\ int

通过重写语法可以消除二义性，再以减法语法为例，从数学的定义来讲，$a - b - c$ 的正确运算顺序应该是 $(a - b) - c$，所以我们可以把语法重写为

$$ S \rightarrow S - T\ |\ T\ \ \ \
T \rightarrow int

这种情况下，之前的两种解析树就只会剩下左边的一颗了。

再举一个略复杂的例子，使用语法

$$ S \rightarrow S + S\ |\ S * S\ |\ int

对于输入字符串

$$ a + b * c

解析时，二义性会造成解析树的不唯一。所以我们要重写语法，保证乘法的优先级大于加法，且相同优先级时从左往右进行，比如这样重写

$$ S \rightarrow S + T\ |\ T\ \ \ \
T \rightarrow T * F\ |\ F\ \ \ \
F \rightarrow int

就不再存在二义性。

我们可以发现，重写语法虽然解决了二义性的问题，但是重写后的语法不再直观，难以理解，我们在设计 CFG 的时候，还是更希望使用更加直观的二义性语言。幸运的是，许多工具都提供消歧声明来自动化地消歧，比如 bison 提供了 %left （左结合）%right（右结合）的来对符号进行定义，实现优先级的定义。对于上面的语法重写，$S \rightarrow S + T$ 就是左结合的，这样对于相同的优先级，就会从左向右的运算了。而右结合就可以写作 $S \rightarrow T + S$，这样就会从右向左运算。

同时，重写时非终结符的嵌套关系，又决定了运算符的优先级，上面的例子中，表示乘法的非终结符被嵌套到了加法里面，所以乘法就会先于加法进行。在 bison 中，多个 %left 定义时，越靠下的就符号优先级就越高

%left '+' '-' 
%left '*' '/'

比如使用上面的定义，bison 就可以以加法和减法处于同一优先级，乘法和除法处于更高的优先级的方式实现消歧，这当然不是 bison 用了某种魔法理解了我们的定义表示的意思，而是使用了这个定义生成了正确的语法，实现了区分优先级的效果。

关于 bison 到底是怎样重写语法的，或者它究竟有没有重写语法来实现消歧，其实我并没有研究，因为网上相关的资料并不是很多，这种问题也懒得去翻文档了，对于这个 %left 定义，许多资料中都只是简单的说了一下通过这个就定义优先级了，没有深究，可能是觉得这个很明显，但是我个人觉得这个其实挺迷惑的，所以这里写一下自己的理解，可能有错，仅供参考。
假设仅给予 bison 一个 %left T2 的定义，和一个产成式 $S \rightarrow S\ T2\ S\ |\ T1$，那么我们可以认为 bison 会自动重写语法为
$$ S \rightarrow T\ T2\ S\ \ \ \
T \rightarrow T1
$$
即添加一个非终止符，然后把原产生式重写为左结合。
然后假设给予
%right T1
%left T2
的定义，和产生式 $S \rightarrow S\ T1\ S\ |\ S\ T2 \ S\ |\ T3$，由于这里 T1 被定义在 T2 之前，因此先重写 T1 为右结合式，然后重写 T2 为左结合式
$$ S \rightarrow T\ T1\ S\ \ \ \
T \rightarrow T\ T2\ F\ \ \ \
F \rightarrow T3
$$
这样 T2 的生成式被嵌套到了 T1 中，造成了 T2 的优先级大于 T1。

解析算法

CFG 本身只能回答输入是否属于该文法，要形成解析树，需要使用产生式作为“指导”，并使用特定算法解析。有许多算法可以实现解析，都有各自的优缺点。

自顶向下

所谓自顶向下，就是从开始符号进行推导，直到推导出输入的 tokens 序列为止。

递归下降

递归下降是一种搜索-回溯的算法，根据产生式的每种情况递归生成表达式，然后与输入字符串匹配，这个过程根据输入字符串的特点可以进行大量有效的剪枝。由于搜索-回溯是树状的，所以其实它很适合用来生成语法树。不过根据递归的性质，对于这样一个产生式

$$ T \rightarrow T\ S

它会一直从 $T$ 向下递归，永远无法到达 $S$，造成解析错误。这种文法称为左递归，对于递归下降算法而言，左递归是不可接受的，不过不管是手动还是自动的，左递归都可以被容易地消除。

同时递归下降在回溯地处理上比较复杂，需要处理大量的情况。

但是递归下降仍然是比较容易手动实现的一种算法，所以有许多编译器都以该算法实现语法分析，比如 gcc。

预测解析

作为一种自顶向下的算法，它不需要做任何搜索，对于每个输入他都有确定的跳转位置，类似于 DFA。但是为了达到这种效果，被匹配的语言需要是 $LL(1)$ 语言。

自底向上

移位归约

bison 就是使用这种算法进行匹配的。具体的有许多语言比如 LALR，SLR。慕课听完几天后就忘了，之后用到了再复习吧。

使用 bison 生成 COOL 的语法分析器

慕课提供了一个框架，我只要在里面填上 BNF 的一块就可以了，主要是需要学习一下提供的 AST 模块的用法，实现起来其实并不困难，按照 manual 里面的 Figure 1 一个个写出来就可以了。

/*
*  cool.y
*              Parser definition for the COOL language.
*
*/
%{
  #include <iostream>
  #include "cool-tree.h"
  #include "stringtab.h"
  #include "utilities.h"
  
  extern char *curr_filename;
  
  
  /* Locations */
  #define YYLTYPE int              /* the type of locations */
  #define cool_yylloc curr_lineno  /* use the curr_lineno from the lexer
  for the location of tokens */
  
    extern int node_lineno;          /* set before constructing a tree node
    to whatever you want the line number
    for the tree node to be */
  
  
      #define YYLLOC_DEFAULT(Current, Rhs, N)         \
      Current = Rhs[1];                             \
      node_lineno = Current;
  
  
    #define SET_NODELOC(Current)  \
    node_lineno = Current;
  
    /* IMPORTANT NOTE ON LINE NUMBERS
    *********************************
    * The above definitions and macros cause every terminal in your grammar to 
    * have the line number supplied by the lexer. The only task you have to
    * implement for line numbers to work correctly, is to use SET_NODELOC()
    * before constructing any constructs from non-terminals in your grammar.
    * Example: Consider you are matching on the following very restrictive 
    * (fictional) construct that matches a plus between two integer constants. 
    * (SUCH A RULE SHOULD NOT BE  PART OF YOUR PARSER):
  
    plus_consts	: INT_CONST '+' INT_CONST 
  
    * where INT_CONST is a terminal for an integer constant. Now, a correct
    * action for this rule that attaches the correct line number to plus_const
    * would look like the following:
  
    plus_consts	: INT_CONST '+' INT_CONST 
    {
      // Set the line number of the current non-terminal:
      // ***********************************************
      // You can access the line numbers of the i'th item with @i, just
      // like you acess the value of the i'th exporession with $i.
      //
      // Here, we choose the line number of the last INT_CONST (@3) as the
      // line number of the resulting expression (@$). You are free to pick
      // any reasonable line as the line number of non-terminals. If you 
      // omit the statement @$=..., bison has default rules for deciding which 
      // line number to use. Check the manual for details if you are interested.
      @$ = @3;
  
  
      // Observe that we call SET_NODELOC(@3); this will set the global variable
      // node_lineno to @3. Since the constructor call "plus" uses the value of 
      // this global, the plus node will now have the correct line number.
      SET_NODELOC(@3);
  
      // construct the result node:
      $$ = plus(int_const($1), int_const($3));
    }
  
    */
  
  
  
    void yyerror(char *s);        /*  defined below; called for each parse error */
    extern int yylex();           /*  the entry point to the lexer  */
  
    /************************************************************************/
    /*                DONT CHANGE ANYTHING IN THIS SECTION                  */
  
    Program ast_root;	      /* the result of the parse  */
    Classes parse_results;        /* for use in semantic analysis */
    int omerrs = 0;               /* number of errors in lexing and parsing */
    %}
  
    /* A union of all the types that can be the result of parsing actions. */
    %union {
      Boolean boolean;
      Symbol symbol;
      Program program;
      Class_ class_;
      Classes classes;
      Feature feature;
      Features features;
      Formal formal;
      Formals formals;
      Case case_;
      Cases cases;
      Expression expression;
      Expressions expressions;
      char *error_msg;
    }
  
    /* 
    Declare the terminals; a few have types for associated lexemes.
    The token ERROR is never used in the parser; thus, it is a parse
    error when the lexer returns it.
  
    The integer following token declaration is the numeric constant used
    to represent that token internally.  Typically, Bison generates these
    on its own, but we give explicit numbers to prevent version parity
    problems (bison 1.25 and earlier start at 258, later versions -- at
    257)
    */
    %token CLASS 258 ELSE 259 FI 260 IF 261 IN 262 
    %token INHERITS 263 LET 264 LOOP 265 POOL 266 THEN 267 WHILE 268
    %token CASE 269 ESAC 270 OF 271 DARROW 272 NEW 273 ISVOID 274
    %token <symbol>  STR_CONST 275 INT_CONST 276 
    %token <boolean> BOOL_CONST 277
    %token <symbol>  TYPEID 278 OBJECTID 279 
    %token ASSIGN 280 NOT 281 LE 282 ERROR 283
  
    /*  DON'T CHANGE ANYTHING ABOVE THIS LINE, OR YOUR PARSER WONT WORK       */
    /**************************************************************************/
  
    /* Complete the nonterminal list below, giving a type for the semantic
    value of each non terminal. (See section 3.6 in the bison 
    documentation for details). */
  
    /* Declare types for the grammar's non-terminals. */
    %type <program> program
    %type <classes> class_list
    %type <class_> class
  
    /* You will want to change the following line. */
    %type <features> feature_list 
    %type <feature> method attr

	%type <expressions> expression_list argument_list
	%type <expression> expression let_nested_expression

	%type <formals> formal_list
	%type <formal> formal

	%type <cases> case_list
	%type <case_> case_branch
  
    /* Precedence declarations go here. */
	%right IN
	%right ASSIGN 
	%left NOT
	%nonassoc LE '<' '='
    %left '+' '-'
	%left '*' '/'
	%left ISVOID
	%left '~'
	%left '@'
	%left '.'
  
    %%
    /* 
    Save the root of the abstract syntax tree in a global variable.
    */
    program	: class_list	{ @$ = @1; ast_root = program($1); }
    ;
  
    class_list
    : class			/* single class */
    { $$ = single_Classes($1);
    parse_results = $$; }
    | class_list class	/* several classes */
    { $$ = append_Classes($1, single_Classes($2)); 
    parse_results = $$; }
    ;
  
    /* If no parent is specified,  the class inherits from the Object class. */
    class
	: CLASS TYPEID '{' feature_list '}' ';'
    { $$ = class_($2, idtable.add_string("Object"), $4, 
    stringtable.add_string(curr_filename)); }
    | CLASS TYPEID INHERITS TYPEID '{' feature_list '}' ';'
    { $$ = class_($2, $4, $6, stringtable.add_string(curr_filename)); }
	| error '}' ';'
	{ $$ = 0; }
    ;
  
    /* Feature list may be empty, but no empty features in list. */
    feature_list
	: feature_list method ';'
	{ $$ = append_Features($1, single_Features($2)); }
	| feature_list attr ';'
	{ $$ = append_Features($1, single_Features($2)); }
	| /* this is empty */
	{ $$ = nil_Features(); }
	| error ';'
	{ $$ = 0; }
    ;

	method
	: OBJECTID '(' formal_list ')' ':' TYPEID '{' expression '}'
	{ $$ = method($1, $3, $6, $8); }
	;

	attr
	: OBJECTID ':' TYPEID	/* attr without init */
	{ $$ = attr($1, $3, no_expr()); }
	| OBJECTID ':' TYPEID ASSIGN expression	/* attr with init */
	{ $$ = attr($1, $3, $5); }
	;

	formal_list 
	: formal	/* only one formal */
	{ $$ = single_Formals($1); }
	| formal_list ',' formal /* more than one formals */
	{ $$ = append_Formals($1, single_Formals($3)); };
	| /* or zero */
	{ $$ = nil_Formals(); }
	;

	formal 
	: OBJECTID ':' TYPEID  
	{ $$ = formal($1, $3); };

	expression 
	: OBJECTID ASSIGN expression
	{ $$ = assign($1, $3); }
	/* expr@TYPE.ID(...) */
	| expression '@' TYPEID '.' OBJECTID '(' argument_list ')'
	{ $$ = static_dispatch($1, $3, $5, $7); }
	/* expr.ID(...) */
	| expression '.' OBJECTID '(' argument_list ')'
	{ $$ = dispatch($1, $3, $5); }
	/* ID(...) */
	| OBJECTID '(' argument_list ')'
	{ $$ = dispatch(object(idtable.add_string("self")), $1, $3); }
	/* if then else fi */
	| IF expression THEN expression ELSE expression FI
	{ $$ = cond($2, $4, $6); }
	/* while expr loop expr pool */
	| WHILE expression LOOP expression POOL
	{ $$ = loop($2, $4); }
	/* block expression */ 
	| '{' expression_list '}'
	{ $$ = block($2); }
	| '{' error '}'
	{ $$ = 0; }
	/* let */
	| LET let_nested_expression
	{ $$ = $2; }
	/* case */
	| CASE expression OF case_list ESAC
	{ $$ = typcase($2, $4); }
	| NEW TYPEID
	{ $$ = new_($2); }
	| ISVOID expression
	{ $$ = isvoid($2); }
	| expression '+' expression
	{ $$ = plus($1, $3);}
	| expression '-' expression
	{ $$ = sub($1, $3);}
	| expression '*' expression
	{ $$ = mul($1, $3);}
	| expression '/' expression
	{ $$ = divide($1, $3);}
	| '~' expression
	{ $$ = neg($2); }
	| expression '<' expression
	{ $$ = lt($1, $3);}
	| expression LE expression	/* LE represent '<=' */
	{ $$ = leq($1, $3);}
	| expression '=' expression
	{ $$ = eq($1, $3);}
	| NOT expression
	{ $$ = comp($2);}
	/* (expr) */
	| '(' expression ')'
	{ $$ = $2; }
	/* ID */
	| OBJECTID
	{ $$ = object($1); }
	/* integer */
	| INT_CONST
	{ $$ = int_const($1); }
	/* string */
	| STR_CONST
	{ $$ = string_const($1); }
	| BOOL_CONST
	{ $$ = bool_const($1); }
	;

	case_list
	: case_branch
	{ $$ = single_Cases($1); }
	| case_list case_branch
	{ $$ = append_Cases($1, single_Cases($2)); }
	;

	case_branch
	: OBJECTID ':' TYPEID DARROW expression ';' 
	{ $$ = branch($1, $3, $5); }

	let_nested_expression
	: OBJECTID ':' TYPEID IN expression
	{ $$ = let($1, $3, no_expr(), $5); }
	| OBJECTID ':' TYPEID ASSIGN expression IN expression
	{ $$ = let($1, $3, $5, $7); }
	| OBJECTID ':' TYPEID ',' let_nested_expression
	{ $$ = let($1, $3, no_expr(), $5); }
	| OBJECTID ':' TYPEID ASSIGN expression ',' let_nested_expression
	{ $$ = let($1, $3, $5, $7); }
	| error ','
	{ $$ = 0; }
	| error
	{ $$ = 0; }
	;

	expression_list
	: expression ';'
	{ $$ = single_Expressions($1); }
	| expression_list expression ';'
	{ $$ = append_Expressions($1, single_Expressions($2)); }
	;

	argument_list
	: expression 
	{ $$ = single_Expressions($1); }
	| argument_list ',' expression
	{ $$ = append_Expressions($1, single_Expressions($3)); }
	| /* this is empty */
	{ $$ = nil_Expressions(); }
	;
  
    /* end of grammar */
    %%
  
    /* This function is called automatically when Bison detects a parse error. */
    void yyerror(char *s)
    {
      extern int curr_lineno;
  
      cerr << "\"" << curr_filename << "\", line " << curr_lineno << ": " \
      << s << " at or near ";
      print_cool_token(yychar);
      cerr << endl;
      omerrs++;
  
      if(omerrs>50) {fprintf(stdout, "More than 50 errors\n"); exit(1);}
    }

最后也是用 life.cl 来测试一下

cheers!