年轻人的第一个词法分析器——COOL 的 lexer

Posted on Aug 20, 2021

花了一天多的时间实现了这个词法分析器，从骨架上开始写确实可以少关心很多繁杂的小问题，体验好了许多。

词法分析是编译的第一步，做的事情就是对源代码按照语法规则进行分词，并为其指定对应的类型，形成一系列 <类型，词素> 这样的二元组（token，词法单元）。

那么要实现这样一个词法分析器，需要做两件事，第一，匹配每个词素，第二，为每个词素指定类型。匹配词素作为一个模式匹配问题，使用正则匹配是很合适的。我们用一组正则表达式表达出所有的类型的词素，匹配到词素时就可以用正则表达式的类型给该词素指定类型了。当然在匹配过程中，要注意几个细节问题：

首先是匹配歧义的问题，以英语举例，对 input 这个字符串从左往右扫描，扫到 “n” 的时候其实已经可以匹配了，因为 “in” 就是一个合法的词素，但是正常人都会看到 “t” 为止，把这个字符串理解为 “input”，这就是最长匹配原则，当对代码进行匹配时，如果碰到 “=="，虽然 “=” 也显然满足正则表达式，但是我们仍然应该匹配到最长，也就是匹配失败的上一个接受状态。
然后是匹配优先级的问题，对于 else 这个标识符来说，它是一个关键字，那么它可以被 R_Keyword 匹配。但是光看变量的定义（即字母开头，由字母数字下划线组成的字符串）时，我们会发现 else 也会被 R_Variable 匹配。当然大多数（PL1 不是这样的）的语言都不允许变量是关键字，那么我们如何改写正则表达式呢？比较难。所以一个更好的方法是在匹配 R 的时候优先匹配 R_Keyword 这个表达式，这样 else 就不会被识别为变量了。
最后是词法错误的情况，如果代码中出现了词法错误，让匹配直接出错崩溃肯定是不可接受的，所以可以加一个 R_Fault，这个正则表达式包含了所有的错误情况。有了匹配优先级的思想，我们可以很简单地写出这个表达式，也就是字母表全集，放到 R 的最后即可。那么匹配不了的都会被这个正则表达式识别，然后对词法错误进行异常处理即可。

有了正则表达式后，就可以把表达式转为程序了，一个典型的方式是正则 -> NFA -> DFA -> 程序实现。有些情况下也会有不同的实现路径，比如直接实现 NFA。这里的每一步转换其实都是容易理解的，不过要程序化地实现从正则转到 NFA 应该会比较麻烦，这也和我学习计算机科学相关性甚小，所以直接使用工具即可，这里就是使用 flex 这个工具进行转换，可以直接将正则转成 C/C++。

关于 flex 的使用有前辈翻译了文档，基本上看这个就够用了，这里就不多说了。

有了工具，那就只需要写好正则表达式和对应的 action 就行了。比较简单的是关键字，只要完全匹配就可以了，当然要注意 COOL 对关键字是大小写不敏感的，所以可以这样写

CLASS		[cC][lL][aA][sS][sS]

不过其实 flex 也是提供了指定大小写不敏感的方法的，但是我觉得这样也很明了了。而且写这个式子也不复杂，由于我是 VIM 使用者，所以我只要先写下

CLASS		class

然后使用正则替换 :s/\([a-z]\)/\[\1\U\1\]/g 即可。

对于 typeid 类也是一样，内部的基础类处理起来都比较容易，COOL 对类型和变量命名有要求，大驼峰为类型名，小驼峰为变量名、方法名，所以与 objectid 类使用正则就可以直接区分。

行注释比较容易，忽略整行即可，不多说了。

空格也比较简单，忽略掉就行。

处理块注释是比较麻烦的，我一开始一直在尝试用正则直接把注释匹配出来，由于 COOL 的块注释是可以嵌套的，所以一直弄不好。后来发现 flex 有提供 input/yyinput 这个方法（前者用于 C，后者用于 C++）让我们可以在 pattern 的 action 中读取输入，由此对注释和字符串的处理就变得简单了。要注意注释的嵌套，可以用一个变量来维护左右括号的差，到零即表明注释结束。字符串也是一样，注意错误的处理即可。要特别注意字符串的终止定义，原文为

the beginning of the next line if an unescaped newline occurs at any point in the string; or
after the closing ” otherwise.

也就是如果换行不带 ‘\’，在换行的地方就终止了。

最后就是不认识的词素了，我写了一个巨长的表达式

([^" \\\:\n\t\f\r\v\+\-\*\/\~\<\=\(\)\{\}\;\!\@\#\$\%\^\&\'\>\_\.\[\]\,]+|\!|\@|\#|\$|\%|\^|\&|\'|\>|\_|\.|\[|\]|\,|\:|\\)

基本思路就是把空格和所有错误字符排除掉组成的串作为被匹配串，这个串不会超过之前的可能被匹配串的长度，所以不会影响其他表达式，然后错误字符需要自成一个错误，即每一个错误字符都报一次错，这样就或上所有的错误字符即可。

最后完成的 cool.flex 就是

/*
 *  The scanner definition for COOL.
 */

/*
 *  Stuff enclosed in %{ %} in the first section is copied verbatim to the
 *  output, so headers and global definitions are placed here to be visible
 * to the code in the file.  Don't remove anything that was here initially
 */
%{
#include <cool-parse.h>
#include <stringtab.h>
#include <utilities.h>

/* The compiler assumes these identifiers. */
#define yylval cool_yylval
#define yylex  cool_yylex

/* Max size of string constants */
#define MAX_STR_CONST 1025
#define YY_NO_UNPUT   /* keep g++ happy */

extern FILE *fin; /* we read from this file */

/* define YY_INPUT so we read from the FILE fin:
 * This change makes it possible to use this scanner in
 * the Cool compiler.
 */
#undef YY_INPUT
#define YY_INPUT(buf,result,max_size) \
	if ( (result = fread( (char*)buf, sizeof(char), max_size, fin)) < 0) \
		YY_FATAL_ERROR( "read() in flex scanner failed");

char string_buf[MAX_STR_CONST]; /* to assemble string constants */
char *string_buf_ptr;

extern int curr_lineno;
extern int verbose_flag;

extern YYSTYPE cool_yylval;

/*
 *  Add Your own definitions here
 */

static int ERR_report(char* msg);
static char error_msg[0x50];

%}

/*
 * Define names for regular expressions here.
 */

/*
 * tokens
 */

/* operators */

DARROW		=>
ASSIGN		<-
LE			<=
SINGLE_OPERATOR	(\+|\-|\*|\/|\~|\<|\=|\(|\)|\{|\}|\;|\:|\.|\,|\@)

/* Keywords */
CLASS		[cC][lL][aA][sS][sS]
ELSE		[eE][lL][sS][eE]
FI			[fF][iI]
IF			[iI][fF]
IN			[iI][nN]
INHERITS	[iI][nN][hH][eE][rR][iI][tT][sS]
ISVOID		[iI][sS][vV][oO][iI][dD]
LET			[lL][eE][tT]
LOOP		[lL][oO][oO][pP]
POOL		[pP][oO][oO][lL]
THEN		[tT][hH][eE][nN]
WHILE		[wW][hH][iI][lL][eE]
CASE		[cC][aA][sS][eE]
ESAC		[eE][sS][aA][cC]
NEW			[nN][eE][wW]
OF			[oO][fF]
NOT			[nN][oO][tT]
TRUE		t[rR][uU][eE]
FALSE		f[aA][lL][sS][eE]

/* basic class, token_class: TYPEID */
TYPEID		(Object|IO|String|Int|SELF_TYPE|B[oO][oO][lL]|T[rR][uU][eE]|F[aA][lL][sS][eE]|[A-Z][a-zA-Z0-9_]*)

/* CONST */
INT_CONST	[0-9]+

/* White Space */
WHITESPACE 	[ \t\r\v\f]+

/* comments */

COMMENTS_LINE			--.*[^\n]

/* OBJECT */
OBJECTID	[a-z][a-zA-Z0-9_]*

%%

 /*
  *  Nested comments
  */

"*)"			{
	return ERR_report("Unmatched *)");
}

"(*"			{
	int c;
	int last_is_star = 0;
	int last_is_bracket = 0;
	int comment_starts_left = 1;
	for ( ; ; ) 
	{
		c = yyinput();
		if (c == '\n') 
		{
			curr_lineno++;
			last_is_star = 0;
			last_is_bracket = 0;
		}
		else if (c == '*')
		{
			if (last_is_bracket) 
			{
				comment_starts_left++;	
			}
			last_is_star = 1;
			last_is_bracket = 0;
		}
		else if (c == '(')
		{
			last_is_star = 0;
			last_is_bracket = 1;
		}
		else if (c == ')')
		{
			if (last_is_star)
			{
				comment_starts_left--;
			}
			if (comment_starts_left == 0)
			{
				break;
			}
			last_is_star = 0;
			last_is_bracket = 0;
		}
		else if (c == EOF)
		{
			return ERR_report("EOF in comment");
		}
	}
}

 /*
  *  Line comments
  */

{COMMENTS_LINE}

 /*
  *  The multiple-character operators.
  */
{DARROW}		{ return (DARROW); 	}
{ASSIGN}		{ return (ASSIGN); 	}
{LE}			{ return (LE); 		}

 /*
  * operaotrs
  */
":"				{ return ':'; }
{SINGLE_OPERATOR}	{ return *((char*) yytext); }

 /*
  * Keywords are case-insensitive except for the values true and false,
  * which must begin with a lower-case letter.
  */
{CLASS}			{ return (CLASS); }
{NEW}			{ return (NEW); }
{ELSE}			{ return (ELSE); }
{FI}			{ return (FI); }
{IF}			{ return (IF); }
{IN}			{ return (IN); }
{INHERITS}		{ return (INHERITS); }
{ISVOID}		{ return (ISVOID); }
{LET}			{ return (LET); }
{LOOP}			{ return (LOOP); }
{POOL}			{ return (POOL); }
{THEN}			{ return (THEN); }
{WHILE}			{ return (WHILE); }
{CASE}			{ return (CASE); }
{ESAC}			{ return (ESAC); }
{OF}			{ return (OF); }
{NOT}			{ return (NOT); }
{TRUE}			{
	cool_yylval.boolean = true;
	return {BOOL_CONST};
}
{FALSE}			{
	cool_yylval.boolean = false;
	return {BOOL_CONST};
}

 /*
  *  TYEPID
  */
{TYPEID}		{
	BEGIN(0);
	cool_yylval.symbol = idtable.add_string(yytext);
	return (TYPEID);
}

 /*
  *  OBJECTID
  */
{OBJECTID} 		{
	cool_yylval.symbol = idtable.add_string(yytext);
	return (OBJECTID);
}

 /*
  *  constants 
  */

{INT_CONST}		{
	cool_yylval.symbol = inttable.add_string(yytext);
	return (INT_CONST);
}

 /*
  *  String constants (C syntax)
  *  Escape sequence \c is accepted for all characters c. Except for 
  *  \n \t \b \f, the result is c.
  *
  */

"\""			{
	char c;
	int error_flag = 0;
	int string_buf_ptr = 0;
	for ( ; ; )
	{
		if (string_buf_ptr >= (MAX_STR_CONST - 1))
		{
			strcpy(error_msg, "String constant too long");
			error_flag = 1;
			break;
		}
		c = yyinput();
		if (c == '\\')
		{
			c = yyinput();
			if (c == '\n')
			{
				curr_lineno++;
				string_buf[string_buf_ptr++] = '\n';
			}
			else if (c == 'b')
			{
				string_buf[string_buf_ptr++] = '\b';
			}
			else if (c == 't')
			{
				string_buf[string_buf_ptr++] = '\t';
			}
			else if (c == 'n')
			{
				string_buf[string_buf_ptr++] = '\n';
			}
			else if (c == 'f')
			{
				string_buf[string_buf_ptr++] = '\f';
			}
			else
			{
				string_buf[string_buf_ptr++] = c;
			}
		}
		else if (c == '\"')
		{
			break;
		}
		else if (c == '\n')
		{
			curr_lineno++;
			return ERR_report("Unterminated string constant");
		}
		else if (c == '\0')
		{
			strcpy(error_msg, "String contains null character");
			error_flag = 1;
			break;
		}
		else if (c == EOF)
		{
			return ERR_report("EOF in string constant");
		}
		else
		{
			string_buf[string_buf_ptr++] = c;
		}
	}
	string_buf[string_buf_ptr++] = 0;

	if (error_flag) 
	{
		while(c != '\"')
		{
			c = yyinput();
			if (c == '\n')
			{
				curr_lineno++;
			}
		}
		return ERR_report(error_msg);
	}
	else
	{
		cool_yylval.symbol = stringtable.add_string(string_buf);
		return (STR_CONST);
	}
}

 /*
  * process the lines
  */

[\n]			{ curr_lineno++; }
{WHITESPACE}

 /*
  * error lex
  */
([^" \\\:\n\t\f\r\v\+\-\*\/\~\<\=\(\)\{\}\;\!\@\#\$\%\^\&\'\>\_\.\[\]\,]+|\!|\@|\#|\$|\%|\^|\&|\'|\>|\_|\.|\[|\]|\,|\:|\\)	{
	return ERR_report(yytext);
}

%%

int ERR_report(char* msg)
{
	cool_yylval.error_msg = msg;
	return (ERROR);
}

我编了一个 test，检查可能存在的乱七八糟的输入

class TestClassA : #"#TestClassB" TestClassC
"test string \  
second line"  
"test string  
second line"  
!@#$%^&*()_+-=[]{}|\:;"'<>?,./~`  
12341231    
00102354123   
_ __ ___ ____   
_abasd _1234123   
a_ a_b      
\\\\\\\\

在加上本来就提供的 test.cl，和 reflexer 生成的 diff 了一下，发现没有区别，应该基本就可以了。然后 make mycoolc，使用 mycoolc 即可编译了，我选择了 example 中的 life.cl（康威生命游戏）编译了一下，运行，成功，cheers！

cheers!