Writing a Compiler for Verilog – part 2

by Mithrandir

As previously stated, I am working on a Verilog compiler in Python using PLY. The last article described the project in general terms; from now on, we will look at specific parts of the compiler.

Today, we begin with the first stage of any compiler: lexing.

Building the lexical analyser was simple: one typically uses lex or one of its equivalents for this task. In Python, the best alternative is the lex part of PLY.

Our project implements all of this in a single file called lexer.py. In fact, everything lives in one class called VerilogLexer.
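
A minimal sketch of how such a class can be laid out with PLY follows; apart from the _error helper, which appears in the code quoted later, the constructor argument and method names here are assumptions, and the real VerilogLexer certainly contains more machinery.

    import ply.lex as lex

    class VerilogLexer(object):
        """Tokeniser for Verilog source, built on PLY's lex module."""

        def __init__(self, error_func):
            # callback invoked from t_error() with a message and the offending token
            self.error_func = error_func

        def build(self, **kwargs):
            # PLY collects every t_* attribute of this object into a lexer
            self.lexer = lex.lex(object=self, **kwargs)

        def input(self, text):
            self.lexer.input(text)

        def token(self):
            return self.lexer.token()

        def _error(self, msg, token):
            self.error_func(msg, token)
            self.lexer.skip(1)

        # ... keyword list, token list and t_* rules (shown below) go here ...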

After some bookkeeping in this class, we are ready to start describing the language. First, we define a list of all Verilog keywords so that we can distinguish them from other identifiers.


    keywords = [
            'ALWAYS',
            'AND',
            'ASSIGN',
            'BEGIN',
            ...,
            'WIRE',
            'WOR',
            'XNOR',
            'XOR'
            ]
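
The list above is paired with a small lookup table so that the identifier rule can later tell keywords apart from ordinary names. Here is a minimal sketch of that bookkeeping, assuming the keyword_map attribute consulted by t_ID further down is built directly from this list (Verilog keywords are lowercase in source code):

    # Map the lowercase source spelling to the token name, e.g. 'always' -> 'ALWAYS'.
    keyword_map = {}
    for keyword in keywords:
        keyword_map[keyword.lower()] = keyword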

After this, we describe all of the language's tokens in a simple list.


    tokens = keywords + [
            'ID',
            'INT_CONST_DEC', 'INT_CONST_OCT', 'INT_CONST_HEX',
            'INT_CONST_BIN',
            'FLOAT_CONST', 'FLOAT_CONST_EXP',
            'STRING_LITERAL',
            ...,
            'HASH', # #
            'DOLLAR', # $
            'ARROW', # ->
            'AT', # @
            ...,
            ]

Then, we have to define a regular expression for every token. Moreover, we have to define functions that tell what should be returned when a token can stand for different values (like ID, which matches every identifier, be it a keyword or not).

For example, the following definitions are for identifiers (ID) and comments. We also show the actions taken when recognising several tokens: a newline, the plus operator, any identifier, and the error action.


    identifier = r'[a-zA-Z_][0-9a-zA-Z$_]*'
    ...
    comment = r'(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)'
    ...
    def t_NEWLINE(self, t):
        r'\n+'
        # keep line numbers accurate; returning nothing discards the token
        t.lexer.lineno += t.value.count('\n')
    ...
    t_PLUS  = r'\+'
    ...
    @TOKEN(identifier)
    def t_ID(self, t):
        # identifiers that are really keywords get their keyword token type
        t.type = self.keyword_map.get(t.value, 'ID')
        #print "ID: ", t.value, t.type
        return t
    ...
    def t_error(self, t):
        msg = 'Illegal character %s' % repr(t.value[0])
        self._error(msg, t)

As you can see, the regular expressions are written in the standard syntax, much the same as the one used by grep, sed, or lex.
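
The patterns for the constant tokens listed earlier are written in the same style. Purely as an illustration, and assuming nothing about the project's real definitions, sized Verilog constants such as 4'b1010 or 8'hFF could be matched roughly like this:

    # Hedged sketches only -- the actual patterns in lexer.py may differ.
    bin_const = r"[0-9]*'[bB][01xXzZ_]+"         # e.g. 4'b1010
    oct_const = r"[0-9]*'[oO][0-7xXzZ_]+"        # e.g. 6'o17
    hex_const = r"[0-9]*'[hH][0-9a-fA-FxXzZ_]+"  # e.g. 8'hFF
    dec_const = r"[0-9][0-9_]*"                  # e.g. 42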

Defining actions can be done in several ways. One can define no action at all, which means the token is passed to the second stage unchanged (see the t_PLUS example). Another way is to define a function for an expression that we have written earlier (see the t_ID example). And we can simply discard a token that will not help us in the following phases (see the t_NEWLINE example). For any string that does not match the regular expression of any token listed in the module, the t_error function is called.

This step reads the entire file and returns the list of all tokens found inside it. Practically, the entire input is split into pieces and each piece is labelled with a specific name. The second stage of the compiler will look at the labels and at the order of the pieces to determine what they represent.
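
To give a feel for the output of this stage, here is a hedged usage sketch. The error callback is hypothetical, the build/input/token methods follow the class sketch shown earlier, and the exact labels printed depend on the project's token list.

    def report_error(msg, token):
        # hypothetical callback handed to the lexer
        print('Lexer error: %s at line %d' % (msg, token.lineno))

    lexer = VerilogLexer(error_func=report_error)
    lexer.build()
    lexer.input('assign out = a + b;')

    while True:
        tok = lexer.token()
        if tok is None:
            break
        # one labelled piece per line, e.g. ASSIGN assign / ID out / ... / PLUS + / ID b
        print('%s %s' % (tok.type, tok.value))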

But for that you'll have to wait a little longer: the next two articles will be about Haskell.
