Build a simple interpreter --Part 1


  今天突然想要记录一下自己的学习过程,一方面是为了以后更好的回顾,另一方面则是在记录的过程中希望看到自己的成长。因为我是根据一篇博客进行学习的,所以接下来可能会使用很多英语。

  什么是编译器(compiler)或解释器(interpreter)呢?它们的功能是什么?它们之间又有什么曲别呢?

  The goal of an interpreter or a compiler is to translate a source program in some high-level language into some other form taht a computer can recognize.For the purpose of this series,let's agree that if a translator translates a source program into machine language,it's a compiler.If a tanslator processes and executes the source program without translating it into machine language first,we call it intepreter.

  We will start our first foray into interpreters and compilers by writing a simple interpreter of arithmetic expressions,as knowns as a calculator.Today 's goal is very simple and it's to make our calculator handle the addition of two single digit integer like 1+1.Here is the source code for our interpreter or calculator:

  

  1 # Token types
  2 #
  3 # EOF (end-of-file) token is used to indicate that
  4 # there is no more input left for lexical analysis
  5 INTEGER, PLUS, EOF = 'INTEGER', 'PLUS', 'EOF'
  6 
  7 
  8 class Token(object):
  9     def __init__(self, type, value):
 10         # token type: INTEGER, PLUS, or EOF
 11         self.type = type
 12         # token value: 0, 1, 2. 3, 4, 5, 6, 7, 8, 9, '+', or None
 13         self.value = value
 14 
 15     def __str__(self):
 16         """String representation of the class instance.
 17 
 18         Examples:
 19             Token(INTEGER, 3)
 20             Token(PLUS '+')
 21         """
 22         return 'Token({type}, {value})'.format(
 23             type=self.type,
 24             value=repr(self.value)
 25         )
 26 
 27     def __repr__(self):
 28         return self.__str__()
 29 
 30 
 31 class Interpreter(object):
 32     def __init__(self, text):
 33         # client string input, e.g. "3+5"
 34         self.text = text
 35         # self.pos is an index into self.text
 36         self.pos = 0
 37         # current token instance
 38         self.current_token = None
 39 
 40     def error(self):
 41         raise Exception('Error parsing input')
 42 
 43     def get_next_token(self):
 44         """Lexical analyzer (also known as scanner or tokenizer)
 45 
 46         This method is responsible for breaking a sentence
 47         apart into tokens. One token at a time.
 48         """
 49         text = self.text
 50 
 51         # is self.pos index past the end of the self.text ?
 52         # if so, then return EOF token because there is no more
 53         # input left to convert into tokens
 54         if self.pos > len(text) - 1:
 55             return Token(EOF, None)
 56 
 57         # get a character at the position self.pos and decide
 58         # what token to create based on the single character
 59         current_char = text[self.pos]
 60 
 61         # if the character is a digit then convert it to
 62         # integer, create an INTEGER token, increment self.pos
 63         # index to point to the next character after the digit,
 64         # and return the INTEGER token
 65         if current_char.isdigit():
 66             token = Token(INTEGER, int(current_char))
 67             self.pos += 1
 68             return token
 69 
 70         if current_char == '+':
 71             token = Token(PLUS, current_char)
 72             self.pos += 1
 73             return token
 74 
 75         self.error()
 76 
 77     def eat(self, token_type):
 78         # compare the current token type with the passed token
 79         # type and if they match then "eat" the current token
 80         # and assign the next token to the self.current_token,
 81         # otherwise raise an exception.
 82         if self.current_token.type == token_type:
 83             self.current_token = self.get_next_token()
 84         else:
 85             self.error()
 86 
 87     def expr(self):
 88         """expr -> INTEGER PLUS INTEGER"""
 89         # set current token to the first token taken from the input
 90         self.current_token = self.get_next_token()
 91 
 92         # we expect the current token to be a single-digit integer
 93         left = self.current_token
 94         self.eat(INTEGER)
 95 
 96         # we expect the current token to be a '+' token
 97         op = self.current_token
 98         self.eat(PLUS)
 99 
100         # we expect the current token to be a single-digit integer
101         right = self.current_token
102         self.eat(INTEGER)
103         # after the above call the self.current_token is set to
104         # EOF token
105 
106         # at this point INTEGER PLUS INTEGER sequence of tokens
107         # has been successfully found and the method can just
108         # return the result of adding two integers, thus
109         # effectively interpreting client input
110         result = left.value + right.value
111         return result
112 
113 
114 def main():
115     while True:
116         try:
117             # To run under Python3 replace 'raw_input' call
118             # with 'input'
119             text = raw_input('calc> ')
120         except EOFError:
121             break
122         if not text:
123             continue
124         interpreter = Interpreter(text)
125         result = interpreter.expr()
126         print(result)
127 
128 
129 if __name__ == '__main__':
130     main()

  If we want our code to run successfully,we must follow some principles with our input:

      1. Only single digit integers are allowed in the input

       2.The only arithmetic operation supported at the moment is addition

       3. No whitespace characters are allowed anyevery in the input

  When we enter an expression 3+5 on the command line,we will get a string "3+5".In order for the interpreter to understand what to do with that string it first need to break the string into components called tokens.A token is a object that has a type and a value.The prograss of breaking the input string into tokens is called lexical analysis.So the first step our interpreter need to do is read the input of characters convert it into a stream of tokens.The part of interpreter taht does it is called lexical analyzer or lexer for short.

  The method get_next_token of the interpreter class is our lexical analyzer.Every time we call it,we get the next token created from the input characters passed to our interpreter.After our interpreter has access to a stream of tokens made from the input characters,the interpreter needs to do something with it:it need to find the structure in the flat stream of tokens it gets from lexer get_next_token.Our interpreter except to find the following structure in that stream:INTEGER->PLUS->INTEGER.

That is,it tries to find a squence of tokens:integer followed by a plus sign followed by an integer.

  The method responible for finding and interpreting that structure is expr.This method verifies that the sequence of tokens does indeed corresponed to the expected sequence of tokens,i.e INTEGER -> PLUS -> INTEGER.After it's successfully confirmed the structure,it generates the result by adding the value of the token on the left side  of the PLUS and the right side of the PLUS,thus successfully interpreting the arithmmetic expression we passed to the interpreter.

  The expr method itself use the helper method eat to verify the token type passed to the eat method matches the current token type.After matching the passed token type the eat method gets the next token and assigns it to current_token variable,thus effectively "eating" the currently matched token and advancing the imageinary pointer in the stream of tokens. If the structure in the stream of tokens doesn't correspond the excepted INTEGER PLUS INTEGER sequence of tokens the eat method throws an exception.

  And the next goal we want to accomplish is below :

    1 Modify the code to allow multiple-digit integers in the input,for example "12+3"

    2 Add a method that skips whitespace characters so that your calculator can handle inputs with whitespace characters like " 12 +3"

    3 Modify the code and instead of "+" handle "-" to evaluate subtractions like "7-5"