Understanding how to build a lexer (lexical analyzer) is a rite of passage for many programmers. It opens up a world of possibilities, from creating your own Domain Specific Languages (DSLs) to writing custom configuration parsers.
In this tutorial, we will break down the theory and implementation of a basic lexer using pure Python.
What is a Lexer?
From the moment you type code to the moment it executes, it undergoes several transformations. The journey typically follows these steps:
- Lexical Analysis (The Lexer): Converts the raw string of code into a list of “tokens” or “lexemes.”
- Parsing (The Parser): Takes the tokens and builds a structure (like an Abstract Syntax Tree) to understand the logic.
- Code Generation: Converts that structure into machine code or bytecode.
- Execution: The computer runs the code.
A lexer is the tool that performs step 1. It’s the eyes of the compiler; it looks at a messy string and identifies the meaningful parts.
Key Terminology
- Scanning: The process of reading the source code character by character.
- Lexeme: A sequence of characters that matches a pattern (e.g., the word
foror the symbol{). - Token: A categorized lexeme. For example, the lexeme
123might be tokenized as(NUMBER, 123). - Keyword: A lexeme that has a special, predefined meaning (e.g.,
if,while,print).
The Core Logic: How to Separate Code
Imagine we have this line of code:
for(i; i < 10; i++){ }
Our lexer needs to separate this into:
for, (, i, ;, i, <, 10, ;, i, ++, ), {, }
The Strategy
We iterate through the string character by character. We accumulate characters into a lexeme variable until we hit a “boundary”—usually a whitespace or a special symbol.
# Conceptual Pseudo-code
lexeme = ''
for char in source_code:
if char is a whitespace or a special symbol:
process(lexeme)
process(char)
lexeme = ''
else:
lexeme += char
Implementation in Python
Let’s build a simple lexer that can handle a snippet of Java-like code.
1. Defining our Keywords
# Single-character symbols that act as boundaries
SYMBOLS = ['{', '}', '(', ')', '[', ']', '.', '"', '*', '\n', ':', ',', ';', '=', ' ', '\t']
# Multi-character keywords
KEYWORDS = ['public', 'class', 'static', 'void', 'main', 'String', 'int', 'for']
2. The Lexing Loop
def simple_lexer(source_code):
lexemes = []
current_lexeme = ''
# We use enumerate to easily peek at the next character
for i, char in enumerate(source_code):
# If it's not a whitespace, add to current lexeme
if char not in [' ', '\t', '\n']:
current_lexeme += char
# Look ahead to see if we should "break" the current lexeme
if (i + 1 < len(source_code)):
next_char = source_code[i + 1]
# Condition to break:
# 1. Next char is a symbol/boundary
# 2. Current char is a symbol/boundary
# 3. We've just completed a known keyword
if next_char in SYMBOLS or char in SYMBOLS or current_lexeme in KEYWORDS:
if current_lexeme != '':
lexemes.append(current_lexeme)
current_lexeme = ''
# Handle the final lexeme
if current_lexeme:
lexemes.append(current_lexeme)
return lexemes
# Test it out
code = 'public class Test { int x = 10; }'
print(simple_lexer(code))
Challenges and Edge Cases
The basic lexer above is a great start, but it has some limitations:
- Multi-character Operators: How do we handle
++or==? Usually, we add a “lookahead” check to see if the next character combines with the current one to form a single token. - Strings: Everything inside
" "should be treated as one single lexeme, even if it contains spaces or symbols. - Comments: We need flags to identify when we enter a comment (
/*or//) so we can ignore the content until the comment ends.
Conclusion
Building a lexer manually is a fantastic way to learn the internals of programming languages. While libraries like re (Regular Expressions) can make this faster, doing it “by hand” gives you complete control over the parsing logic and a deeper appreciation for the tools we use every day.
In our next tutorial, we’ll look at Tokenization—adding types and metadata to these lexemes!
Further Reading:
Written by
Abdur-Rahmaan Janhangeer
Chef
Python author of 7+ years having worked for Python companies around the world
Suggested Posts
Building an Indentation Analyser in Python: A Comprehensive Guide
In many programming languages, like C or Java, scope is defined by curly braces {}. However, languag...
Creating Your Own Domain Specific Language (DSL) in Python
Sometimes, a standard CLI tool isn’t enough, but a full-blown programming language is overkill. This...
Simple JSON parser without knowing anything beforehands
Table of Content Introduction Scanning the text Looking for specific characters Identifying element...