In our previous discussions, we explored why I embarked on creating an LC-3 assembler (and why I chose to write it in C) and took a closer look at LC-3 instructions themselves. Today, we’ll dive into the high-level designs that underpin the assembler and examine the trade-offs that guided my final architectural decisions.
Note: All Python snippets here serve as illustrative pseudocode only. The actual implementation is (and will be) developed in C.
Problem (Re-)Statement
An assembler is like a language translator: it converts assembly code into machine code. Consider a snippet of LC-3 assembly along these lines:
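```
        .ORIG x3000
        AND  R0, R0, #0     ; clear R0 (this also sets the Z condition code)
        BRnz NEXT           ; forward reference: NEXT is defined two lines later
        ADD  R0, R0, #1     ; skipped when the branch is taken
NEXT    HALT
        .END
```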
It appears simple: each line translates to a numerical opcode or machine instruction. However, forward references such as BRnz NEXT introduce a challenge: the label NEXT is not defined until later in the code, so the assembler needs a strategy for handling symbols before it has actually encountered them.
The Great Debate: One Pass vs. Two Pass
One of the first major design decisions was whether to read through the source once or twice:
One-pass assembler
A one-pass assembler processes the source code in a single pass, translating instructions and resolving symbols as it goes. Forward references are handled by recording “fixups” wherever a label is not yet known, then patching those locations once the label definitions appear.
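Here is a deliberately simplified Python-pseudocode sketch of that idea. The pre-parsed item format and the names used here are invented for illustration; real LC-3 encoding would pack opcodes and PC-relative offsets into 16-bit words.

```python
# Toy one-pass assembler: the "program" is a list of pre-parsed items, where
# ("label", name) defines a label, ("branch", name) needs that label's address,
# and ("word", n) is an already-encoded value.
def assemble_one_pass(items):
    output, symbols, fixups = [], {}, []
    for kind, value in items:
        if kind == "label":
            symbols[value] = len(output)       # label points at the next word
        elif kind == "branch":
            if value in symbols:
                output.append(symbols[value])  # backward reference: resolve now
            else:
                fixups.append((len(output), value))
                output.append(None)            # placeholder word, patched later
        else:  # "word"
            output.append(value)
    for index, label in fixups:                # revisit every placeholder
        output[index] = symbols[label]
    return output

print(assemble_one_pass([("branch", "NEXT"), ("word", 1), ("label", "NEXT"), ("word", 2)]))
# -> [2, 1, 2]
```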
Although a single pass can be more memory-efficient, it can make the code more complex because you’re juggling forward references in real time.
Two-pass assembler
In contrast, the two-pass assembler reads the entire source twice. The first pass builds a comprehensive symbol table, and the second pass uses that table to generate machine code with full knowledge of all addresses.
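Using the same toy item format as above, a two-pass sketch separates the concerns cleanly:

```python
# Toy two-pass assembler over the same pre-parsed item format.
def assemble_two_pass(items):
    # Pass 1: walk the program only to record label addresses.
    symbols, address = {}, 0
    for kind, value in items:
        if kind == "label":
            symbols[value] = address
        else:
            address += 1                       # labels occupy no space

    # Pass 2: emit code with the complete symbol table in hand.
    output = []
    for kind, value in items:
        if kind == "branch":
            output.append(symbols[value])      # every label is already known
        elif kind == "word":
            output.append(value)
    return output
```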
Ultimately, I chose the two-pass approach, as it keeps symbol resolution neat and avoids the complexity of intermixing “fixup” logic with code generation.
Building the Pipeline
After deciding on two passes, I needed an overarching structure for the assembler itself. I opted for a modular pipeline architecture—think of it as a factory production line where each station has one specific job.
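In Python pseudocode, the skeleton might look roughly like this, with each stage standing in for a future C module (the stage bodies are trivial placeholders):

```python
# Pipeline skeleton: each stage has one job and hands its result to the next.
# The bullet list below describes what the real C modules will do.
def lex(line):
    return line.split(";")[0].split()          # Lexer: drop comments, tokenize

def parse(tokens):
    if not tokens:
        return None                            # blank or comment-only line
    return {"opcode": tokens[0], "operands": tokens[1:]}   # Parser

def encode(instruction):
    return 0x0000                              # Encoder: placeholder machine word

def assemble(source_lines):
    words = []                                 # Writer would stream these to a file
    for line in source_lines:
        instruction = parse(lex(line))
        if instruction is not None:
            words.append(encode(instruction))
    return words
```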
- Lexer: Strips comments and breaks lines into tokens.
- Parser: Converts tokens into an internal representation of the instruction (opcode + operands).
- Encoder: Turns that instruction representation into machine code.
- Writer: Outputs the machine code to the final binary or object file.
This isolation of responsibilities keeps the code more maintainable and testable.
Alternative Paths Not Taken
Event-Driven Architecture
One possible alternative is an event-driven design, where each component subscribes to and emits events as data flows between them:
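A minimal sketch of what that could look like, assuming a hypothetical EventBus with subscribe and emit methods:

```python
# Stages communicate by emitting and subscribing to named events
# instead of calling each other directly.
class EventBus:
    def __init__(self):
        self.handlers = {}

    def subscribe(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def emit(self, event, payload):
        for handler in self.handlers.get(event, []):
            handler(payload)

bus = EventBus()
bus.subscribe("line_read",    lambda line: bus.emit("tokens_ready", line.split()))
bus.subscribe("tokens_ready", lambda tokens: print("parse", tokens))
bus.emit("line_read", "ADD R1, R1, #1")
```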
While flexible and extensible, this adds more complexity than necessary for a straightforward LC-3 assembler.
Table-Driven Design
Another approach is a table-driven design, where instruction formats and validation rules are stored in a data structure:
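A sketch of the idea, with an illustrative (and deliberately incomplete) instruction table:

```python
# Each opcode's format lives in a table instead of a chain of if/else branches.
INSTRUCTION_TABLE = {
    "ADD": {"opcode": 0b0001, "operands": ["DR", "SR1", "SR2_or_imm5"]},
    "AND": {"opcode": 0b0101, "operands": ["DR", "SR1", "SR2_or_imm5"]},
    "LD":  {"opcode": 0b0010, "operands": ["DR", "PCoffset9"]},
    "BR":  {"opcode": 0b0000, "operands": ["PCoffset9"]},
}

def validate(mnemonic, operands):
    entry = INSTRUCTION_TABLE.get(mnemonic)
    if entry is None:
        return f"unknown instruction: {mnemonic}"
    if len(operands) != len(entry["operands"]):
        return f"{mnemonic} expects {len(entry['operands'])} operand(s)"
    return None   # no error
```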
Table-driven designs can reduce “if-else” statements and make it easier to add instructions. However, managing more complex logic, such as labels and condition codes, may still require additional structures or logic.
Error Handling
A robust assembler must give clear feedback when users make mistakes. The two-pass approach makes it easier to detect and report errors at well-defined points: duplicate labels during symbol collection (first pass), and undefined labels or invalid operands during encoding (second pass):
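A rough sketch of the error-collection idea; the parsed-line format and field names here are hypothetical:

```python
KNOWN_OPCODES = {"ADD", "AND", "NOT", "BR", "LD", "ST", "JMP", "TRAP"}  # subset for illustration

# Gather every problem into a list, then report them together at the end.
def check_program(parsed_lines, symbols):
    errors = []
    for number, instruction in parsed_lines:          # (line number, parsed dict) pairs
        opcode = instruction.get("opcode")
        target = instruction.get("label_operand")
        if opcode not in KNOWN_OPCODES:
            errors.append(f"line {number}: unknown opcode '{opcode}'")
        if target is not None and target not in symbols:
            errors.append(f"line {number}: undefined label '{target}'")
    for message in errors:
        print(message)
    return not errors                                 # True means assembly can proceed

check_program([(3, {"opcode": "BR", "label_operand": "LOOP"}),
               (7, {"opcode": "XYZ", "label_operand": None})],
              {"NEXT": 0x3005})
# line 3: undefined label 'LOOP'
# line 7: unknown opcode 'XYZ'
```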
Collecting all errors first and then displaying them together is much more user-friendly than aborting at the first one.
Memory Management Considerations
When implementing in C, the memory implications of our architectural choices become particularly significant. The two-pass approach requires maintaining the entire symbol table in memory, which means careful memory allocation and deallocation strategies. Here’s how our design handles this:
In the first pass, we allocate memory for the symbol table conservatively, using a simple but efficient hash table implementation that grows as needed. While this might seem like a memory overhead compared to a one-pass approach, it actually helps prevent memory fragmentation that could occur with the multiple allocations needed for fixup records in a one-pass design. The symbol table’s memory footprint is also predictable and proportional to the number of labels in the source code, making it easier to manage resource constraints on smaller systems.
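In Python-flavoured pseudocode, the growth policy might look something like the sketch below. The real version is a C hash table with explicit allocation and rehashing, so this only illustrates the load-factor rule:

```python
class SymbolTable:
    LOAD_FACTOR = 0.75

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = {}                 # stands in for the C bucket array

    def define(self, label, address):
        if (len(self.entries) + 1) / self.capacity > self.LOAD_FACTOR:
            self.capacity *= 2            # the C table would also reallocate buckets and rehash here
        self.entries[label] = address

    def lookup(self, label):
        return self.entries.get(label)
```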
Testing the Pipeline
One major advantage of our modular pipeline architecture is how naturally it lends itself to testing. Each component can be tested in isolation, which is particularly valuable when implementing in C, where debugging can be more challenging. For example:
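Reusing the pseudocode helpers sketched in earlier sections, component-level tests might look like this (the real tests will be small C functions driven by a simple harness):

```python
# Each pipeline stage is exercised in isolation, with no full assembly run needed.
def test_lexer_strips_comments():
    assert lex("ADD R1, R1, #1 ; increment") == ["ADD", "R1,", "R1,", "#1"]

def test_parser_separates_opcode_and_operands():
    parsed = parse(["ADD", "R1,", "R1,", "#1"])
    assert parsed["opcode"] == "ADD"
    assert parsed["operands"] == ["R1,", "R1,", "#1"]

def test_two_pass_resolves_forward_reference():
    program = [("branch", "NEXT"), ("word", 1), ("label", "NEXT"), ("word", 2)]
    assert assemble_two_pass(program) == [2, 1, 2]
```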
This testing structure allows us to catch issues early in the development process and makes it easier to maintain the codebase over time. Each component’s interface becomes a natural boundary for test cases, and we can verify edge cases without needing to construct complex assembly programs.
The Final Design & Conclusion
After weighing all these considerations—one-pass vs. two-pass, pipeline vs. event-driven vs. table-driven—I arrived at a two-pass solution with a modular pipeline. This approach:
- Separates symbol collection (first pass) and code generation (second pass) for clarity.
- Provides a clean “assembly line” structure where each phase has a single responsibility.
- Makes debugging easier, thanks to comprehensive error reporting.
In the next posts, we’ll dive deeper into the details of each pipeline stage—especially how labels, instructions, and error handling integrate seamlessly in C. By building on this solid architecture, the assembler remains both extensible and approachable, ensuring that future modifications or new features can be added without unravelling existing code.
The choice of C as our implementation language actually complements our architectural decisions well. The pipeline design maps naturally to C’s procedural nature, with each component implementing a clear interface through function pointers. This makes it straightforward to maintain separation of concerns while avoiding the complexity of managing virtual method tables that might come with an object-oriented approach. However, we face some challenges, particularly in string manipulation during lexing and in managing dynamic memory for the symbol table. These will be addressed through careful design of the data structures and effective use of C’s standard library; along the way, I hope to show that even without modern language features, a well-thought-out architecture can lead to clean, maintainable code.
Thanks for reading! If you have questions about these design choices or want to share your own assembler project experiences, feel free to leave a comment. Stay tuned for the next installment, where we’ll zoom in on how the internals of each pipeline component work together in practice.