Data Structures Deep Dive: Building an LC-3 Assembler

(Caption: Data flow diagram showing how our data structures interconnect pipeline stages and provide support throughout the process.)

Imagine we’re building a translation machine that needs to understand two very different languages: the human-readable assembly code that programmers write, and the binary machine code that computers execute. This is exactly what our LC-3 assembler does, and today we’re going to explore the data structures that make this translation possible.

In our previous discussions, we explored the why behind building an LC-3 assembler (link1 link2) and dove into the high-level architectural decisions. Now it’s time to look at how we actually represent and manipulate assembly code during the translation process. While our actual implementation is in C, I’ll use Python to illustrate these concepts more clearly - think of it as using a familiar language to explain ideas that work in any programming language.

The Building Blocks: Core Data Structures

Every assembly program is essentially a sequence of text lines that need to be transformed into machine code. But before we can do that transformation, we need ways to store and manipulate these different forms of data. Let’s look at the core data structures that make this possible.

Source Line Representation: Breaking Down Assembly Code

At the heart of our assembler is how we represent each line of assembly code. Consider this line:

1

START   ADD R1, R2, R3    ; Add registers

We need to capture the label (“START”), the operation (“ADD”), the operands (“R1, R2, R3”), and any comments. Here’s how we structure this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


@dataclass
class SourceLine:
    label: str | None        # Optional label (e.g., "START")
    operation: str | None    # Operation or pseudo-op (e.g., "ADD")
    operands: str | None     # Raw operand string (e.g., "R1, R2, R3")
    comment: str | None      # Optional comment
    line_number: int         # Original source line number

    def __str__(self):
        parts = []
        if self.label:
            parts.append(f"{self.label}:")
        if self.operation:
            parts.append(self.operation)
        if self.operands:
            parts.append(self.operands)
        if self.comment:
            parts.append(f"; {self.comment}")
        return " ".join(parts)

Why this structure? It allows us to:

Preserve the original line’s components for error reporting
Process each part independently during assembly
Maintain source line numbers for debugging

Symbol Table: The Address Book of Our Assembler

Think of assembly labels like bookmarks in a massive book - they help us quickly find specific locations in our program. But with potentially hundreds or thousands of labels, we need an efficient way to store and look them up. That’s where our symbol table comes in, implemented as a hash table for lightning-fast access:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37


@dataclass
class SymbolEntry:
    name: str           # Label name
    address: int        # Memory address (0x0000 to 0xFFFF)
    line_number: int    # Where it was defined
    is_defined: bool    # For catching duplicate definitions

class SymbolTable:
    def __init__(self, size=1024):
        self.size = size
        self.buckets = [[] for _ in range(size)]
    
    def _hash(self, name: str) -> int:
        hash_value = 0
        for char in name:
            hash_value = (hash_value * 31 + ord(char)) & 0xFFFFFFFF
        return hash_value % self.size
    
    def add_symbol(self, name: str, address: int, line_number: int) -> bool:
        bucket_idx = self._hash(name)
        # Check for duplicates
        for entry in self.buckets[bucket_idx]:
            if entry.name == name:
                return False
        
        # Add new entry
        self.buckets[bucket_idx].append(
            SymbolEntry(name, address, line_number, True)
        )
        return True
    
    def lookup_symbol(self, name: str) -> SymbolEntry | None:
        bucket_idx = self._hash(name)
        for entry in self.buckets[bucket_idx]:
            if entry.name == name:
                return entry
        return None

The symbol table is crucial during both passes:

First pass: Collect all label definitions and their addresses
Second pass: Resolve label references to actual addresses

We chose a hash table implementation because:

O(1) average lookup time is essential for large assembly files
The size (1024 buckets) provides a good balance between memory usage and collision avoidance
Simple chaining for collision resolution keeps the code straightforward

Instruction Record: The Bridge Between Assembly and Machine Code

When parsing instructions, we need an intermediate representation that bridges the gap between assembly syntax and binary machine code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28


@dataclass
class InstructionRecord:
    opcode: int                  # 4-bit operation code
    operands: list[int]          # Up to 3 operands
    label: str | None            # Associated label for branches
    address: int                 # Memory address
    line_number: int             # Source line number
    has_imm5: bool              # Flag for immediate mode

    def encode(self) -> int:
        """Convert to 16-bit machine code."""
        code = (self.opcode & 0xF) << 12
        
        # Handle different instruction formats
        if self.opcode == 0x1:  # ADD
            if self.has_imm5:
                # Format: ADD DR, SR1, imm5
                code |= (self.operands[0] & 0x7) << 9  # DR
                code |= (self.operands[1] & 0x7) << 6  # SR1
                code |= (1 << 5)                       # Immediate mode
                code |= self.operands[2] & 0x1F        # imm5
            else:
                # Format: ADD DR, SR1, SR2
                code |= (self.operands[0] & 0x7) << 9  # DR
                code |= (self.operands[1] & 0x7) << 6  # SR1
                code |= self.operands[2] & 0x7         # SR2
        
        return code

This structure is vital because it:

Captures all information needed to generate machine code
Maintains source file information for error reporting
Supports different instruction formats through the has_imm5 flag

Error Handling: Programming Safety Net

When we’re translating assembly code to machine code, a lot can go wrong - undefined labels, values out of range, malformed instructions, and more. Rather than failing at the first error, a good assembler should catch all these issues and report them clearly to the programmer. Here’s how we structure our error handling system:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34


from enum import Enum, auto

class ErrorType(Enum):
    LEXICAL = auto()    # Invalid characters, malformed tokens
    SYNTAX = auto()     # Wrong instruction format
    SEMANTIC = auto()   # Invalid operations, bad immediates
    SYMBOL = auto()     # Undefined or duplicate labels
    RANGE = auto()      # Values out of allowed ranges

@dataclass
class ErrorMessage:
    filename: str
    line_number: int
    type: ErrorType
    message: str

class ErrorHandler:
    def __init__(self):
        self.errors: list[ErrorMessage] = []
        
    def add_error(self, filename: str, line_number: int, 
                 error_type: ErrorType, message: str) -> bool:
        self.errors.append(
            ErrorMessage(filename, line_number, error_type, message)
        )
        return True
    
    def has_errors(self) -> bool:
        return len(self.errors) > 0
    
    def print_errors(self) -> None:
        for error in self.errors:
            print(f"{error.filename}:{error.line_number}: "
                  f"{error.type.name}: {error.message}")

This structure allows us to:

Collect multiple errors before stopping
Provide detailed context for each error
Categorize errors for better user feedback

The File Handler: Managing Input

Reading assembly source requires careful file handling, especially in C where we need to manage memory manually. Here’s a simplified view:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


class FileHandler:
    def __init__(self, filename: str):
        self.filename = filename
        self.current_line = 0
        self.file = None
    
    def __enter__(self):
        self.file = open(self.filename, 'r')
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.file:
            self.file.close()
    
    def read_next_line(self) -> SourceLine | None:
        if not self.file:
            return None
            
        line = self.file.readline()
        if not line:
            return None
            
        self.current_line += 1
        
        # Parse into SourceLine components
        # (label, operation, operands, comment)
        components = self._parse_line(line)
        return SourceLine(
            *components,
            line_number=self.current_line
        )
    
    def rewind(self) -> bool:
        """Reset to file beginning for second pass."""
        if not self.file:
            return False
        self.file.seek(0)
        self.current_line = 0
        return True

The file handler is crucial because it:

Manages file resources safely
Keeps track of line numbers for error reporting
Provides a clean interface for the two-pass process

Putting It All Together: The Data Structure Orchestra

Like musicians in an orchestra, each of our data structures plays a crucial role in transforming assembly code into machine code. They need to work together seamlessly, each performing its part at exactly the right time. Let’s revisit our pipeline diagram:

(Caption: Note that core data structures (SourceLine, Tokens, InstructionRecord) facilitate communication between pipeline stages while the Symbol Table and Error Handler provide cross-cutting support throughout the assembly process.)

The beauty of this design is that each data structure has a single, well-defined responsibility, making the code easier to maintain and debug. For example, if we need to add support for new pseudo-ops, we only need to modify the parser and its instruction record handling, leaving other components unchanged.

Performance Considerations: Balancing Theory and Reality

When designing an assembler, we can’t just pick data structures based on theoretical performance alone. We need to consider real-world constraints like memory usage, cache behavior, and the specific patterns of assembly code. Here’s how our choices balanced these different factors:

Symbol Table: Hash table with chaining
- Average O(1) lookup
- Space complexity O(n) where n is number of labels
- Good cache locality within buckets
Error Collection: Dynamic array (vector)
- O(1) amortized append
- Sequential memory access for reporting
- Minimal memory overhead
Source Line: Fixed-size structure
- Predictable memory usage
- Easy to pass between pipeline stages
- Efficient stack allocation in C

Conclusion: Building a Bridge Between Human and Machine

At first glance, an assembler might seem like a simple translator - just converting text into numbers. But as we’ve seen, the reality is far more intricate. We need carefully crafted data structures to handle the nuances of assembly language, manage memory efficiently, and provide helpful feedback when things go wrong. Our design choices reflect a careful balance between multiple competing needs:

Performance (hash tables for symbol lookup)
Memory efficiency (fixed-size structures where possible)
Error handling (comprehensive error collection)
Code maintainability (clear separation of concerns)

In our next post, we’ll dive into how these data structures come into play during actual instruction encoding, where we’ll see how they help us transform assembly mnemonics into binary machine code.

Remember, good data structures are like a solid foundation—invest time in getting them right, and the rest of our code will thank us for it!

The Building Blocks: Core Data Structures#

Source Line Representation: Breaking Down Assembly Code#

Symbol Table: The Address Book of Our Assembler#

Instruction Record: The Bridge Between Assembly and Machine Code#

Error Handling: Programming Safety Net#

The File Handler: Managing Input#

Putting It All Together: The Data Structure Orchestra#

Performance Considerations: Balancing Theory and Reality#

Conclusion: Building a Bridge Between Human and Machine#