Imagine we’re building a translation machine that needs to understand two very different languages: the human-readable assembly code that programmers write, and the binary machine code that computers execute. This is exactly what our LC-3 assembler does, and today we’re going to explore the data structures that make this translation possible.
In our previous discussions, we explored the why behind building an LC-3 assembler (link1 link2) and dove into the high-level architectural decisions. Now it’s time to look at how we actually represent and manipulate assembly code during the translation process. While our actual implementation is in C, I’ll use Python to illustrate these concepts more clearly - think of it as using a familiar language to explain ideas that work in any programming language.
The Building Blocks: Core Data Structures
Every assembly program is essentially a sequence of text lines that need to be transformed into machine code. But before we can do that transformation, we need ways to store and manipulate these different forms of data. Let’s look at the core data structures that make this possible.
Source Line Representation: Breaking Down Assembly Code
At the heart of our assembler is how we represent each line of assembly code. Consider this line:
|
|
We need to capture the label (“START”), the operation (“ADD”), the operands (“R1, R2, R3”), and any comments. Here’s how we structure this:
|
|
Why this structure? It allows us to:
- Preserve the original line’s components for error reporting
- Process each part independently during assembly
- Maintain source line numbers for debugging
Symbol Table: The Address Book of Our Assembler
Think of assembly labels like bookmarks in a massive book - they help us quickly find specific locations in our program. But with potentially hundreds or thousands of labels, we need an efficient way to store and look them up. That’s where our symbol table comes in, implemented as a hash table for lightning-fast access:
|
|
The symbol table is crucial during both passes:
- First pass: Collect all label definitions and their addresses
- Second pass: Resolve label references to actual addresses
We chose a hash table implementation because:
- O(1) average lookup time is essential for large assembly files
- The size (1024 buckets) provides a good balance between memory usage and collision avoidance
- Simple chaining for collision resolution keeps the code straightforward
Instruction Record: The Bridge Between Assembly and Machine Code
When parsing instructions, we need an intermediate representation that bridges the gap between assembly syntax and binary machine code:
|
|
This structure is vital because it:
- Captures all information needed to generate machine code
- Maintains source file information for error reporting
- Supports different instruction formats through the
has_imm5
flag
Error Handling: Programming Safety Net
When we’re translating assembly code to machine code, a lot can go wrong - undefined labels, values out of range, malformed instructions, and more. Rather than failing at the first error, a good assembler should catch all these issues and report them clearly to the programmer. Here’s how we structure our error handling system:
|
|
This structure allows us to:
- Collect multiple errors before stopping
- Provide detailed context for each error
- Categorize errors for better user feedback
The File Handler: Managing Input
Reading assembly source requires careful file handling, especially in C where we need to manage memory manually. Here’s a simplified view:
|
|
The file handler is crucial because it:
- Manages file resources safely
- Keeps track of line numbers for error reporting
- Provides a clean interface for the two-pass process
Putting It All Together: The Data Structure Orchestra
Like musicians in an orchestra, each of our data structures plays a crucial role in transforming assembly code into machine code. They need to work together seamlessly, each performing its part at exactly the right time. Let’s revisit our pipeline diagram:
The beauty of this design is that each data structure has a single, well-defined responsibility, making the code easier to maintain and debug. For example, if we need to add support for new pseudo-ops, we only need to modify the parser and its instruction record handling, leaving other components unchanged.
Performance Considerations: Balancing Theory and Reality
When designing an assembler, we can’t just pick data structures based on theoretical performance alone. We need to consider real-world constraints like memory usage, cache behavior, and the specific patterns of assembly code. Here’s how our choices balanced these different factors:
-
Symbol Table: Hash table with chaining
- Average O(1) lookup
- Space complexity O(n) where n is number of labels
- Good cache locality within buckets
-
Error Collection: Dynamic array (vector)
- O(1) amortized append
- Sequential memory access for reporting
- Minimal memory overhead
-
Source Line: Fixed-size structure
- Predictable memory usage
- Easy to pass between pipeline stages
- Efficient stack allocation in C
Conclusion: Building a Bridge Between Human and Machine
At first glance, an assembler might seem like a simple translator - just converting text into numbers. But as we’ve seen, the reality is far more intricate. We need carefully crafted data structures to handle the nuances of assembly language, manage memory efficiently, and provide helpful feedback when things go wrong. Our design choices reflect a careful balance between multiple competing needs:
- Performance (hash tables for symbol lookup)
- Memory efficiency (fixed-size structures where possible)
- Error handling (comprehensive error collection)
- Code maintainability (clear separation of concerns)
In our next post, we’ll dive into how these data structures come into play during actual instruction encoding, where we’ll see how they help us transform assembly mnemonics into binary machine code.
Remember, good data structures are like a solid foundation—invest time in getting them right, and the rest of our code will thank us for it!