In our previous discussions, we explored why I embarked on creating an LC-3 assembler (and why I chose to write it in C) and took a closer look at LC-3 instructions themselves. Today, we’ll dive into the high-level designs that underpin the assembler and examine the trade-offs that guided my final architectural decisions.
Note: All Python snippets here serve as illustrative pseudocode only. The actual implementation is (and will be) developed in C.
Problem (Re-)Statement
An assembler is like a language translator: it converts assembly code into machine code. Consider a snippet of LC-3 assembly along these lines:
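```
        .ORIG x3000
        AND  R0, R0, #0     ; clear R0 (this also sets the Z condition code)
        BRnz NEXT           ; forward reference: NEXT is defined two lines later
        ADD  R0, R0, #1     ; skipped when the branch is taken
NEXT    HALT
        .END
```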
It appears simple: each line translates to a numerical opcode or machine instruction. However, forward references such as BRnz NEXT introduce a challenge: the label NEXT is not defined until later in the code, so the assembler needs a strategy for handling symbols before it has actually encountered them.
The Great Debate: One Pass vs. Two Pass
One of the first major design decisions was whether to read through the source once or twice:
One-pass assembler
A one-pass assembler processes the source code in a single pass, translating instructions and resolving symbols as it goes. Forward references are handled by recording “fixups” wherever a label is not yet known, then patching those locations once the label definitions appear.
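Here is a deliberately simplified Python-pseudocode sketch of that idea. The pre-parsed item format and the names used here are invented for illustration; real LC-3 encoding would pack opcodes and PC-relative offsets into 16-bit words.

```python
# Toy one-pass assembler: the "program" is a list of pre-parsed items, where
# ("label", name) defines a label, ("branch", name) needs that label's address,
# and ("word", n) is an already-encoded value.
def assemble_one_pass(items):
    output, symbols, fixups = [], {}, []
    for kind, value in items:
        if kind == "label":
            symbols[value] = len(output)       # label points at the next word
        elif kind == "branch":
            if value in symbols:
                output.append(symbols[value])  # backward reference: resolve now
            else:
                fixups.append((len(output), value))
                output.append(None)            # placeholder word, patched later
        else:  # "word"
            output.append(value)
    for index, label in fixups:                # revisit every placeholder
        output[index] = symbols[label]
    return output

print(assemble_one_pass([("branch", "NEXT"), ("word", 1), ("label", "NEXT"), ("word", 2)]))
# -> [2, 1, 2]
```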
Although a single pass can be more memory-efficient, it can make the code more complex because you’re juggling forward references in real time.
Two-pass assembler
In contrast, the two-pass assembler reads the entire source twice. The first pass builds a comprehensive symbol table, and the second pass uses that table to generate machine code with full knowledge of all addresses.
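Using the same toy item format as above, a two-pass sketch separates the concerns cleanly:

```python
# Toy two-pass assembler over the same pre-parsed item format.
def assemble_two_pass(items):
    # Pass 1: walk the program only to record label addresses.
    symbols, address = {}, 0
    for kind, value in items:
        if kind == "label":
            symbols[value] = address
        else:
            address += 1                       # labels occupy no space

    # Pass 2: emit code with the complete symbol table in hand.
    output = []
    for kind, value in items:
        if kind == "branch":
            output.append(symbols[value])      # every label is already known
        elif kind == "word":
            output.append(value)
    return output
```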
Ultimately, I chose the two-pass approach, as it keeps symbol resolution neat and avoids the complexity of intermixing “fixup” logic with code generation.
Building the Pipeline
After deciding on two passes, I needed an overarching structure for the assembler itself. I opted for a modular pipeline architecture—think of it as a factory production line where each station has one specific job.
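In Python pseudocode, the skeleton might look roughly like this, with each stage standing in for a future C module (the stage bodies are trivial placeholders):

```python
# Pipeline skeleton: each stage has one job and hands its result to the next.
# The bullet list below describes what the real C modules will do.
def lex(line):
    return line.split(";")[0].split()          # Lexer: drop comments, tokenize

def parse(tokens):
    if not tokens:
        return None                            # blank or comment-only line
    return {"opcode": tokens[0], "operands": tokens[1:]}   # Parser

def encode(instruction):
    return 0x0000                              # Encoder: placeholder machine word

def assemble(source_lines):
    words = []                                 # Writer would stream these to a file
    for line in source_lines:
        instruction = parse(lex(line))
        if instruction is not None:
            words.append(encode(instruction))
    return words
```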
- Lexer: Strips comments and breaks lines into tokens.
- Parser: Converts tokens into an internal representation of the instruction (opcode + operands).
- Encoder: Turns that instruction representation into machine code.
- Writer: Outputs the machine code to the final binary or object file.
This isolation of responsibilities keeps the code more maintainable and testable.
Alternative Paths Not Taken
Event-Driven Architecture
One possible alternative is an event-driven design, where each component subscribes to and emits events as data flows between them:
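A minimal sketch of what that could look like, assuming a hypothetical EventBus with subscribe and emit methods:

```python
# Stages communicate by emitting and subscribing to named events
# instead of calling each other directly.
class EventBus:
    def __init__(self):
        self.handlers = {}

    def subscribe(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def emit(self, event, payload):
        for handler in self.handlers.get(event, []):
            handler(payload)

bus = EventBus()
bus.subscribe("line_read",    lambda line: bus.emit("tokens_ready", line.split()))
bus.subscribe("tokens_ready", lambda tokens: print("parse", tokens))
bus.emit("line_read", "ADD R1, R1, #1")
```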
While flexible and extensible, this adds more complexity than necessary for a straightforward LC-3 assembler.
Table-Driven Design
Another approach is a table-driven design, where instruction formats and validation rules are stored in a data structure:
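A sketch of the idea, with an illustrative (and deliberately incomplete) instruction table:

```python
# Each opcode's format lives in a table instead of a chain of if/else branches.
INSTRUCTION_TABLE = {
    "ADD": {"opcode": 0b0001, "operands": ["DR", "SR1", "SR2_or_imm5"]},
    "AND": {"opcode": 0b0101, "operands": ["DR", "SR1", "SR2_or_imm5"]},
    "LD":  {"opcode": 0b0010, "operands": ["DR", "PCoffset9"]},
    "BR":  {"opcode": 0b0000, "operands": ["PCoffset9"]},
}

def validate(mnemonic, operands):
    entry = INSTRUCTION_TABLE.get(mnemonic)
    if entry is None:
        return f"unknown instruction: {mnemonic}"
    if len(operands) != len(entry["operands"]):
        return f"{mnemonic} expects {len(entry['operands'])} operand(s)"
    return None   # no error
```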
Table-driven designs can reduce “if-else” statements and make it easier to add instructions. However, managing more complex logic, such as labels and condition codes, may still require additional structures or logic.
Error Handling
A robust assembler must give clear feedback when users make mistakes. The two-pass approach makes it easier to detect and report errors at well-defined points: duplicate labels during symbol collection (first pass), and undefined labels or invalid operands during encoding (second pass):
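A rough sketch of the error-collection idea; the parsed-line format and field names here are hypothetical:

```python
KNOWN_OPCODES = {"ADD", "AND", "NOT", "BR", "LD", "ST", "JMP", "TRAP"}  # subset for illustration

# Gather every problem into a list, then report them together at the end.
def check_program(parsed_lines, symbols):
    errors = []
    for number, instruction in parsed_lines:          # (line number, parsed dict) pairs
        opcode = instruction.get("opcode")
        target = instruction.get("label_operand")
        if opcode not in KNOWN_OPCODES:
            errors.append(f"line {number}: unknown opcode '{opcode}'")
        if target is not None and target not in symbols:
            errors.append(f"line {number}: undefined label '{target}'")
    for message in errors:
        print(message)
    return not errors                                 # True means assembly can proceed

check_program([(3, {"opcode": "BR", "label_operand": "LOOP"}),
               (7, {"opcode": "XYZ", "label_operand": None})],
              {"NEXT": 0x3005})
# line 3: undefined label 'LOOP'
# line 7: unknown opcode 'XYZ'
```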
Collecting all errors first and then displaying them together is much more user-friendly than aborting at the first one.
Memory Management Considerations
When implementing in C, the memory implications of our architectural choices become particularly significant. The two-pass approach requires maintaining the entire symbol table in memory, which means careful memory allocation and deallocation strategies. Here’s how our design handles this:
In the first pass, we allocate memory for the symbol table conservatively, using a simple but efficient hash table implementation that grows as needed. While this might seem like a memory overhead compared to a one-pass approach, it actually helps prevent memory fragmentation that could occur with the multiple allocations needed for fixup records in a one-pass design. The symbol table’s memory footprint is also predictable and proportional to the number of labels in the source code, making it easier to manage resource constraints on smaller systems.
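In Python-flavoured pseudocode, the growth policy might look something like the sketch below. The real version is a C hash table with explicit allocation and rehashing, so this only illustrates the load-factor rule:

```python
class SymbolTable:
    LOAD_FACTOR = 0.75

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = {}                 # stands in for the C bucket array

    def define(self, label, address):
        if (len(self.entries) + 1) / self.capacity > self.LOAD_FACTOR:
            self.capacity *= 2            # the C table would also reallocate buckets and rehash here
        self.entries[label] = address

    def lookup(self, label):
        return self.entries.get(label)
```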
Testing the Pipeline
One major advantage of our modular pipeline architecture is how naturally it lends itself to testing. Each component can be tested in isolation, which is particularly valuable when implementing in C, where debugging can be more challenging. For example:
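Reusing the pseudocode helpers sketched in earlier sections, component-level tests might look like this (the real tests will be small C functions driven by a simple harness):

```python
# Each pipeline stage is exercised in isolation, with no full assembly run needed.
def test_lexer_strips_comments():
    assert lex("ADD R1, R1, #1 ; increment") == ["ADD", "R1,", "R1,", "#1"]

def test_parser_separates_opcode_and_operands():
    parsed = parse(["ADD", "R1,", "R1,", "#1"])
    assert parsed["opcode"] == "ADD"
    assert parsed["operands"] == ["R1,", "R1,", "#1"]

def test_two_pass_resolves_forward_reference():
    program = [("branch", "NEXT"), ("word", 1), ("label", "NEXT"), ("word", 2)]
    assert assemble_two_pass(program) == [2, 1, 2]
```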
This testing structure allows us to catch issues early in the development process and makes it easier to maintain the codebase over time. Each component’s interface becomes a natural boundary for test cases, and we can verify edge cases without needing to construct complex assembly programs.
The Final Design & Conclusion
After weighing all these considerations—one-pass vs. two-pass, pipeline vs. event-driven vs. table-driven—I arrived at a two-pass solution with a modular pipeline. This approach:
- Separates symbol collection (first pass) and code generation (second pass) for clarity.
- Provides a clean “assembly line” structure where each phase has a single responsibility.
- Makes debugging easier, thanks to comprehensive error reporting.
In the next posts, we’ll dive deeper into the details of each pipeline stage—especially how labels, instructions, and error handling integrate seamlessly in C. By building on this solid architecture, the assembler remains both extensible and approachable, ensuring that future modifications or new features can be added without unravelling existing code.
The choice of C as our implementation language actually complements our architectural decisions well. The pipeline design maps naturally to C’s procedural nature, with each component implementing a clear interface through function pointers. This makes it straightforward to maintain separation of concerns while avoiding the complexity of managing virtual method tables that might come with an object-oriented approach. However, we face some challenges, particularly in string manipulation during lexing and in managing dynamic memory for the symbol table. These will be addressed through careful design of the data structures and effective use of C’s standard library; along the way, I hope to show that even without modern language features, a well-thought-out architecture can lead to clean, maintainable code.
Thanks for reading! If you have questions about these design choices or want to share your own assembler project experiences, feel free to leave a comment. Stay tuned for the next installment, where we’ll zoom in on how the internals of each pipeline component work together in practice.