Implementing Interfaces for an LC-3 Assembler in C

In my previous post, we explored the theoretical foundations of the data structures needed for our LC-3 assembler. Today, we’ll dive into how these abstract concepts translate into actual C code. While many modern languages offer high-level abstractions and built-in data structures, implementing these in C requires us to get our hands dirty with manual memory management and careful pointer manipulation.

graph TD
    A[Header Files] -->|Defines Interfaces| B[Data Structures]
    B --> C[Memory Management]
    B --> D[Implementation]
    C --> E[Allocation]
    C --> F[Deallocation]
    D --> G[Core Functions]
    D --> H[Error Handling]

(Caption: The relationship between our interfaces, data structures, and their implementations in C)

The Interface Layer: Header Files

Let’s start with how we define our data structures. In C, we typically declare our structures and their interfaces in header (.h) files. This separation of interface and implementation is crucial for maintainability. Here’s how we defined our core structures:

// common/data_structures.h
#include <stdbool.h>  // bool
#include <stdint.h>   // uint16_t

typedef struct {
    char* name;         // label name (dynamically allocated)
    uint16_t address;   // memory address (0x0000 to 0xFFFF)
    int line_number;    // source line number
    bool is_defined;    // defined vs referenced
} symbol_entry_t;

typedef struct {
    uint16_t opcode;           // 4-bit operation code
    uint16_t operands[3];      // up to 3 operands
    char* label;               // associated label
    uint16_t address;          // memory address
    int line_number;           // source line number
    bool has_imm5;            // immediate mode flag
} instruction_record_t;

typedef struct {
    char* label;      // optional label
    char* operation;  // operation or pseudo-op
    char* operands;   // raw operands string
    char* comment;    // optional comment
    int line_number;  // line number
} source_line_t;

Why do we use typedef struct? This is a C idiom that allows us to refer to our structure types without having to write struct every time. The _t suffix is a common convention indicating that this is a type definition.
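
To make the difference concrete, here is a minimal sketch that declares the same one-field structure both ways. The names tagged_entry, entry_t, tagged_address and typedef_address are invented for this illustration and are not part of the assembler.

```c
#include <stdint.h>

// Without typedef: the type must be spelled "struct tagged_entry" everywhere.
struct tagged_entry {
    uint16_t address;
};

// With typedef: the anonymous struct gets the one-word name "entry_t".
typedef struct {
    uint16_t address;
} entry_t;

// Hypothetical accessors, only here to show how each type reads in a signature.
uint16_t tagged_address(const struct tagged_entry* e) { return e->address; }
uint16_t typedef_address(const entry_t* e) { return e->address; }
```

Both versions have the same layout and cost; the typedef only changes how the type name is written.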

Memory Management: The C Way

One of the biggest differences when implementing data structures in C versus higher-level languages is manual memory management. Let’s look at how we handle this for our symbol table:

// symbol_table.c
symbol_table_t* create_symbol_table(void) {
    symbol_table_t* st = malloc(sizeof(symbol_table_t));
    if (st == NULL) return NULL;  // malloc can fail!
    
    st->size = MAX_SYMBOLS;
    st->buckets = calloc(st->size, sizeof(symbol_node_t*));
    if (st->buckets == NULL) {
        free(st);  // Clean up if second allocation fails
        return NULL;
    }

    return st;
}

void free_symbol_table(symbol_table_t* st) {
    if (st == NULL) return;

    for (size_t i = 0; i < st->size; i++) {
        symbol_node_t* node = st->buckets[i];
        while (node != NULL) {
            symbol_node_t* temp = node;
            node = node->next;
            free(temp->entry.name);  // Free the label name string
            free(temp);              // Then free the node itself
        }
    }

    free(st->buckets);
    free(st);
}

A few key points about memory management in C:

- Always check malloc returns: Unlike languages with exceptions, C’s malloc silently returns NULL on failure.
- Free inner allocations first: Release what a structure points to (each node’s name, then the node, then the bucket array) before the structure itself. Once the container is freed, its pointers can no longer be read safely, so anything still hanging off it leaks.
- Null checks everywhere: Always check for NULL before dereferencing pointers.
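
The snippet above uses symbol_table_t and symbol_node_t without showing their definitions. The following is a sketch of what they might look like, inferred only from the fields the code touches (buckets, size, entry, next); the real definitions may differ. The count_symbols helper is hypothetical and walks the chains the same way free_symbol_table does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char* name;
    uint16_t address;
    int line_number;
    bool is_defined;
} symbol_entry_t;  // same fields as in data_structures.h

// ASSUMPTION: one node per symbol, chained through `next` within a bucket.
typedef struct symbol_node {
    symbol_entry_t entry;
    struct symbol_node* next;
} symbol_node_t;

// ASSUMPTION: the table is an array of `size` chain heads.
typedef struct {
    symbol_node_t** buckets;
    size_t size;
} symbol_table_t;

// Hypothetical helper: counts the entries by visiting every chain.
size_t count_symbols(const symbol_table_t* st) {
    if (st == NULL || st->buckets == NULL) return 0;
    size_t count = 0;
    for (size_t i = 0; i < st->size; i++) {
        for (const symbol_node_t* n = st->buckets[i]; n != NULL; n = n->next) {
            count++;
        }
    }
    return count;
}
```

Note that symbol_node needs a struct tag here, because the structure refers to itself before the typedef name exists.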

String Handling: A Special Challenge

In C, strings are just arrays of characters terminated by a null byte (\0). This means we need to be extra careful when handling strings:

// file_handler.c
bool read_next_line(file_handler_t* fh, source_line_t* source_line) {
    char buffer[MAX_LINE_LENGTH];
    if (fgets(buffer, MAX_LINE_LENGTH, fh->file) == NULL) 
        return false;
    
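    // Trim trailing whitespace (including the newline that fgets keeps).
    // Working from the length avoids forming a pointer before the buffer
    // when the line is empty, and also trims a line that is a lone "\n".
    size_t len = strlen(buffer);
    while (len > 0 && isspace((unsigned char)buffer[len - 1])) {
        buffer[--len] = '\0';
    }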
    
    // Make our own copy of the string
    source_line->operation = strdup(buffer);
    if (source_line->operation == NULL) 
        return false;  // strdup can fail
    
    return true;
}

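Why cast to unsigned char when using isspace? The <ctype.h> functions take an int whose value must either be representable as an unsigned char or equal EOF. On platforms where plain char is signed, a byte above 0x7F becomes a negative value, and passing it to isspace is undefined behavior. The cast avoids that. It’s these little platform-specific details that make C programming both challenging and rewarding.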
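
As a small sketch of the same idea in isolation, the trimming step can be pulled into its own function. The name trim_trailing_whitespace is hypothetical; it is not taken from the assembler’s source.

```c
#include <ctype.h>
#include <string.h>

// Hypothetical helper: removes trailing whitespace in place and
// returns the new length of the string.
size_t trim_trailing_whitespace(char* s) {
    size_t len = strlen(s);
    // Index from the length so an empty or all-whitespace string never
    // produces a pointer before the start of the buffer.
    while (len > 0 && isspace((unsigned char)s[len - 1])) {
        s[--len] = '\0';
    }
    return len;
}
```

Called on "ADD R1, R2, R3 \t\n", it leaves "ADD R1, R2, R3" in the buffer; called on a line that is only whitespace, it leaves an empty string.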

Error Handling: Without Exceptions

C doesn’t have exception handling, so we need to design our error handling carefully. Here’s our approach:

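// error_handler.c
bool add_error(error_handler_t* eh, const char* filename, 
               int line_number, error_type type, const char* message) {
    if (eh == NULL || filename == NULL || message == NULL) return false;

    if (eh->count >= eh->capacity) {
        // Doubling a capacity of 0 would stay 0, so fall back to a small start
        size_t new_capacity = eh->capacity ? eh->capacity * 2 : 8;
        error_message_t* new_errors = realloc(eh->errors, 
            new_capacity * sizeof(error_message_t));
        if (new_errors == NULL) return false;
        
        eh->errors = new_errors;
        eh->capacity = new_capacity;
    }
    
    // Copy both strings before touching the array, so a failed strdup
    // cannot leave a half-filled entry behind
    char* filename_copy = strdup(filename);
    char* message_copy = strdup(message);
    if (filename_copy == NULL || message_copy == NULL) {
        free(filename_copy);  // free(NULL) is a no-op
        free(message_copy);
        return false;
    }

    eh->errors[eh->count].filename = filename_copy;
    eh->errors[eh->count].line_number = line_number;
    eh->errors[eh->count].type = type;
    eh->errors[eh->count].message = message_copy;
    
    eh->count++;
    return true;
}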

Note how we:

- Use boolean return values to indicate success/failure
- Grow our error array dynamically
- Make copies of strings to prevent lifetime issues
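
Because add_error stores its own copies of the strings, something has to free them later. Below is a sketch of how the surrounding pieces could look. The struct layouts are inferred from the fields add_error uses, the SYNTAX_ERROR enumerator and the starting capacity of 8 are assumptions, and the bodies of create_error_handler, free_error_handler and has_errors are guesses at functions the post only calls.

```c
#include <stdbool.h>
#include <stdlib.h>

// ASSUMPTION: only SYMBOL_ERROR appears in the post; SYNTAX_ERROR is a placeholder.
typedef enum { SYNTAX_ERROR, SYMBOL_ERROR } error_type;

typedef struct {
    char* filename;   // owned copy
    int line_number;
    error_type type;
    char* message;    // owned copy
} error_message_t;

typedef struct {
    error_message_t* errors;
    size_t count;
    size_t capacity;
} error_handler_t;

error_handler_t* create_error_handler(void) {
    error_handler_t* eh = malloc(sizeof *eh);
    if (eh == NULL) return NULL;
    eh->count = 0;
    eh->capacity = 8;  // ASSUMPTION: arbitrary non-zero starting capacity
    eh->errors = calloc(eh->capacity, sizeof *eh->errors);
    if (eh->errors == NULL) {
        free(eh);  // undo the first allocation before reporting failure
        return NULL;
    }
    return eh;
}

bool has_errors(const error_handler_t* eh) {
    return eh != NULL && eh->count > 0;
}

void free_error_handler(error_handler_t* eh) {
    if (eh == NULL) return;
    // Free the owned strings first, then the array, then the handler.
    for (size_t i = 0; i < eh->count; i++) {
        free(eh->errors[i].filename);
        free(eh->errors[i].message);
    }
    free(eh->errors);
    free(eh);
}
```

Only the first count entries are visited when freeing, since the slots beyond that were never given strings.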

Data Structure Implementation: Symbol Table

Let’s look at how we implemented our hash table-based symbol table:

// symbol_table.c
bool add_symbol(symbol_table_t* st, const char* name, 
                uint16_t address, int line_number) {
    if (st == NULL || name == NULL) return false;

    uint32_t index = hash(name, st->size);
    
    // Check for duplicates
    symbol_node_t* node = st->buckets[index];
    while (node != NULL) {
        if (strcmp(node->entry.name, name) == 0) 
            return false;  // Already exists
        node = node->next;
    }

    // Create new node
    symbol_node_t* new_node = malloc(sizeof(symbol_node_t));
    if (new_node == NULL) return false;

    // Make our own copy of the name
    new_node->entry.name = strdup(name);
    if (new_node->entry.name == NULL) {
        free(new_node);
        return false;
    }

    // Initialize other fields
    new_node->entry.address = address;
    new_node->entry.line_number = line_number;
    new_node->entry.is_defined = true;

    // Add to front of chain
    new_node->next = st->buckets[index];
    st->buckets[index] = new_node;

    return true;
}

Some important C-specific considerations here:

- Defensive Programming: We check all pointers before using them.
- String Ownership: We make our own copy of the name string using strdup.
- Cleanup on Failure: If any allocation fails, we clean up what we’ve already allocated.
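
The hash function called by add_symbol is not shown in this post. One common choice for short identifier strings is a djb2-style hash; the sketch below is an assumption about how such a function could look, with a signature chosen to match the call hash(name, st->size).

```c
#include <stddef.h>
#include <stdint.h>

// djb2-style string hash reduced to a bucket index in [0, size).
// ASSUMPTION: this is not necessarily the hash the assembler uses.
uint32_t hash(const char* name, size_t size) {
    uint32_t h = 5381;
    for (const unsigned char* p = (const unsigned char*)name; *p != '\0'; p++) {
        h = h * 33u + *p;  // unsigned arithmetic wraps around, which is well defined
    }
    return (uint32_t)(h % size);  // the caller must pass size > 0
}
```

Reading the characters through unsigned char keeps the result the same on platforms where plain char is signed.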

Testing in C: Unity Framework

Testing C code requires a different approach than in higher-level languages. We use the Unity testing framework:

// test_symbol_table.c
void test_add_symbol_valid(void) {
    symbol_table_t* st = create_symbol_table();
    TEST_ASSERT_NOT_NULL(st);

    bool ok = add_symbol(st, "LOOP", 0x3000, 1);
    TEST_ASSERT_TRUE(ok);

    symbol_entry_t* entry;
    bool found = lookup_symbol(st, "LOOP", &entry);
    TEST_ASSERT_TRUE(found);
    TEST_ASSERT_EQUAL_STRING("LOOP", entry->name);
    TEST_ASSERT_EQUAL_UINT16(0x3000, entry->address);

    free_symbol_table(st);
}

The test above covers the happy path. Across the suite as a whole, we:

- Always test memory allocation failures
- Check for memory leaks
- Test edge cases explicitly
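
One way to exercise allocation failures is to route allocations through a wrapper that can be told to fail. The sketch below is a generic fault-injection technique, not part of Unity and not necessarily how this project’s tests do it; set_allocation_budget, test_malloc and copy_name are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

// Number of allocations still allowed to succeed; negative means "never fail".
static int allocations_until_failure = -1;

void set_allocation_budget(int successful_allocations) {
    allocations_until_failure = successful_allocations;
}

// Hypothetical malloc wrapper that returns NULL once the budget is used up.
void* test_malloc(size_t size) {
    if (allocations_until_failure == 0) return NULL;
    if (allocations_until_failure > 0) allocations_until_failure--;
    return malloc(size);
}

// Example code under test: copies a string and reports failure as NULL.
char* copy_name(const char* name) {
    size_t len = strlen(name) + 1;  // include the terminator
    char* copy = test_malloc(len);
    if (copy == NULL) return NULL;
    memcpy(copy, name, len);
    return copy;
}
```

With a budget of 1, the first copy_name call succeeds and the second returns NULL, which lets a test check that the failure path is handled without leaking.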

Putting It All Together

All these data structures work together in our main assembler loop:

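// main.c
int main(int argc, char* argv[]) {
    if (argc < 2) {
        fprintf(stderr, "error: no input file given\n");
        return EXIT_FAILURE;
    }

    int status = EXIT_FAILURE;     // assume failure until every pass succeeds
    uint16_t current_address = 0;  // location counter; how it advances is elided here
    source_line_t line = {0};      // fields stay NULL until read_next_line fills them

    file_handler_t* fh = create_file_handler(argv[1]);
    symbol_table_t* st = create_symbol_table();
    error_handler_t* eh = create_error_handler();
    
    if (!fh || !st || !eh) {
        // Handle initialization failure
        goto cleanup;
    }

    // First pass: collect symbols
    while (read_next_line(fh, &line)) {
        if (line.label) {
            if (!add_symbol(st, line.label, current_address, 
                          line.line_number)) {
                add_error(eh, argv[1], line.line_number, 
                         SYMBOL_ERROR, "Duplicate symbol");
                goto cleanup;
            }
        }
        // Process rest of line, then free the strings read_next_line allocated...
    }

    // More processing...

    // Read the error state before the handler is freed
    if (!has_errors(eh)) status = EXIT_SUCCESS;

cleanup:
    // Each free_* function is expected to accept NULL, as free_symbol_table does
    free_file_handler(fh);
    free_symbol_table(st);
    free_error_handler(eh);
    return status;
}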

Notice the use of goto cleanup for error handling. While goto is often considered harmful, it’s a common and accepted pattern in C for cleanup in error cases.
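
Here is the same pattern in a small, self-contained function, as a sketch rather than code from the assembler. The function format_labeled_line is hypothetical: it upper-cases a label into a scratch buffer and builds "LABEL:operation" in a second allocation. Every exit path goes through cleanup, and the result is decided before anything is freed.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

bool format_labeled_line(const char* label, const char* operation, char** out) {
    bool ok = false;
    char* upper = NULL;   // scratch buffer, always freed
    char* result = NULL;  // handed to the caller only on success
    size_t label_len = 0;
    size_t total = 0;

    if (label == NULL || operation == NULL || out == NULL) goto cleanup;

    label_len = strlen(label);
    upper = malloc(label_len + 1);
    if (upper == NULL) goto cleanup;
    for (size_t i = 0; i <= label_len; i++) {
        // i reaches label_len, so the terminator is copied too
        upper[i] = (char)toupper((unsigned char)label[i]);
    }

    total = label_len + 1 + strlen(operation) + 1;  // label, ':', operation, '\0'
    result = malloc(total);
    if (result == NULL) goto cleanup;
    snprintf(result, total, "%s:%s", upper, operation);

    *out = result;
    result = NULL;  // ownership moved to the caller; cleanup must not free it
    ok = true;

cleanup:
    free(upper);   // free(NULL) is a no-op, so early exits are safe
    free(result);
    return ok;
}
```

Initializing every pointer to NULL up front is what makes a single cleanup label work: whichever step failed, the frees below it are harmless.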

Conclusion: C Is Beautiful, at a Cost

While implementing data structures in C requires more careful attention to detail than in higher-level languages, it offers several advantages:

- Complete Control: We know exactly how our memory is being used.
- Performance: No hidden allocations or abstractions.
- Portability: C code can run almost anywhere, because compilers are comparatively simple, the runtime needs little memory, and nearly every platform has a C toolchain. The caveat is that “write once, run everywhere” does not come for free.

Let’s explore that last point about portability. Despite C’s excellent portability, developers often encounter platform-specific challenges:

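Integer Sizes: int is only guaranteed to be at least 16 bits. It is 16 bits on many small microcontrollers and 32 bits on most desktop platforms, while long is 32 bits on 64-bit Windows and 64 bits on 64-bit Linux. Always use fixed-width types (like uint32_t) when size matters.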

// Don't do this
int potentially_different_size;  // Size varies by platform

// Do this instead
uint32_t guaranteed_32_bits;    // Same size everywhere
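
For an LC-3 assembler the size that matters is 16 bits. The sketch below shows two ways to lean on that: a compile-time check, and address arithmetic that wraps at 16 bits regardless of how wide int is. The helper names lc3_word_bits and next_address are made up for this example.

```c
#include <limits.h>
#include <stdint.h>

// Compile-time check (C11): the build fails if the assumption is wrong.
_Static_assert(sizeof(uint16_t) * CHAR_BIT == 16, "an LC-3 word must be 16 bits");

// Hypothetical helper: width in bits of the type used for LC-3 words.
unsigned lc3_word_bits(void) {
    return (unsigned)(sizeof(uint16_t) * CHAR_BIT);
}

// Hypothetical helper: the address after `address`, wrapping from 0xFFFF to 0x0000.
// The addition happens in a wider type after promotion, so the cast back
// to uint16_t is what performs the wrap.
uint16_t next_address(uint16_t address) {
    return (uint16_t)(address + 1u);
}
```

The cast in next_address is the important part: without it, the intermediate result of the addition would keep the width of the promoted type.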

Byte Ordering: Platforms differ in whether the most or the least significant byte of a multi-byte value is stored first:

// This might work differently on big-endian vs little-endian systems
uint16_t value = 0x1234;
uint8_t* raw_bytes = (uint8_t*)&value;  // raw_bytes[0] is 0x34 on little-endian, 0x12 on big-endian

// Better approach: explicit byte handling
uint8_t bytes[2] = {
    (value >> 8) & 0xFF,  // Most significant byte
    value & 0xFF          // Least significant byte
};
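
Wrapped into a pair of functions, the explicit approach looks like the sketch below. It assumes the output format wants the most significant byte first; store_u16_be and load_u16_be are hypothetical names, not functions from the assembler.

```c
#include <stdint.h>

// Writes `value` into out[0..1], most significant byte first.
void store_u16_be(uint16_t value, uint8_t out[2]) {
    out[0] = (uint8_t)(value >> 8);
    out[1] = (uint8_t)(value & 0xFF);
}

// Reads a 16-bit value stored most significant byte first.
uint16_t load_u16_be(const uint8_t in[2]) {
    return (uint16_t)(((uint16_t)in[0] << 8) | in[1]);
}
```

Because the bytes are produced with shifts rather than by reinterpreting memory, the result is the same on big-endian and little-endian hosts.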

Compiler Variations: Different compilers might:
- Pack structures differently
- Have different preprocessor behaviors
- Handle undefined and implementation-defined behavior differently
- Support different C standard versions/features

Memory Alignment: Some platforms have strict alignment requirements:

// Might work on x86, fail on ARM
char buffer[sizeof(int)];
int* potentially_misaligned = (int*)buffer;

// Better: use aligned types or explicit alignment
// (in C11, alignas and alignof come from <stdalign.h>)
alignas(alignof(int)) char aligned_buffer[sizeof(int)];
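
When the bytes come from somewhere we do not control, such as the middle of a file buffer, another option is to copy them into a properly typed variable with memcpy instead of casting the pointer. A minimal sketch, with read_u32_unaligned as a hypothetical helper name:

```c
#include <stdint.h>
#include <string.h>

// Reads a uint32_t in host byte order from any address, aligned or not.
uint32_t read_u32_unaligned(const unsigned char* p) {
    uint32_t value;
    memcpy(&value, p, sizeof value);  // byte-wise copy, no alignment requirement
    return value;
}
```

This reads the value in the host’s byte order, so it addresses alignment only; byte ordering still has to be handled separately.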

The key to successful C programming (as far as I have found so far) is therefore to:

- Design your interfaces carefully
- Be consistent with memory management
- Use clear patterns for error handling
- Test thoroughly, including memory management
- Be aware of platform-specific assumptions
- Document any platform dependencies
- Use platform-independent types and constructs when possible

In our next post, we’ll look at how we use these data structures to implement the actual instruction encoding phase of our assembler, keeping in mind these cross-platform considerations.

Lessons learned: when working in C, assume nothing, check everything, always clean up after ourselves, and never take platform compatibility for granted!
