In my previous post, we explored the theoretical foundations of the data structures needed for our LC-3 assembler. Today, we’ll dive into how these abstract concepts translate into actual C code. While many modern languages offer high-level abstractions and built-in data structures, implementing these in C requires us to get our hands dirty with manual memory management and careful pointer manipulation.
graph TD
A[Header Files] -->|Defines Interfaces| B[Data Structures]
B --> C[Memory Management]
B --> D[Implementation]
C --> E[Allocation]
C --> F[Deallocation]
D --> G[Core Functions]
D --> H[Error Handling]
(Caption: The relationship between our interfaces, data structures, and their implementations in C)
Let’s start with how we define our data structures. In C, we typically declare our structures and their interfaces in header (.h) files. This separation of interface and implementation is crucial for maintainability. Here’s how we defined our core structures:
```c
// common/data_structures.h
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    char* name;            // label name (dynamically allocated)
    uint16_t address;      // memory address (0x0000 to 0xFFFF)
    int line_number;       // source line number
    bool is_defined;       // defined vs referenced
} symbol_entry_t;

typedef struct {
    uint16_t opcode;       // 4-bit operation code
    uint16_t operands[3];  // up to 3 operands
    char* label;           // associated label
    uint16_t address;      // memory address
    int line_number;       // source line number
    bool has_imm5;         // immediate mode flag
} instruction_record_t;

typedef struct {
    char* label;           // optional label
    char* operation;       // operation or pseudo-op
    char* operands;        // raw operands string
    char* comment;         // optional comment
    int line_number;       // line number
} source_line_t;
```
Why do we use typedef struct? This is a C idiom that allows us to refer to our structure types without having to write struct every time. The _t suffix is a common convention indicating that this is a type definition.
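To make the difference concrete, here is a small sketch that declares the same structure both ways. The names `point` and `point_t` are made up for illustration and are not part of the assembler.

```c
#include <stdint.h>

// Without typedef: the keyword "struct" is part of the type name.
struct point {
    int16_t x;
    int16_t y;
};

// With typedef: point_t alone names the type.
typedef struct {
    int16_t x;
    int16_t y;
} point_t;

// Both functions do the same work; only the spelling of the type differs.
int sum_tagged(struct point p) { return p.x + p.y; }
int sum_typedef(point_t p)     { return p.x + p.y; }
```

One caveat: an anonymous `typedef struct { ... }` cannot refer to itself, so a linked-list node that holds a pointer to its own type still needs a struct tag.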
Memory Management: The C Way
One of the biggest differences when implementing data structures in C versus higher-level languages is manual memory management. Let’s look at how we handle this for our symbol table:
```c
// symbol_table.c
#include <stdlib.h>

symbol_table_t* create_symbol_table(void) {
    symbol_table_t* st = malloc(sizeof(symbol_table_t));
    if (st == NULL) return NULL;  // malloc can fail!

    st->size = MAX_SYMBOLS;
    // calloc zero-fills the bucket array, so every chain starts out empty
    st->buckets = calloc(st->size, sizeof(symbol_node_t*));
    if (st->buckets == NULL) {
        free(st);  // Clean up if second allocation fails
        return NULL;
    }
    return st;
}

void free_symbol_table(symbol_table_t* st) {
    if (st == NULL) return;

    for (size_t i = 0; i < st->size; i++) {
        symbol_node_t* node = st->buckets[i];
        while (node != NULL) {
            symbol_node_t* temp = node;
            node = node->next;        // Advance before freeing the current node
            free(temp->entry.name);   // Free the label name string
            free(temp);               // Then free the node itself
        }
    }
    free(st->buckets);
    free(st);
}
```
A few key points about memory management in C:
- Always check malloc returns: C has no exceptions, so malloc reports failure only by returning NULL.
- Clean up in reverse order: Free what a structure owns (the name string, then the node) before freeing the structure that points to it. Once the container is gone, its contents can no longer be reached safely.
- Null checks everywhere: Always check for NULL before dereferencing pointers.
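The same ownership rule applies to `source_line_t`, whose four strings are heap-allocated. The post does not show how a source line is released, so the helper below, `free_source_line`, is a hypothetical sketch rather than the project's actual code. It assumes every string field is either NULL or a pointer returned by `malloc` or `strdup`.

```c
#include <stdlib.h>

// Same layout as in common/data_structures.h
typedef struct {
    char* label;
    char* operation;
    char* operands;
    char* comment;
    int line_number;
} source_line_t;

// Hypothetical helper: releases the strings a source_line_t owns and resets
// the pointers, so calling it twice (or checking for NULL later) is harmless.
void free_source_line(source_line_t* line) {
    if (line == NULL) return;

    free(line->label);      // free(NULL) is a no-op, so unset fields are fine
    free(line->operation);
    free(line->operands);
    free(line->comment);

    line->label = NULL;
    line->operation = NULL;
    line->operands = NULL;
    line->comment = NULL;
}
```

For this to be safe, a line should start out zeroed (for example `source_line_t line = {0};`) so that fields the parser never fills in are NULL rather than garbage.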
String Handling: A Special Challenge
In C, strings are just arrays of characters terminated by a null byte (\0). This means we need to be extra careful when handling strings:
```c
// file_handler.c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>  // strdup is POSIX (standard C only since C23)

bool read_next_line(file_handler_t* fh, source_line_t* source_line) {
    char buffer[MAX_LINE_LENGTH];

    if (fgets(buffer, MAX_LINE_LENGTH, fh->file) == NULL)
        return false;

    // Trim trailing whitespace. Tracking the length avoids stepping
    // before the start of the buffer on an empty or all-blank line.
    size_t len = strlen(buffer);
    while (len > 0 && isspace((unsigned char)buffer[len - 1])) {
        buffer[--len] = '\0';
    }

    // Make our own copy of the string
    source_line->operation = strdup(buffer);
    if (source_line->operation == NULL)
        return false;  // strdup can fail

    return true;
}
```
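Why cast to unsigned char when using isspace? The <ctype.h> functions require an argument that is either EOF or representable as an unsigned char. On platforms where plain char is signed, any byte above 0x7F (extended ASCII, or part of a UTF-8 sequence) becomes a negative value, and passing that to isspace is undefined behavior. It’s these little platform-specific details that make C programming both challenging and rewarding.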
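The trimming step can be pulled out into a small standalone function, which makes both the cast and the empty-line case easy to see. This is a minimal sketch; `trim_trailing_whitespace` is a hypothetical name, not a function from the project.

```c
#include <ctype.h>
#include <string.h>

// Removes trailing whitespace in place and returns the new length.
// Works on empty strings and on strings that are entirely whitespace.
size_t trim_trailing_whitespace(char* s) {
    size_t len = strlen(s);
    // The cast keeps bytes above 0x7F from turning into negative values
    // when plain char is signed.
    while (len > 0 && isspace((unsigned char)s[len - 1])) {
        s[--len] = '\0';
    }
    return len;
}
```

Counting down from the length, rather than walking a pointer backwards, means the loop never forms an address before the start of the buffer.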
Error Handling: Without Exceptions
C doesn’t have exception handling, so we need to design our error handling carefully. Here’s our approach:
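```c
// error_handler.c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

bool add_error(error_handler_t* eh, const char* filename,
               int line_number, error_type type, const char* message) {
    if (eh == NULL || filename == NULL || message == NULL) return false;

    // Grow the array when it is full; also handles a starting capacity of 0
    if (eh->count >= eh->capacity) {
        size_t new_capacity = eh->capacity ? eh->capacity * 2 : 8;
        error_message_t* new_errors = realloc(eh->errors,
                                new_capacity * sizeof(error_message_t));
        if (new_errors == NULL) return false;  // eh->errors is still valid
        eh->errors = new_errors;
        eh->capacity = new_capacity;
    }

    // Copy both strings so the caller's buffers can go away safely
    char* filename_copy = strdup(filename);
    char* message_copy = strdup(message);
    if (filename_copy == NULL || message_copy == NULL) {
        free(filename_copy);  // free(NULL) is a no-op
        free(message_copy);
        return false;
    }

    eh->errors[eh->count].filename = filename_copy;
    eh->errors[eh->count].line_number = line_number;
    eh->errors[eh->count].type = type;
    eh->errors[eh->count].message = message_copy;
    eh->count++;
    return true;
}
```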
Note how we:
- Use boolean return values to indicate success/failure
- Grow our error array dynamically
- Make copies of strings to prevent lifetime issues
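Because every stored error owns two string copies, the handler also needs a matching teardown. The post does not show the type definitions or the create and free functions, so everything below is an assumed sketch: the struct layouts are inferred from the fields `add_error` touches, the starting capacity of 8 is arbitrary, and `SYNTAX_ERROR` is a made-up second enum value.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef enum { SYMBOL_ERROR, SYNTAX_ERROR } error_type;  // SYNTAX_ERROR is hypothetical

typedef struct {
    char* filename;   // owned copy
    int line_number;
    error_type type;
    char* message;    // owned copy
} error_message_t;

typedef struct {
    error_message_t* errors;  // dynamically grown array
    size_t count;             // entries in use
    size_t capacity;          // entries allocated
} error_handler_t;

error_handler_t* create_error_handler(void) {
    error_handler_t* eh = malloc(sizeof(error_handler_t));
    if (eh == NULL) return NULL;

    eh->count = 0;
    eh->capacity = 8;  // assumed starting size
    eh->errors = malloc(eh->capacity * sizeof(error_message_t));
    if (eh->errors == NULL) {
        free(eh);
        return NULL;
    }
    return eh;
}

// Safe to call with NULL, which simply reports "no errors".
bool has_errors(const error_handler_t* eh) {
    return eh != NULL && eh->count > 0;
}

void free_error_handler(error_handler_t* eh) {
    if (eh == NULL) return;
    for (size_t i = 0; i < eh->count; i++) {
        free(eh->errors[i].filename);  // strings first...
        free(eh->errors[i].message);
    }
    free(eh->errors);                  // ...then the array...
    free(eh);                          // ...then the handler itself
}
```

Only the first `count` entries are walked during teardown, since slots beyond that were never initialized.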
Data Structure Implementation: Symbol Table
Let’s look at how we implemented our hash table-based symbol table:
```c
// symbol_table.c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

bool add_symbol(symbol_table_t* st, const char* name,
                uint16_t address, int line_number) {
    if (st == NULL || name == NULL) return false;

    uint32_t index = hash(name, st->size);

    // Check for duplicates
    symbol_node_t* node = st->buckets[index];
    while (node != NULL) {
        if (strcmp(node->entry.name, name) == 0)
            return false;  // Already exists
        node = node->next;
    }

    // Create new node
    symbol_node_t* new_node = malloc(sizeof(symbol_node_t));
    if (new_node == NULL) return false;

    // Make our own copy of the name
    new_node->entry.name = strdup(name);
    if (new_node->entry.name == NULL) {
        free(new_node);
        return false;
    }

    // Initialize other fields
    new_node->entry.address = address;
    new_node->entry.line_number = line_number;
    new_node->entry.is_defined = true;

    // Add to front of chain
    new_node->next = st->buckets[index];
    st->buckets[index] = new_node;
    return true;
}
```
Some important C-specific considerations here:
- Defensive Programming: We check all pointers before using them.
- String Ownership: We make our own copy of the name string using strdup.
- Cleanup on Failure: If any allocation fails, we clean up what we’ve already allocated.
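`add_symbol` relies on a `hash` function and on node and table types that the post does not show, and the tests call a `lookup_symbol` counterpart. The sketch below fills those gaps under stated assumptions: the struct layouts are inferred from the fields used above, the hash is djb2 (the project may well use something else), and the `lookup_symbol` signature is guessed from how the test calls it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char* name;
    uint16_t address;
    int line_number;
    bool is_defined;
} symbol_entry_t;

// Assumed layouts, inferred from the fields add_symbol touches.
// The node needs a struct tag because it points to its own type.
typedef struct symbol_node {
    symbol_entry_t entry;
    struct symbol_node* next;
} symbol_node_t;

typedef struct {
    symbol_node_t** buckets;
    size_t size;  // number of buckets; must be nonzero
} symbol_table_t;

// djb2 string hash (an assumption), reduced to a bucket index in [0, size).
uint32_t hash(const char* name, size_t size) {
    uint32_t h = 5381;
    for (const unsigned char* p = (const unsigned char*)name; *p != '\0'; p++) {
        h = h * 33 + *p;  // unsigned overflow wraps, which is well defined
    }
    return (uint32_t)(h % size);
}

// Walks one bucket's chain; on success *out_entry points into the table.
bool lookup_symbol(const symbol_table_t* st, const char* name,
                   symbol_entry_t** out_entry) {
    if (st == NULL || name == NULL || out_entry == NULL) return false;

    for (symbol_node_t* node = st->buckets[hash(name, st->size)];
         node != NULL; node = node->next) {
        if (strcmp(node->entry.name, name) == 0) {
            *out_entry = &node->entry;
            return true;
        }
    }
    return false;
}
```

Note that the returned entry is owned by the table: it stays valid only until the table is freed, and the caller must not free it.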
Testing in C: Unity Framework
Testing C code requires a different approach than in higher-level languages. We use the Unity testing framework:
```c
// test_symbol_table.c
#include "unity.h"

void test_add_symbol_valid(void) {
    symbol_table_t* st = create_symbol_table();
    TEST_ASSERT_NOT_NULL(st);

    bool ok = add_symbol(st, "LOOP", 0x3000, 1);
    TEST_ASSERT_TRUE(ok);

    symbol_entry_t* entry = NULL;  // Initialized so a failed lookup is detectable
    bool found = lookup_symbol(st, "LOOP", &entry);
    TEST_ASSERT_TRUE(found);
    TEST_ASSERT_NOT_NULL(entry);
    TEST_ASSERT_EQUAL_STRING("LOOP", entry->name);
    TEST_ASSERT_EQUAL_UINT16(0x3000, entry->address);

    free_symbol_table(st);
}
```
This test covers the happy path. Across the test suite, we also:
- Always test memory allocation failures
- Check for memory leaks
- Test edge cases explicitly
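Allocation failures are hard to test because `malloc` rarely fails on demand. One common approach, shown here as an assumption rather than as what the project does, is to route allocations through a replaceable function pointer so that a test can swap in an allocator that always fails. The names `xmalloc_hook`, `failing_malloc`, and `copy_label` are all hypothetical.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical allocation hook: production code leaves it pointing at malloc.
static void* (*xmalloc_hook)(size_t) = malloc;

// Test double that simulates an out-of-memory condition.
static void* failing_malloc(size_t size) {
    (void)size;
    return NULL;
}

// Example function under test: duplicates a label through the hook.
char* copy_label(const char* label) {
    if (label == NULL) return NULL;

    size_t len = strlen(label) + 1;  // include the terminating '\0'
    char* copy = xmalloc_hook(len);
    if (copy == NULL) return NULL;   // the failure path we want to exercise

    memcpy(copy, label, len);
    return copy;
}
```

A test then sets `xmalloc_hook = failing_malloc;`, checks that `copy_label` returns NULL without crashing, and restores the hook afterwards. For the leak checks, running the test binary under a tool such as Valgrind or AddressSanitizer is a typical option.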
Putting It All Together
All these data structures work together in our main assembler loop:
```c
// main.c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <source file>\n", argv[0]);
        return EXIT_FAILURE;
    }

    int status = EXIT_FAILURE;
    uint16_t current_address = 0;  // placeholder; updated as lines are processed

    file_handler_t* fh = create_file_handler(argv[1]);
    symbol_table_t* st = create_symbol_table();
    error_handler_t* eh = create_error_handler();
    if (!fh || !st || !eh) {
        // Handle initialization failure
        goto cleanup;
    }

    // First pass: collect symbols
    source_line_t line = {0};
    while (read_next_line(fh, &line)) {
        if (line.label) {
            if (!add_symbol(st, line.label, current_address,
                            line.line_number)) {
                add_error(eh, argv[1], line.line_number,
                          SYMBOL_ERROR, "Duplicate symbol");
                goto cleanup;
            }
        }
        // Process rest of line...
    }
    // More processing...

    // Decide the exit status while eh is still alive
    status = has_errors(eh) ? EXIT_FAILURE : EXIT_SUCCESS;

cleanup:
    // Each free_* function is expected to accept NULL
    free_file_handler(fh);
    free_symbol_table(st);
    free_error_handler(eh);
    return status;
}
```
Notice the use of goto cleanup for error handling. While goto is often considered harmful, it’s a common and accepted pattern in C for cleanup in error cases.
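The pattern is easier to see in isolation. The sketch below is a made-up example, not project code: `count_lines` acquires two resources (a file and a buffer), and every exit path, successful or not, runs through a single cleanup label. The result is decided before anything is released, so nothing is read after it has been freed.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { CHUNK_SIZE = 256 };

// Hypothetical example: counts newline-terminated lines in a file.
// Returns true on success and stores the count in *out_lines.
bool count_lines(const char* path, size_t* out_lines) {
    bool ok = false;       // pessimistic default; set to true only at the end
    FILE* file = NULL;     // resources start out NULL so cleanup is always safe
    char* buffer = NULL;
    size_t lines = 0;

    if (path == NULL || out_lines == NULL) goto cleanup;

    file = fopen(path, "r");
    if (file == NULL) goto cleanup;

    buffer = malloc(CHUNK_SIZE);
    if (buffer == NULL) goto cleanup;

    while (fgets(buffer, CHUNK_SIZE, file) != NULL) {
        if (strchr(buffer, '\n') != NULL) lines++;
    }
    if (ferror(file)) goto cleanup;

    *out_lines = lines;
    ok = true;

cleanup:
    free(buffer);                    // free(NULL) is a no-op
    if (file != NULL) fclose(file);  // fclose(NULL) is not allowed, so check
    return ok;
}
```

Initializing every resource to NULL up front is what makes the single label work: cleanup can run at any point without knowing how far the function got.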
Conclusion: C Is Beautiful, at a Cost
While implementing data structures in C requires more careful attention to detail than in higher-level languages, it offers several advantages:
- Complete Control: We know exactly how our memory is being used.
- Performance: No hidden allocations or abstractions.
- Portability: C code can run almost anywhere, because compilers are comparatively simple, memory requirements are minimal, and almost every platform has a toolchain. The caveat is that “write once, run everywhere” does not come for free.
Let’s explore that last point about portability. Despite C’s excellent portability, developers often encounter platform-specific challenges:
- Integer Sizes: int is only guaranteed to be at least 16 bits, and long is 32 bits on some 64-bit platforms (such as Windows) but 64 bits on others (such as Linux). Always use fixed-width types (like uint32_t) when size matters.
```c
#include <stdint.h>

// Don't do this
int potentially_different_size;  // Size varies by platform

// Do this instead
uint32_t guaranteed_32_bits;     // Exactly 32 bits wherever the type exists
```
- Byte Ordering: Different platforms handle byte ordering differently:
```c
#include <stdint.h>

// This might work differently on big-endian vs little-endian systems:
// raw_bytes[0] is 0x34 on a little-endian host and 0x12 on a big-endian one
uint16_t value = 0x1234;
uint8_t* raw_bytes = (uint8_t*)&value;

// Better approach: explicit byte handling (inside a function)
uint8_t ordered_bytes[2] = {
    (uint8_t)((value >> 8) & 0xFF),  // Most significant byte
    (uint8_t)(value & 0xFF)          // Least significant byte
};
```
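- Alignment: Some platforms tolerate misaligned memory accesses, while others fault on them:
```c
#include <stdalign.h>  // alignas and alignof (C11)

// Might work on x86, fail on ARM
char buffer[sizeof(int)];
int* potentially_misaligned = (int*)buffer;

// Better: use aligned types or explicit alignment
alignas(alignof(int)) char aligned_buffer[sizeof(int)];
```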
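The three pitfalls above tend to show up together when an assembler writes its output file. The sketch below combines the corresponding defenses: a compile-time check on a size assumption, explicit byte ordering, and `memcpy` instead of pointer casts. It assumes the output format stores 16-bit words most-significant byte first, and the helper names (`write_u16_be`, `read_u16_be`, `load_u32`) are hypothetical.

```c
#include <assert.h>   // static_assert (C11)
#include <limits.h>   // CHAR_BIT
#include <stdint.h>
#include <string.h>

// Fail at compile time if a platform assumption does not hold.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// Store a 16-bit word most-significant byte first, whatever the host order.
void write_u16_be(uint8_t out[2], uint16_t value) {
    out[0] = (uint8_t)(value >> 8);
    out[1] = (uint8_t)(value & 0xFF);
}

// Rebuild the word from two bytes using shifts, not a pointer cast,
// so neither host byte order nor alignment matters.
uint16_t read_u16_be(const uint8_t in[2]) {
    return (uint16_t)(((uint16_t)in[0] << 8) | in[1]);
}

// Read a host-order uint32_t from an arbitrary, possibly misaligned address.
// memcpy is the portable alternative to casting the pointer.
uint32_t load_u32(const void* src) {
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}
```

Compilers typically turn a fixed-size `memcpy` like this into a single load where the hardware allows it, so the portable version usually costs nothing.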
The key to successful C programming (as far as I have found so far) is therefore to:
- Design your interfaces carefully
- Be consistent with memory management
- Use clear patterns for error handling
- Test thoroughly, including memory management
- Be aware of platform-specific assumptions
- Document any platform dependencies
- Use platform-independent types and constructs when possible
In our next post, we’ll look at how we use these data structures to implement the actual instruction encoding phase of our assembler, keeping in mind these cross-platform considerations.
Lessons learned: when working in C, assume nothing, check everything, always clean up after ourselves, and never take platform compatibility for granted!