# File Compression in C: A Detailed Guide on Run-Length Encoding

Today we will be diving into the fascinating world of file compression using a Run-Length Encoding (RLE) algorithm in C.

Before we start coding, let's clarify what Run-Length Encoding (RLE) is. It's a very simple form of data compression where consecutive elements that are the same are replaced with that particular element followed by the count of that element. For instance, the string "AAAABBBCCD" would be encoded as "A4B3C2D1". This form of compression works well on data with many consecutive repeated characters.

Now, let's get started with our program, building it step by step.

# Step 1: Include Necessary Libraries

The first step in our code is to include the necessary libraries. We have fcntl.h for file control, stdio.h and stdlib.h for standard input/output and standard library functions respectively, sys/mman.h and sys/stat.h for memory management and file statistics, and unistd.h for miscellaneous symbolic constants and types.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

# Step 2: Define the State Structure

Next, we define a structure State to keep track of the previous character and its count. This structure will be used to store the current state of the encoding process.

typedef struct {
    unsigned char previous_char;
    unsigned char count;
} State;

# Step 3: Encode Data Function

Now we get into the meat of our program - the encode_data function. This function reads in the data, and for each character, it checks if it's the same as the previous one. If it is, and the count is less than 255 (the maximum value that an unsigned char can hold), it increases the count. If it's a different character, it outputs the previous character and its count, then updates the previous character and resets the count to 1.

void encode_data(const unsigned char *data, size_t size, State *state) {
    for (size_t i = 0; i < size; i++) {
        if (data[i] == state->previous_char && state->count < 255) {
            state->count++;
        } else {
            fwrite(&(state->previous_char), 1, 1, stdout);
            fwrite(&(state->count), 1, 1, stdout);
            state->previous_char = data[i];
            state->count = 1;
        }
    }
}

# Step 4: The Main Function

We're now ready to dive into our main function. First, we initialize our State structure with zeros, indicating that we haven't encountered any characters yet.

int main() {
    State state = {0, 0};

We then open our input file with open. If the file fails to open, we print an error message and return 1 to indicate a failure. Here, our file is named "test.txt".

int fd = open("test.txt", O_RDONLY);
if (fd == -1) {
    perror("Error opening input file");
    return 1;
}

We use fstat to get the size of the file. If fstat fails, we print an error message and return 1.

struct stat sb;
if (fstat(fd, &sb) == -1) {
    perror("Error getting file size");
    return 1;
}

# Step 5: Memory Mapping

We use the mmap function to map the file into memory. This will allow us to process the file as if it were an array in memory. If mmap fails, we print an error message and return 1.

unsigned char *addr = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (addr == MAP_FAILED) {
    perror("Error mapping file into memory");
    return 1;
}

# Step 6: Encoding the Data

Now we call our encode_data function to encode the file data. We pass it the memory-mapped file data, the file size, and the address of our state variable.

encode_data(addr, sb.st_size, &state);

# Step 7: Cleaning Up

Once we're done with the file data, we clean up by un-mapping the file from memory using munmap, and closing the file descriptor with close.

munmap(addr, sb.st_size);
close(fd);

# Step 8: Finalizing the Encoding

As a final step, we output the last character and count, in case they weren't already output in the encode_data function. This is to ensure we don't miss the last sequence in the file.

fwrite(&(state.previous_char), 1, 1, stdout);
fwrite(&(state.count), 1, 1, stdout);

Finally, we return 0 to indicate successful execution of the program.

    return 0;
}

That's it! We've just built a Run-Length Encoding (RLE) algorithm in C. This is a simple yet powerful technique for data compression, especially for data with many consecutive repeated characters.

I hope this tutorial has been helpful to you.

← Creating a RESTful API with Node.js and Express Exploring Functional Programming in Haskell →