C's ctype.h Signed Char Trap: The isspace That Crashes on Café

2026-05-29

This function counts whitespace characters in a string. It works for ASCII input. Run it against a string with any non-ASCII byte and it may crash, hang, or return nonsense — and the compiler will not warn you.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

int count_whitespace(const char *s) {
    int count = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (isspace(s[i])) {
            count++;
        }
    }
    return count;
}

int main(void) {
    /* UTF-8: 'é' is the two bytes 0xC3 0xA9 */
    const char *text = "café au lait";
    printf("Whitespace count: %d\n", count_whitespace(text));
    return 0;
}

The Bug

Every function in <ctype.h> (isspace, isalpha, toupper, all of them) takes an int, but the C standard (§7.4) requires the argument's value to be either EOF or representable as unsigned char. Pass anything else and you get undefined behavior.
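That domain rule is small enough to write down as a predicate. The sketch below is only an illustration; ctype_arg_ok is a hypothetical helper name, not something any C library provides.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if c is a value the <ctype.h>
   functions are defined for (EOF, or 0..UCHAR_MAX), otherwise 0. */
static int ctype_arg_ok(int c) {
    return c == EOF || (c >= 0 && c <= UCHAR_MAX);
}
```

Values returned by getchar() are already in this domain (an unsigned char converted to int, or EOF), which is why they can be passed straight through. Bytes read out of a char array carry no such guarantee.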

On most platforms (x86, x86-64), char is signed. So when you do isspace(s[i]) and s[i] holds 0xC3 (the first byte of UTF-8 é), the char sign-extends to the int value -61. That is neither EOF (typically -1) nor a valid unsigned char.
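To see the two widenings side by side, here is a minimal sketch. The helper names are made up for this example, and signed char is spelled out explicitly so the result does not depend on whether plain char happens to be signed on the build machine.

```c
/* Widen a byte the way a signed 8-bit char is widened to int.
   Converting a value above 127 to signed char is
   implementation-defined before C23; two's-complement targets
   wrap it, which is the behavior assumed here. */
static int widen_as_signed(unsigned char byte) {
    return (int)(signed char)byte;
}

/* Widen the same byte through unsigned char: always 0..255. */
static int widen_as_unsigned(unsigned char byte) {
    return (int)byte;
}
```

Under those assumptions, widen_as_signed(0xC3) gives -61 and widen_as_unsigned(0xC3) gives 195. ASCII bytes come out identical either way, which is why ASCII-only tests never notice.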

Why does this matter in practice? Most glibc-style implementations define these functions as macros that index into a lookup table:

#define isspace(c) (__ctype_b_loc()[0][(int)(c)] & _ISspace)

With c == -61, you read __ctype_b_loc()[0][-61], a negative index. glibc pads the table with 128 entries below index 0, which masks the bug for values down to -128, so you usually get a wrong-but-survivable answer. Other C libraries make no such promise: the Windows CRT asserts in debug builds, and a table-based implementation without padding (as on some embedded toolchains) can return garbage or crash. Worst of all, this is undefined behavior by the letter of the standard, so a future compiler or libc upgrade can break code that "worked" for a decade.
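The padding trick can be pictured with a toy table. This is a sketch under stated assumptions, not glibc's actual data: the names toy_storage, toy_table, and toy_isspace are invented, the entries are single bytes, and the padding is simply zero-filled.

```c
enum { TOY_SPACE = 1 };

/* 128 padding entries for indices -128..-1, then 256 real entries. */
static unsigned char toy_storage[128 + 256];

/* Returns a pointer to the entry for index 0, filling in the six
   C-locale whitespace characters on first use. */
static unsigned char *toy_table(void) {
    static int initialized = 0;
    unsigned char *t = toy_storage + 128;
    if (!initialized) {
        t[' '] = t['\t'] = t['\n'] = TOY_SPACE;
        t['\v'] = t['\f'] = t['\r'] = TOY_SPACE;
        initialized = 1;
    }
    return t;
}

/* In bounds only for -128 <= c <= 255. */
static int toy_isspace(int c) {
    return toy_table()[c] & TOY_SPACE;
}
```

Here toy_isspace(-61) lands in the padding and quietly returns 0. Drop the 128 padding entries and the same call reads memory before the start of the array.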

The fix is a single cast to unsigned char, applied before the promotion to int:

int count_whitespace(const char *s) {
    int count = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (isspace((unsigned char)s[i])) {
            count++;
        }
    }
    return count;
}

The cast turns 0xC3 into 195 instead of -61. Now the argument is a valid unsigned char value, the lookup is in-bounds, and the behavior is defined (it returns 0 — 0xC3 is not whitespace in the C locale).

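Three things make this trap especially mean: the compiler does not warn, ASCII-only tests pass, and glibc's table padding usually hides the mistake.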

The same trap applies to strtoul-style parsing loops, custom tokenizers, and any code that walks a char * and asks "is this byte a digit/letter/space?" If your input could ever contain a byte with the high bit set — UTF-8, Latin-1, binary data, a corrupted file — wrap the argument.
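One way to make the cast hard to forget is to route every call through small wrappers that take a char and do the conversion themselves. The following is a minimal sketch; is_space_byte, is_digit_byte, and upcase_in_place are hypothetical names chosen for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Wrappers that accept a plain char and cast before calling <ctype.h>. */
static int is_space_byte(char c) {
    return isspace((unsigned char)c) != 0;
}

static int is_digit_byte(char c) {
    return isdigit((unsigned char)c) != 0;
}

/* Uppercase a string in place. toupper needs the same cast on the way
   in; its int result is converted back to char on the way out. */
static void upcase_in_place(char *s) {
    for (size_t i = 0; s[i] != '\0'; i++) {
        s[i] = (char)toupper((unsigned char)s[i]);
    }
}
```

In the default "C" locale, upcase_in_place leaves the two bytes of é untouched and uppercases only the ASCII letters around them.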

Key Takeaway: Every <ctype.h> call on a char needs an (unsigned char) cast — otherwise non-ASCII bytes invoke undefined behavior via sign-extension into a negative table index.
