memcpy Overlap Trap: When SIMD Eats Your String
2026-05-13
This function strips every occurrence of a character from a string in place. It compiled cleanly, passed unit tests on the developer's laptop, and shipped to a fleet of newer servers — where it started producing garbled output. What's wrong?
#include <string.h>
#include <stdio.h>

/* Remove every occurrence of c from s, in place. */
void remove_char(char *s, char c) {
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++) {
        if (s[i] == c) {
            /* shift the tail left by one, including the '\0' */
            memcpy(&s[i], &s[i + 1], len - i);
            len--;
            i--;
        }
    }
}

int main(void) {
    char buf[] = "hello world";
    remove_char(buf, 'l');
    printf("[%s]\n", buf); /* expected: [heo word] */
    return 0;
}
The source and destination passed to memcpy overlap: we're copying &s[i+1] down to &s[i], a one-byte slide, so the two regions share all but one byte. C11 §7.24.2.1 is explicit: if copying takes place between objects that overlap, the behavior is undefined. The restrict qualifiers on memcpy's pointer parameters encode the same promise. Not "implementation-defined," not "works if you're careful": undefined.
For years this code appeared to work because older memcpy implementations, glibc's included, generally copied front to back, which happens to do the right thing when sliding data down by one byte. Developers built production code on this accident. Then glibc grew per-CPU optimized variants (SSE, AVX, ERMS-tuned rep movsb) that move 16- or 32-byte chunks and are free to pick any order: copy back to front, or load the last chunk before storing the first. Once the order changes, a store for one chunk can overwrite source bytes that another chunk has not loaded yet. The output comes out garbled: bytes dropped or duplicated at chunk boundaries, or, for other offsets and directions, a block of source bytes repeated verbatim.
It gets worse. The compiler is allowed to assume the regions don't overlap, so GCC and Clang may inline the call, vectorize it, or reorder its loads and stores under that assumption. Tools like AddressSanitizer (-fsanitize=address) and Valgrind's memcheck will flag the overlap at run time, but only if you run them.
The fix is the function that exists precisely for this case: memmove. It detects overlap and chooses copy direction (or buffers internally) to produce the result as if the source were copied to a temporary first.
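void remove_char(char *s, char c) {
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++) {
        if (s[i] == c) {
            /* shift the tail left by one, including the '\0';
               memmove is defined for overlapping regions */
            memmove(&s[i], &s[i + 1], len - i);
            len--;
            /* re-examine index i on the next pass; at i == 0 the
               size_t wraps and the loop's i++ wraps it back to 0 */
            i--;
        }
    }
}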
Two related traps lurk nearby. First, strcpy(dst, dst + 1) has the same defect — the C library's string functions also forbid overlap unless their name says otherwise (memmove is the lone exception in the classic set). Second, this whole algorithm is O(n²); a single-pass read/write cursor would be both correct and faster:
void remove_char(char *s, char c) {
    char *w = s;
    for (char *r = s; *r; r++)
        if (*r != c) *w++ = *r;
    *w = '\0';
}
The two-cursor version is overlap-safe by construction: w never gets ahead of r, so each byte is read before it's overwritten. No memcpy, no memmove, no UB.
Rule of thumb: if the source and destination of a mem-family call could possibly touch the same bytes, reach for memmove. The performance gap on modern glibc is negligible; the correctness gap is infinite.
memcpy demands non-overlapping regions — when source and destination might touch, use memmove, because "it worked yesterday" is not a guarantee the optimizer respects.
