Monday, July 21, 2008

Single-stepping a REP STOS...

I've recently made a startling discovery that explains a LOT about how kmemcheck has been working (or not) on one of my machines.

Yesterday I added the kmemcheck hooks into the DMA API, which means that we should now not give any false-positive errors about DMA-able memory. The patch was essentially a one-liner, since the rest of the DMA API deals with whole pages only (and they come straight from the page allocator, so they're not tracked anyway).

But I was still getting a huge number of errors from sysfs code. This puzzled me for many hours. The code was apparently okay. In fact, the allocation in question was explicitly being zeroed out (it came from kzalloc()), so it shouldn't have been possible to find any use of uninitialized memory in it at all. I added some code to dump the memory along with the shadow dump, and it showed that the array was indeed being zeroed, but not marked initialized.

Because memset() is such a common operation (and needs to be fast), I've written a custom memset() function that checks whether the target memory is being tracked or not; if it is, we don't have to take any page fault at all, but we can simply zero the memory and set the initialization status to "initialized" at the same time.
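To make the idea concrete, here is a minimal userspace sketch of such a memset(). It is not the kmemcheck implementation: the tracked_region structure, the shadow state names and tracked_memset() itself are all made up for illustration, and the "shadow" is simply one byte per data byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical shadow states; the names are illustrative, not kmemcheck's. */
enum shadow_state { SHADOW_UNINITIALIZED = 0, SHADOW_INITIALIZED = 1 };

/* A toy tracked region: data bytes plus one shadow byte per data byte. */
struct tracked_region {
    unsigned char *data;
    unsigned char *shadow;
    size_t size;
};

/* Returns nonzero if [p, p + n) lies entirely inside the tracked region. */
static int region_contains(const struct tracked_region *r,
                           const void *p, size_t n)
{
    uintptr_t start = (uintptr_t)r->data;
    uintptr_t addr = (uintptr_t)p;

    return addr >= start && n <= r->size && addr - start <= r->size - n;
}

/*
 * memset() that updates the shadow in the same call. For tracked memory
 * the shadow bytes are set to "initialized" directly, so no fault is
 * needed to notice the write. Untracked memory is just filled.
 */
static void *tracked_memset(struct tracked_region *r, void *s, int c, size_t n)
{
    if (region_contains(r, s, n)) {
        size_t off = (size_t)((unsigned char *)s - r->data);

        memset(r->shadow + off, SHADOW_INITIALIZED, n);
    }
    return memset(s, c, n);
}
```

Calling tracked_memset(&r, r.data + 4, 0, 8) on a 16-byte region zeroes bytes 4 to 11 and marks exactly those eight shadow bytes as initialized; a pointer outside the region is filled without touching the shadow.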

I suspected my custom memset() of being in error. So I commented it out. And the result was even more startling; I now got a kmemcheck error on just about every memory access...

Something had to be wrong with the built-in memset. This is its definition:

static inline void * __memset_generic(void * s, char c,size_t count)
{
    int d0, d1;
    __asm__ __volatile__(
        "rep\n\t"
        "stosb"
        : "=&c" (d0), "=&D" (d1)
        :"a" (c),"1" (s),"0" (count)
        :"memory");
    return s;
}


(Source code taken from the Linux Kernel. This code is licensed under the GNU GPL version 2.)

Taking a new look at the first kmemcheck error reported, I got another clue. The kzalloc() allocation was uninitialized except for the very first byte!

I wrote a short program for userspace which called the above function and single-stepped it with gdb on my two machines, one P4 3.0 GHz, and one Pentium Dual-Core 1.47 GHz. To my great surprise, the P4 skipped the whole REP STOS construct in one go, while the Dual-Core got a trap for each repetition of the STOS.
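One plausible reading of why this leaves only the first byte marked can be written down as a toy model. Everything in it is an assumption made for illustration: the tracker is taken to notice a store only when it faults, and to leave the page accessible until the next single-step trap arrives. The function model_rep_stos() and its parameters are hypothetical, and the model ignores interrupts that might split a long REP.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy model, not kmemcheck code. Simulates `count` byte stores from one
 * REP STOS and returns how many shadow bytes end up marked initialized.
 *
 * trap_per_iteration != 0: a single-step trap follows every repetition.
 * trap_per_iteration == 0: one trap arrives after the whole REP is done.
 */
static size_t model_rep_stos(unsigned char *shadow, size_t count,
                             int trap_per_iteration)
{
    size_t marked = 0;
    int page_visible = 0;   /* page starts hidden, so the first store faults */

    memset(shadow, 0, count);
    for (size_t i = 0; i < count; i++) {
        if (!page_visible) {
            /* Fault: the tracker marks this byte and shows the page. */
            shadow[i] = 1;
            marked++;
            page_visible = 1;
        }
        /* The store itself would happen here, with the page visible. */
        if (trap_per_iteration || i == count - 1) {
            /* Single-step trap: the tracker hides the page again. */
            page_visible = 0;
        }
    }
    return marked;
}
```

With 1000 stores the model marks all 1000 shadow bytes when every repetition traps, and only the first byte when the REP completes in a single step, which is the pattern seen in the kzalloc() report.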

This is very grave news for kmemcheck. I had tested it on my Dual-Core earlier and assumed that all CPUs would behave the same way; I couldn't find a word about single-stepping REP instructions in Intel's Software Developer's Manuals, anyway.

There are basically two solutions to the problem:
  1. Black-list the models that don't single-step each repetition
  2. Emulate the instruction (i.e. no single-stepping at all for the REP instructions)
And neither of them is particularly attractive.

But I have started working on instruction emulation to see how far I can get.
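To give a rough idea of what emulation means here, this is a small userspace sketch of a software REP STOSB. It is not the code I am working on: the toy_regs structure, the store_hook callback and emulate_rep_stosb() are hypothetical names, and a real emulator would also have to decode the instruction, honour the operand and address sizes, and update the saved register state of the faulting context.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register snapshot; the field names are illustrative only. */
struct toy_regs {
    uint8_t  al;    /* byte to store (AL) */
    uint8_t *edi;   /* destination pointer (EDI) */
    size_t   ecx;   /* remaining repetitions (ECX) */
    int      df;    /* direction flag: 0 = ascending, 1 = descending */
};

/* Called once per emulated store, before the byte is written. */
typedef void (*store_hook)(uint8_t *addr, void *ctx);

/*
 * Perform the stores of a REP STOSB one at a time in software, so the
 * caller sees every repetition no matter how the CPU would single-step
 * the real instruction.
 */
static void emulate_rep_stosb(struct toy_regs *regs, store_hook hook, void *ctx)
{
    while (regs->ecx != 0) {
        if (hook)
            hook(regs->edi, ctx);
        *regs->edi = regs->al;
        regs->edi += regs->df ? -1 : 1;
        regs->ecx--;
    }
}

/* Example hook: counts the stores; ctx points to a size_t counter. */
static void count_store(uint8_t *addr, void *ctx)
{
    (void)addr;
    (*(size_t *)ctx)++;
}
```

Running it with ecx set to 6 and count_store as the hook reports six stores, leaves ecx at zero and advances edi by six bytes. In a memory checker the hook is where the shadow state for each byte would be updated.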

I've also contacted Intel and asked for more information about the peculiarity.

I only hope that there are not more unpleasant surprises like this...


Vegard


PS: I have written a (32-bit) userspace program that backs up my suspicions. Compare these outputs:

$ ./a.out
processor : 7
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5320 @ 1.86GHz

Counted 1000 REP STOS instructions (expected 1000).

$ ./a.out
processor : 1
cpu family : 15
model : 6
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz

Counted 55 REP STOS instructions (expected 1000).
