Question

I am trying to figure out how many clock cycles or total instructions it takes to access a pointer in C. I dont think I know how to figure out for example, p->x = d->a + f->b

i would assume two loads per pointer, just guessing that there would be a load for the pointer, and a load for the value. So in this operations, the pointer resolution would be a much larger factor than the actual addition, as far as trying to speed this code up, right?

This may depend on the compiler and architecture implemented, but am I on the right track?

I have seen some code where each value used in say, 3 additions, came from a

 f2->sum = p1->p2->p3->x + p1->p2->p3->a + p1->p2->p3->m

type of structure, and I am trying to define how bad this is

Answer 1

This depends on the architecture at hand.

Some architectures can reference/dereference memory for an instruction without first loading it into a register, others don t. Some architectures don t have the notion of instructions that compute the offsets for you to dereference and will make you load the memory address, add your offset to it, and then allow you to dereference the memory location. I m sure there are more variances chip-to-chip.

Once you get past these, each instruction takes varying amount of time depending on the architecture as well. To be honest though, it s an overhead that is very, very minimal.

For your immediate question of dereferencing a chain of items, the slowness will come in the fact that there is likely a poor locality of reference the farther you go in a dereferencing chain. This means more cache misses, which means more hits to main memory (or disk!) to get the data. Main memory is very slow compared to the CPU.

Answer 2

Some IDEs like VisualStudio allow you to view the assembly generated along with the source code.

How to view the assembly behind the code using Visual C++?

Then you can see for your exact architecture and implementation what it looks like.

If you are using GDB (linux, mac) use disassemble

(gdb) disas 0x32c4 0x32e4
Dump of assembler code from 0x32c4 to 0x32e4:
0x32c4 <main+204>:      addil 0,dp
0x32c8 <main+208>:      ldw 0x22c(sr0,r1),r26
0x32cc <main+212>:      ldil 0x3000,r31
0x32d0 <main+216>:      ble 0x3f8(sr4,r31)
0x32d4 <main+220>:      ldo 0(r31),rp
0x32d8 <main+224>:      addil -0x800,dp
0x32dc <main+228>:      ldo 0x588(r1),r26
0x32e0 <main+232>:      ldil 0x3000,r31
End of assembler dump.

Answer 3

Depends what you are doing, a trivial pointer dereference y = *z; where

int x = 1;
int* z = &x;
int y;

might assemble to something like this on the x86:

mov eax, [z]
mov eax, [eax]
mov [y], eax

and y = x would still take a memory dereference:

mov eax, [x]
mov [y], eax

Mov instructions to memory take about 2-4 cycles IIRC.

Although, if you are loading memory from completely random locations, you will be causing a lot of page faults, resulting in hundreds of clock cycles being wasted.

Answer 4

Where it can, the compiler will remove that overhead for you by keeping repeatedly-used base locations in a register (eg. p1->p2->p3 in your example).

However, sometimes the compiler can t determine which pointers might alias other pointers used within your function - which means that it has to fall back to a very conservative position, and reload values from pointers frequently.

This is where C99 s restrict keyword can help. It lets you inform the compiler when certain pointers are never aliased by other pointers in the scope of the function, which consquently can improve the optimisation.

For example, take this function:

struct xyz {
    int val1;
    int val2;
    int val3;
};

struct abc {
    struct xyz *p2;
};

int foo(struct abc *p1)
{
    int sum;

    sum = p1->p2->val1 + p1->p2->val2 + p1->p2->val3;

    return sum;
}

Under gcc 4.3.2 with optimisation level -O1, it compiles to this x86 code:

foo:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %eax
    movl    (%eax), %edx
    movl    4(%edx), %eax
    addl    (%edx), %eax
    addl    8(%edx), %eax
    popl    %ebp
    ret

As you can see, it only deferences p1 once - it keeps the value of p1->p2 in the %edx register and uses it three times to fetch the three values from that structure.

友情链接