ARM C optimizations
Written by JLangbridge   
Wednesday, 07 October 2009 07:56

Whilst you can get away with just about anything on a desktop or server system most of the time, embedded systems programming is sometimes thought of as an art. Optimized code might take longer to write, but the benefits are huge: the system might cost less because you don't need a high-end processor, or you might gain a strategic advantage because your system runs longer on the same charge.

 

Timed speedups

 

The best way to show what optimizing can do is with a very simple example. After each change, I recompiled the code and noted both its size and the time it took to run.
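The timings quoted in this article came from the development board. As a rough sketch of how such a measurement could be taken, the code below assumes a hosted C environment where the standard clock() function is available; on a bare-metal target you would more likely read a hardware timer. The names time_routine, dummy_work and calls_made are hypothetical and not part of the original project.

```c
#include <time.h>

/* Hypothetical counter so we can see that the routine really ran. */
static volatile unsigned long calls_made = 0;

/* Hypothetical stand-in for the routine under test. */
static void dummy_work(void)
{
    calls_made++;
}

/*
 * Runs fn() `reps` times and returns the elapsed processor time in
 * clock ticks, as reported by the standard clock() function.
 * Divide the result by CLOCKS_PER_SEC to get seconds.
 */
static clock_t time_routine(void (*fn)(void), unsigned long reps)
{
    clock_t start = clock();
    unsigned long n;

    for (n = reps; n != 0; n--) {
        fn();
    }
    return clock() - start;
}
```

Running the routine many times and dividing the total by the repetition count helps when a single call is shorter than the clock's resolution.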

Everything around me is measured in gigahertz: the two PCs at my desk, the netbook I bring to work, my home PC... The development board in front of me runs at 400MHz. Embedded systems cannot scale like PCs; their processing power is limited, so you have to be careful. There are a few tricks that can drastically speed up the system. It was also time for me to see all the bad habits I had picked up whilst developing for x86 and x86_64 platforms.

Let's start off with a basic routine, a simple loop:

 

void loopit(void)
{
    u16 i; //Internal variable
    iGlobal = 0; //Global variable

    //16 bit index incrementation
    for (i = 0; i < 16; i++)
    {
        iGlobal++;
    }
}

 

This is extremely simple: it loops 16 times, incrementing a global variable each time. Compiled, transferred, run. The application code comes in at 24 bytes, and on the dev platform it runs in 138µs. Most people would think that is great, but embedded engineers go pale at this. There is nothing strictly wrong with the code, it is perfectly good C, but there are things you can do. ARM systems, and indeed a lot of systems, do better when they count down to zero. The reason is simple: an arithmetic instruction can set the processor's condition flags as a side effect, so its result is compared to zero for free. On this platform, the flag in question is Z in the CPSR register. As written, the end of each loop needs a separate instruction to compare i with 16. If we count down to zero instead, the decrement itself sets Z and the branch can test it directly, saving cycles. So, let's make a quick change and count down to zero:

 

void loopit(void)
{
    u16 i; //Internal variable
    iGlobal = 0; //Global variable

    //16 bit index decrementation
    for (i = 16; i != 0; i--)
    {
        iGlobal++;
    }
}


The size of the code remains unchanged at 24 bytes, but execution is faster: 124µs. That is a bit of a speed gain, but there is still a lot we can do. The code here uses a 16-bit variable, presumably to save space. Our loop only runs 16 times, so why bother with a 32-bit variable? 16 bits should do. It does, but it isn't a good idea. This particular ARM core is natively 32-bit, and using 16-bit values costs processor time, since the processor has to widen the 16-bit variable to 32 bits, work with it, then narrow the result back to 16 bits. Working with the native size helps. So let's turn that into a 32-bit variable:

void loopit(void)
{
    u32 i; //Internal variable
    iGlobal = 0; //Global variable
 
    //32 bit index decrementation
    for (i = 16; i != 0; i--)
    {
        iGlobal++;
    }
}
 
Now, debugging, we notice something. The size has gone down to 20 bytes, since fewer instructions are needed. Execution time has also gone down to 115µs, again because there are fewer instructions to execute. The joys of optimizing! But we aren't done yet. That global variable is a nightmare: every time we loop, the processor needs to access RAM to change the variable, and that uses up valuable time. So, let's define a local variable, work on that, and at the end of the loop copy it back to the global variable:
 
void loopit(void)
{
    u32 i; //Internal variable
    u32 j = 0; //Local counter, starts at zero
    iGlobal = 0; //Global variable
 
    //32 bit index decrementation
    for (i = 16; i != 0; i--)
    {
        j++;
    }
    iGlobal = j; //Copy the local variable's value to the global variable
}

We haven't done a lot here; all we have done is declare a new variable and use that in the loop instead. Once again, there is no size change: there are new instructions for working on a register, but we no longer have the instructions that access RAM. Having our variable in a register is a considerable speed boost; our execution time is now down to 52µs. But we can still do better... The loop itself creates a lot of overhead, and where possible, unrolling it can help:
 
void loopit(void)
{
    u32 i; //Internal variable
    u32 j = 0; //Local counter, starts at zero
    iGlobal = 0; //Global variable
 
    //32 bit index decrementation
    for (i = 4; i != 0; i--)
    {
        j++; j++; j++; j++;
    }
    iGlobal = j; //Copy the local variable's value to the global variable
}
 
This time, we are only looping 4 times instead of 16, and doing 4 times the work inside the loop. This might sound strange, but it saves considerable overhead; looping and branching take up valuable cycles. With this new routine, the size has gone up slightly to 22 bytes, but the execution time is down to 26µs. There wasn't anything wrong with the original code, but there are always ways to optimize, and time should be set aside for optimization, especially on embedded systems. Nor is a powerful platform an excuse for carelessness: just because processors get faster and faster is no reason to waste valuable cycles.
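The loop above unrolls cleanly because 16 is a multiple of 4. As a sketch of how the same three ideas (count down to zero, keep the running value in a local, unroll by four) might be applied when the count is not known in advance, the hypothetical function below sums an array and handles the leftover elements in a second loop. The name sum_unrolled and the use of uint32_t from <stdint.h> are assumptions for illustration; the article's own code uses a u32 type whose definition is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sums `count` 32-bit values. The main loop is unrolled four times
 * and both loops count down to zero; the total lives in a local
 * variable until the function returns.
 */
uint32_t sum_unrolled(const uint32_t *data, size_t count)
{
    uint32_t total = 0;         /* local accumulator */
    size_t blocks = count / 4;  /* number of full groups of four */
    size_t rest = count % 4;    /* 0 to 3 leftover elements */

    for (; blocks != 0; blocks--) {
        total += data[0];
        total += data[1];
        total += data[2];
        total += data[3];
        data += 4;
    }
    for (; rest != 0; rest--) {
        total += *data++;
    }
    return total;
}
```

Whether the unrolled version actually wins depends on the core and the compiler settings, so it is worth timing both forms on the target before keeping the longer code.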


The Art of optimizing

 

Art, with a capital A. Optimizing isn't simple; it takes time and effort to produce something that is truly optimized. From experience, it is difficult to get there when deadlines are pushing you to the limit, but careful planning and solid foundations are the key. The very first thing to do is to read the technical specifications of the processor you are going to use, because all your code needs to be written specifically for that system. Generic ANSI C code can be ported to almost anything, but that doesn't mean it is going to work well on this platform. Learn the strengths of the CPU, and use them. Learn the weaknesses of the CPU, and avoid them. In this example I'm using an ARM, and they are exceptionally good in low-power embedded systems. Most of my ARM development has been done on mobile phones, where battery life is crucial. We couldn't have done it with anything else. Their weak point is division; avoid it where possible. More on that later.

 

Choose wisely

 

Most engineers will not start off with highly optimized code, but just good code. Indeed, I advise you not to start off optimizing everything, just write clever code. Later on, you'll get to choose. Optimization comes at a cost, and it takes much longer to create optimized code than just good code. It might take you two hours to create that routine that will scan the memory looking for some data to work on, but optimizing it could take days. Write the program first, then profile your application and see where optimization is needed. Optimizing a large routine that is called just once might not be as critical as optimizing a small routine that is called hundreds of times. Shaving off 50 milliseconds from boot time might not be as important as saving a few microseconds during an interrupt. Know when to optimize, and when to say that your code is just "good enough".

 

General rules

 

There are a few rules to follow concerning C development on ARM, and these apply to almost any processor out there.

Integers

Integers are a vital part of any development. Processors were designed to handle integers, not floating point or any other type of number. Most integer operations take just a few cycles, with one notable exception that we will come to later. As a general rule, use integers that are the same width as the processor's native word; this avoids unwanted conversions later on. While a u16 might be all you need to hold the data, making a heavily used variable a u32 on a 32-bit system will speed things up. When working with a u16, the processor does its arithmetic in a 32-bit register and then needs extra operations to bring the result back into 16-bit range, which costs cycles. Also, if you know that your variable will only hold positive numbers, make it unsigned. Some operations, division by a power of two for example, are cheaper on unsigned values than on signed ones (this is also good practice, and makes for self-documenting code). Always try to make your code use integers. If you need 2 decimal places, multiply your figures by 100 instead of using floating point.
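As a minimal sketch of the "multiply by 100" advice, the hypothetical helpers below keep values as whole hundredths in an unsigned 32-bit integer. The type name fixed2_t, the function names and the choice of uint32_t are all assumptions made for illustration.

```c
#include <stdint.h>

/* A value with two decimal places, stored as a count of hundredths:
 * 12.34 is held as the integer 1234. */
typedef uint32_t fixed2_t;

/* Builds a fixed2_t from a whole part and a hundredths part (0 to 99). */
fixed2_t fixed2_make(uint32_t whole, uint32_t hundredths)
{
    return whole * 100u + hundredths;
}

/* Addition works directly on the scaled values. */
fixed2_t fixed2_add(fixed2_t a, fixed2_t b)
{
    return a + b;
}

/* Multiplying by a plain integer count keeps the same scale. */
fixed2_t fixed2_scale(fixed2_t a, uint32_t n)
{
    return a * n;
}
```

With this layout, the only place a division by 100 is needed is when a value has to be split back into whole and fractional parts for display, which is usually well away from the hot path.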

Division

This is the Achilles heel of the ARM processor. General rule: if you can avoid dividing, avoid it. Many ARM cores have no hardware divide instruction and rely on library routines instead. A 32-bit division can take up to (and in some cases more than) 120 cycles. Sometimes you can get away with multiplying instead of dividing, especially when comparing. (a / b) > c can sometimes be rewritten as a > (c * b), provided b is positive and the product cannot overflow. Bear in mind that integer division truncates, so the two forms can disagree when a is not an exact multiple of b.
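The comparison rewrite needs a little care with integers. The sketch below, with hypothetical function names, shows one way to make it exact for unsigned values: because the division truncates, (a / b) > c is true exactly when a >= (c + 1) * b. A 64-bit product is used here as an assumption to rule out overflow; whether that is affordable depends on the target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reference version: one division per call. b must not be zero. */
bool ratio_exceeds_div(uint32_t a, uint32_t b, uint32_t c)
{
    return (a / b) > c;
}

/*
 * Division-free version for b > 0. The quotient a / b is greater
 * than c exactly when a is at least (c + 1) * b. Widening to 64 bits
 * means the product cannot overflow.
 */
bool ratio_exceeds_mul(uint32_t a, uint32_t b, uint32_t c)
{
    return (uint64_t)a >= ((uint64_t)c + 1u) * (uint64_t)b;
}
```

For example, with a = 5, b = 2 and c = 2 both functions return false, whereas the naive test 5 > 2 * 2 would return true.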
Division by 2

There is an exception to the division rule: dividing by 2. Technically this isn't a division at all, and the compiler does not treat it as one. It ends up emitting a shift, which costs considerably fewer cycles. The same applies to any power of 2: dividing by 2, 4, 8, 16, 32, etc., simply shifts the contents of a register.
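A compiler normally makes this substitution on its own when the divisor is a constant power of 2. The sketch below, using hypothetical helper names, spells out the equivalent operations for unsigned values, which can be handy when the exponent is only known at run time. It is limited to unsigned types on purpose: signed division rounds toward zero, so a plain shift is not equivalent for negative numbers.

```c
#include <stdint.h>

/* Unsigned divide by 2^k: the same result as a shift right by k.
 * k must be less than 32. */
uint32_t div_pow2(uint32_t x, unsigned k)
{
    return x >> k;
}

/* Unsigned remainder modulo 2^k: keep only the low k bits.
 * k must be less than 32. */
uint32_t mod_pow2(uint32_t x, unsigned k)
{
    return x & ((UINT32_C(1) << k) - 1u);
}
```

So div_pow2(x, 3) gives the same value as x / 8, and mod_pow2(x, 3) the same value as x % 8, without calling a division routine.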

Last Updated on Saturday, 02 July 2011 14:48
 

Comments  

 
Quite simply brilliant! Els 2009-10-07 16:32
I'm not a dev for one cent, but I still understand what you say about optimisation

Optimisation is the key, and as I often say about computer dev, today's computing power should never have been an excuse to stop optimising!