ARM C optimizations
Written by JLangbridge
Wednesday, 07 October 2009 07:56
Whilst you can get away with just about anything on a desktop or server system most of the time, embedded systems programming is sometimes thought of as an art. Optimizing might take longer, but there are huge benefits to writing highly optimized code: the system might cost less because you don't need a high-end processor, or you might get a strategic advantage because your system runs longer on the same charge.
Timed speedups
The best way to show what optimizing can do is with a very simple example. After each change, I recompiled the code to see its size and the time it took to run. Everything around me is measured in gigahertz: the 2 PCs at my desk, the netbook I bring to work, my home PC... The development board in front of me is running at 400MHz. Embedded systems cannot scale like PCs; their processing power is limited, so you have to be careful. There are a few tricks you can follow that can drastically speed up the system. It was also time for me to see all the bad habits I had picked up whilst doing development for x86 and x86_64 platforms.
void loopit(void)
{
    u16 i;          //Internal variable
    iGlobal = 0;    //Global variable

    //16 bit index incrementation
    for (i = 0; i < 16; i++)
    {
        iGlobal++;
    }
}
This is extremely simple; it loops 16 times, each time incrementing a global variable. Compiled, transferred, run. The application code comes in at 24 bytes, and on the dev platform it runs in 138µs. Most people would think that is great, but embedded engineers go pale at this. There is nothing strictly wrong with the code, it is perfectly good C, but there are things you can do. ARM systems, and indeed a lot of systems, actually do better when they count down to zero. The reason is simple: an arithmetic instruction can update the processor's status flags as a side effect, and one of those flags records whether the result was zero. On this platform, the bit in question is Z in the CPSR register. At the end of each loop iteration, we compare i to an integer, which takes a separate instruction. If we counted down to zero instead, the decrement itself would set Z and we could branch on it directly, saving cycles. So, let's make a quick change, and count down to zero:
void loopit(void)
{
    u16 i;          //Internal variable

    //16 bit index decrementation
    for (i = 16; i != 0; i--)
    {
        iGlobal++;
    }
}
void loopit(void)
We haven't done a lot here; all we have done is to declare a new variable and use that for the loop instead. Once again, there is no size change; there are new codes for accessing a register, but we don't have the codes to access RAM. The fact that our variable is now in a register is a considerable speed boost; our execution time is now down to 52µs. But we can still do better... The loop creates a lot of overhead, and if possible, unrolling a loop can help.
The Art of optimizing
Art, with a capital A. Optimizing isn't that simple; it takes time and effort to have something that is truly optimized. From experience, it is difficult to have code that is truly optimized when you are pushed to the limits by deadlines, but careful planning and solid foundations are the key. The very first thing to do is to read the technical specifications of the processor you are going to use; all your code needs to be written specifically for this system. Generic ANSI C code can be ported to almost everything, but that doesn't mean that it is going to work well on this platform. Learn the strengths of the CPU, and use them. Learn the weaknesses of the CPU, and avoid them. In this example I'm using an ARM, and they are exceptionally good in low-power embedded systems. Most of my ARM development has been done with mobile phones, where battery life is crucial. We couldn't have done it with anything else. Their weak point is division; avoid it where possible. More on that later.
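The last two steps of the loopit example (keeping the counter in a local variable, then unrolling the loop) might look something like the sketch below. This is a guess at the shape of the code, not the author's listing: the u16 typedef, the type of iGlobal, the local name count, the function names and the unroll factor of 4 are all assumptions made for illustration.

```c
#include <stdint.h>

/* Assumed definition: the article uses u16 without showing its typedef. */
typedef uint16_t u16;

/* Assumed type: the article never shows the declaration of iGlobal. */
u16 iGlobal;

/* Count down to zero and accumulate in a local variable, which the
 * compiler is free to keep in a register. RAM is written only once. */
void loopit_local(void)
{
    u16 i;
    u16 count = 0;      /* hypothetical local stand-in for iGlobal */

    for (i = 16; i != 0; i--)
    {
        count++;
    }
    iGlobal = count;    /* single store back to the global */
}

/* Same work unrolled by four: the loop test and branch run 4 times
 * instead of 16, at the price of a larger loop body. */
void loopit_unrolled(void)
{
    u16 i;
    u16 count = 0;

    for (i = 4; i != 0; i--)
    {
        count++;
        count++;
        count++;
        count++;
    }
    iGlobal = count;
}
```

Both functions leave 16 in iGlobal. With optimization switched on, a modern compiler may well reduce either loop to a single constant store, so the sketch only illustrates the idea; real gains have to be measured on the target, as with the timings quoted above.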
Choose wisely
Most engineers will not start off with highly optimized code, but just good code. Indeed, I advise you not to start off optimizing everything; just write clever code. Later on, you'll get to choose. Optimization comes at a cost, and it takes much longer to create optimized code than just good code. It might take you two hours to create that routine that will scan the memory looking for some data to work on, but optimizing it could take days. Write the program first, then profile your application and see where optimization is needed. Optimizing a large routine that is called just once might not be as critical as optimizing a small routine that is called hundreds of times. Shaving off 50 milliseconds from boot time might not be as important as saving a few microseconds during an interrupt. Know when to optimize, and when to say that your code is just "good enough".
General rules
There are a few rules to follow concerning C development on ARM, and they apply to almost any processor out there.
Integers
Integers are a vital part of any development. Processors were designed to handle integers, not floating point or any other type of numeral. Most integer operations can be done in just a few cycles, with one notable exception that we will go into later on. Generally, always use integers that are the same width as the system bus; this will avoid unwanted calculations later on. While a u16 might be all you need to hold the data, if the variable is heavily used on a 32-bit system, making it a u32 will speed things up. When reading in a u16, the processor will invariably read in a u32, then do some operations to transform it into a u16, which costs cycles. Also, if you know that your variable will only handle positive numbers, make it unsigned. Some operations, such as division by a power of 2, are cheaper on unsigned values than on signed ones (this is also good practice, and helps make for self-documenting code). Always try to make your code use integers. If you need 2 decimal places, multiply your figures by 100 instead of using floating point.
Division
This is the Achilles heel of the ARM processor. General rule: if you can avoid dividing, avoid it. Many ARM cores cannot natively divide; they rely on library routines for this. A 32-bit division can take up to (and in some cases more than) 120 cycles. Sometimes you can get away with multiplying instead of dividing, especially when comparing. (a / b) > c can sometimes be rewritten as a > (c * b), provided b is positive and c * b cannot overflow. Bear in mind that integer division truncates, so the two forms can disagree when a is not an exact multiple of b.
Division by 2
There is an exception to the division rule: dividing by 2. This, technically, isn't a division, and the compiler will not treat it as one. The compiler ends up doing a shift, which costs considerably less in terms of cycles. This applies to all powers of 2; dividing by 2, 4, 8, 16, 32, etc., will simply shift the contents of a register.
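The integer and division rules above can be put together in a short sketch. Everything named here (the u32 typedef, SCALE, fixed_from_parts, quotient_exceeds, div_by_8) is hypothetical and chosen for illustration only; none of it comes from the article's own code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed definition: the article uses u32 without showing its typedef. */
typedef uint32_t u32;

/* Fixed point with two decimal places: 12.34 is stored as 1234. */
#define SCALE 100u

u32 fixed_from_parts(u32 whole, u32 hundredths)
{
    return whole * SCALE + hundredths;
}

/* Division-free test for (a / b) > c with unsigned operands and b != 0.
 * Because integer division truncates, the exact equivalent is
 * a >= (c + 1) * b. The product is widened to 64 bits so it cannot
 * overflow. */
bool quotient_exceeds(u32 a, u32 b, u32 c)
{
    return (uint64_t)a >= ((uint64_t)c + 1u) * b;
}

/* Unsigned division by a power of 2 written out as the shift a compiler
 * would typically emit for x / 8. */
u32 div_by_8(u32 x)
{
    return x >> 3;
}
```

For example, quotient_exceeds(7, 2, 3) is false, matching 7 / 2 == 3, whereas the simpler test 7 > 3 * 2 would have answered true. The shift form is only equivalent for unsigned values; a signed division by a power of 2 needs an extra correction for negative operands, which is one more reason to prefer unsigned types where the data allows it.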
Last Updated on Saturday, 02 July 2011 14:48





Comments
Optimisation is the key, and as i often say about computer dev, the today computer power should never have been an excuse to stop optimisation!