I’ve just tried some more examples with jmv38’s code.
First, the good news: a single for-loop is faster than a double for-loop.
Next, the bad news: the difference is so insignificant that it is wiped out if you have to do anything to compensate for the use of a single loop instead of a double loop.
With identical code inside the loop I get the following figures:
- Double loop: 1.91815s
- Single loop: 1.91681s
(total iterations the same as jmv38: 16e6; I’m on an iPad2)
So that’s a gain of less than 0.01%
Now suppose I really want to get at the variables in the double loop. So something like the original dummy = i + j
. (Note: there’s no need to make i
and j
local as that’s automatic inside the double loop, and dummy
should be made local.) The double loop simply has dummy = i + j
. The single loop has to compute the i
and j
. There are a variety of ways to do this. The code in jmv38’s example is very expensive. A cheaper way to do it is with local variables x
and y
which are initially set to 1
and 1
and an auxiliary (local) variable w
set to width + 1
; then in the loop (at the end) we do:
x = x + 1
if x == w then
y = y + 1
x = 1
end
The times are now:
- Double loop: 1.9252s
- Single loop: 4.13666s
The above might seem a little wasteful as we always do x = x + 1
and sometimes throw away the result. Surely an if ... then ... else ... end
would be better!
if x == w then
y = y + 1
x = 1
else
x = x + 1
end
(here w
is set to width
) gives 4.38278s! So we lose .2s by introducing an else
branch into our conditional. In fact, removing the conditional altogether (whence just having the x = x + 1
statement - whence the loops are no longer functionally equivalent) drops the time to 2.4726s. So the conditional is adding over a second to the time whilst the addition operation is adding only half a second. We can narrow this down even further by simply putting in a conditional and nothing else:
if x == w then
end
(so no increment on x
) gives 3.57404s.
I wondered about optimising in other ways. How about caching the values of i+j
and simply using a look-up? The point of this would be that if you are going to do the same loop lots of times it might be quicker to save the results on the first loop and reuse them. Sadly, no. Storing i + j
in a table and then putting dummy = t[i]
in the loop does not help: I got 3.40796s for that (just for the look-up loop, that is).
Now there are lots of ways to compute x + y
from i
(where x = i%width + 1
and y = math.floor(i/width) + 1
) using a variety of tests and operations. I’ve yet to find one that brings the total computation down to anywhere near comparable to the double loop.
However, this experimenting does say that it is worth considering alternative ways to do necessary computations inside the loops. For example, % is more expensive than / by a long way, but using math.floor
even more so (even when localised).