Fastest 3D angle/axis rotation

@daveedvdv If you look at the list of links at the top of the page you’ll see one called “Issue Tracker”. That’s the best place to post feature requests.

@Andrew_Stacey: Wow, I completely missed that. Thanks for the tip… I’ll try to write up something tomorrow.

@Ignatz: I found your series of blog posts about lighting, and added that to the demo above. It’s pretty neat. I love your patient explanation of that stuff, BTW — many thanks for that! For now, I mostly just used the code as is, but I’ll study it more. One thing I want to look at is the effect of the normals (the ones in the cylinder generator aren’t the normals on the triangles but on the idealized cylinder).

@daveedvdv Not being a proper programmer, I tend to go for the “natural” implementation rather than the best one, so if you have any suggestions for optimisation in the Quaternion code then I’d love to hear them.

I wrote a post/ebook on Quaternions and why to use them, if you’re interested you can read it in the Codea section of my website. Unlikely to be anything you don’t already know, though.

@Andrew_Stacey: I love your write-ups: Even the typography is awesome! (Although I learned about quaternions in high-school, I’m not really a graphics programmer and so my knowledge is quite theoretical. All refreshers are useful.)

I’ll try to find some time to take a shot at optimizing some of your routines. For an interpreted environment (like Lua) the trick is often to avoid writing out explicit operations and instead have them done implicitly (i.e., use vec:dot, vec:cross, etc. instead of the underlying expressions). I’ll follow up in this thread if I have something significant.
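To illustrate the general idea in plain Lua (Codea’s vec types aren’t available outside the app, so this sketch uses the C-implemented table.concat as a stand-in for an intrinsic like vec:dot):

```lua
-- Plain-Lua analogue of "let the intrinsic do it": building a string
-- with explicit Lua-level concatenation versus one call into the
-- C-implemented table.concat.
local function join_explicit(parts, sep)
    local s = parts[1] or ""
    for i = 2, #parts do
        s = s .. sep .. parts[i]  -- repeated copying, all in interpreted code
    end
    return s
end

local function join_intrinsic(parts, sep)
    return table.concat(parts, sep)  -- one call into the C runtime
end

print(join_explicit({"a", "b", "c"}, ","))   -- a,b,c
print(join_intrinsic({"a", "b", "c"}, ","))  -- a,b,c
```

Both produce the same result; the difference is how much of the work runs in the interpreter loop versus inside the runtime.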

Do you have a test suite for your library?

(Too bad Apple has made it impossible to implement JIT compilers on iOS. It would be really interesting to see how things are different using LuaJIT – see http://luajit.org .)

@daveedvdv At the end of Quaternion.lua I have a rudimentary test suite. It’s probably lacking a lot: I tended to add something there when I was getting odd results and wanted to be sure that the basic operations were right.

@Andrew_Stacey: Which shows that I didn’t read it carefully enough — sorry :-/ Thanks, that should be enough of a safety net…

@Andrew_Stacey:
Ok, I played with this a bit and got some improvements but not nearly as much as I’d hoped for. Here is a bit of changed code:

-- We run into difficulties if our vec4s contain NaNs or infs.  This
-- tests for those.
-- (Benchmarked & optimized)

local is_finite = function(q)
    local huge = math.huge
    local hugeneg = -huge
    local x, y, z, w = q.x, q.y, q.z, q.w
    return x < huge and x > hugeneg
        and y < huge and y > hugeneg
        and z < huge and z > hugeneg
        and w < huge and w > hugeneg
end
mtq["is_finite"] = is_finite

--[[
Test if we are real.
--]]

-- (Benchmarked)
mtq["is_real"] = function (q)
    return q.y == 0 and q.z == 0 and q.w == 0
end
 
--[[
Test if the real part is zero.
--]]

-- (Benchmarked)
mtq["is_imaginary"] = function (q)
    return q.x == 0
end
 
--[[
Normalise a quaternion to have length 1, safely.  Here "safely" means
that we ensure that we do not have NaNs or infs.  If we do, the
returned quaternion is the unit.
--]]
-- (Benchmarked & optimized)
local normalise = function (x)
    local q = normalize4(x)
    if is_finite(q) then
        return q
    else
        return vec4(1,0,0,0)
    end
end
mtq["normalise"] = normalise
 
--[[
Spherical length and distance.
--]]
-- (Benchmarked & optimized)
mtq["slen"] = function(q)
    q = normalise(q)
    q.x = q.x - 1
    return 2*asin(len4(q)/2)
end

-- (Benchmarked & optimized)
mtq["sdist"] = function(q1, q2)
    return 2*asin(dist4(normalise(q1), normalise(q2))/2)
end
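Incidentally, the 2*asin(…/2) in slen and sdist is the chord-to-arc conversion: two unit quaternions an angle theta apart on the unit sphere are a straight-line (chord) distance 2*sin(theta/2) apart. A plain-Lua sanity check of that identity (no Codea types needed):

```lua
-- Check the chord-to-arc identity behind slen/sdist:
-- chord = 2*sin(theta/2)  <=>  theta = 2*asin(chord/2).
local sin, asin, abs = math.sin, math.asin, math.abs
for _, theta in ipairs({0.1, 0.5, 1.0, 2.0, 3.0}) do
    local chord = 2 * sin(theta / 2)
    assert(abs(2 * asin(chord / 2) - theta) < 1e-12)
end
```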

The routines commented with “(Benchmarked & optimized)” get some speedup, but nothing like 2x (except for is_finite). The routines commented with just “(Benchmarked)” were benchmarked in various guises, but I couldn’t make them noticeably faster. (normalize4 is bound to vec4().normalize at the top; similarly for len4.)

I’m particularly disappointed in the qq product, for which I came up with the following:

    -- Split each factor into its real part (x) and imaginary
    -- 3-vector (y, z, w), then combine with dot and cross products.
    local ax = a.x
    local bx = b.x
    local c = vec3(a.y, a.z, a.w)
    local d = vec3(b.y, b.z, b.w)
    local ri = ax*d + bx*c + cross(c, d)    -- imaginary part of the product
    return vec4(ax*bx - dot3(c, d), ri.x, ri.y, ri.z)

I was convinced that would pay off big because most of the multiply/adds move to vector intrinsics, but instead I got a 2x slowdown :frowning: I’ve no idea why; I’d love to hear an analysis from someone familiar with the Codea/Lua internals.

I did notice that in general vec-indexing is a bit expensive. So you could still get a little speedup in the qq product from locally caching components with something like

    local ax, ay, az, aw = a.x, a.y, a.z, a.w
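Spelled out fully, that looks like the sketch below, written against plain {x, y, z, w} tables with x as the real part so it runs outside Codea; inside Codea you’d return a vec4 instead:

```lua
-- qq product with every component cached in a local up front.
-- x is the real part; y, z, w are the imaginary parts.
local function qmul(a, b)
    local ax, ay, az, aw = a.x, a.y, a.z, a.w
    local bx, by, bz, bw = b.x, b.y, b.z, b.w
    return {
        x = ax*bx - ay*by - az*bz - aw*bw,
        y = ax*by + ay*bx + az*bw - aw*bz,
        z = ax*bz - ay*bw + az*bx + aw*by,
        w = ax*bw + ay*bz - az*by + aw*bx,
    }
end

-- i * j = k in this convention:
local k = qmul({x=0, y=1, z=0, w=0}, {x=0, y=0, z=1, w=0})
print(k.x, k.y, k.z, k.w)  -- 0  0  0  1
```

Each input field is read exactly once, so the per-access indexing cost is paid eight times total rather than once per appearance in the formula.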

Do you see any performance penalty from going through metatables instead of calling the functions directly? It’s interesting to follow the performance differences.

You can find the old runtime here https://github.com/TwoLivesLeft/Codea-Runtime but the current runtime is sadly closed source for the moment.

@tnlogy: Good that you mention it. I just realized that the Quaternion library overrides the vec3 metatable’s __mul entry, which is the culprit in slowing down my “clever” rearrangement of the qq multiplication. When I disable that override, the code above does get faster, but only a little.

So, yes, I’d be careful about modifying metatables for existing types (or at least, existing method entries).

Thanks for the pointer to the older Codea runtime. I’ll try to take a look tomorrow (only four hours of sleep left before my work meetings tomorrow :stuck_out_tongue: ).

@tnlogy, @Andrew_Stacey:
Oh, and the __add and __sub entries were also diverted. With that disabled, my version does a million qq multiplies in 8 seconds, compared to over 13 seconds for the original code. As mentioned, I think the original code would also get some speedup from caching vec4 components, so it’s not all from the intrinsics.

One way to “have my __mul and eat it” would be to explicitly use the original vec3.__mul function in your optimised quaternion code.

Presumably, is_finite would speed up a little more by moving the local huge = math.huge assignment outside the function.
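That hoisting looks like this (a plain-table sketch, runnable outside Codea):

```lua
-- is_finite with the math.huge lookups hoisted out of the function,
-- so they happen once at load time rather than on every call.
local huge = math.huge
local hugeneg = -huge
local function is_finite(q)
    local x, y, z, w = q.x, q.y, q.z, q.w
    return x < huge and x > hugeneg
        and y < huge and y > hugeneg
        and z < huge and z > hugeneg
        and w < huge and w > hugeneg
end
```

The NaN case falls out of the comparisons for free: NaN < huge is false, so any NaN component makes the whole chain false.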

I suspect that to get significant speed-ups one would have to provide versions of the functions that don’t validate their input. Anyone who found that validation was the real bottleneck in their program could then use the non-validating ones, whilst those who prefer the safety net could keep the validating ones. That was, after all, the point of things like the is_finite routine. The trouble with the non-validating versions is that it isn’t always obvious to the user where validation needs to be inserted. For example, some issues arise from precision: with the slerping, I was finding wild things happening when the quaternions were very close together: sqrt(1 - c) (or whatever it was) would actually return an answer bigger than 1, which then caused Big Problems when fed into acos.
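A common guard for exactly that failure mode is to clamp the value into [-1, 1] just before acos or asin sees it (plain-Lua sketch; the helper name safe_acos is my own):

```lua
-- Clamp into [-1, 1] before acos, so tiny floating-point overshoots
-- (e.g. a value like 1.0000000000000002) can't turn into NaN.
local max, min, acos = math.max, math.min, math.acos
local function safe_acos(c)
    return acos(max(-1, min(1, c)))
end

local t = safe_acos(1 + 1e-15)  -- returns 0; plain acos here would be NaN
```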

So if you can guarantee your input will work, you can skip a lot of the tests. But this is the kind of optimisation I’d only do right at the end.

On the other hand, the optimisations that you have done look good. I’ll put them in to the code.

And going back to my question about comparing dot products with multiplication, I’d like to try comparing vec3(a:dot(v), b:dot(v), c:dot(v)) with the version of v.x * a + v.y * b + v.z * c that doesn’t use inline notation.

@Andrew_Stacey:
Is this what you’re looking for?

function setup()
    local r1, r2, r3, c1, c2, c3, t0, v, w
    -- Matrix by rows
    r1 = vec3(1, 2, 3)
    r2 = vec3(4, 5, 6)
    r3 = vec3(7, 8, 9)
    -- Same by columns
    c1 = vec3(r1.x, r2.x, r3.x)
    c2 = vec3(r1.y, r2.y, r3.y)
    c3 = vec3(r1.z, r2.z, r3.z)


    local dot3 = vec3().dot
    local mul3 = vec3().__mul
    local add3 = vec3().__add

    v = vec3(1, -1, 0)
    t0 = os.clock()
    for n = 1, 1000000 do
        w = vec3(dot3(r1, v), dot3(r2, v), dot3(r3, v))
    end
    print("Time (s) by rows "..(os.clock()-t0))
    print(w)

    t0 = os.clock()
    for n = 1, 1000000 do
        w = add3(add3(mul3(v.x, c1), mul3(v.y, c2)), mul3(v.z, c3))
    end
    print("Time (s) by cols "..(os.clock()-t0))
    print(w)
end

(I get about 1.2 seconds by rows and 3.6 by columns.)
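The os.clock pattern in setup generalises to a tiny reusable harness (plain-Lua sketch, nothing Codea-specific; the helper name bench is my own):

```lua
-- Minimal micro-benchmark helper in the style of the loops above:
-- run fn n times and return the elapsed CPU seconds.
local function bench(label, n, fn)
    local t0 = os.clock()
    for i = 1, n do
        fn(i)
    end
    local dt = os.clock() - t0
    print(("Time (s) %s %g"):format(label, dt))
    return dt
end

bench("noop x 1e6", 1000000, function() end)
```

As with all micro-benchmarks, the loop overhead and the empty-function baseline are part of what you measure, so comparisons between two variants are more trustworthy than absolute numbers.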

One more observation: for the column-wise matrix multiply, about 0.6 of the 3.6 seconds appears to be spent computing v.x, v.y, and v.z. It would be nice if the intrinsic composite types (vectors and buffers in particular) had native unpack methods so you could write something like:

   local v1, v2, v3 = buf:unpackn(i, 3)
   local x, y, z = v1:unpack()

I suspect a good chunk of the remaining difference is due to add3 and mul3 having to handle mixed vector-scalar operations too.

@daveedvdv Lua has a native unpack() function for tables. I don’t know what the result is on a vec3, though.

@Jvm38: Indeed, but AFAICT there is no such facility for vecs and buffers. I think I saw a request for it in the issue tracker with a positive note from the developers of Codea.
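For plain tables at least, @Jvm38’s suggestion looks like this (Lua 5.1 spells it unpack, 5.2+ table.unpack; Codea’s vec3 is userdata rather than a table, so as noted this probably doesn’t carry over):

```lua
-- Unpack a plain Lua table into multiple values (tables only; it does
-- not work on userdata types like Codea's vec3).
local unpack = table.unpack or unpack  -- 5.2+ / 5.1 compatibility
local x, y, z = unpack({1, -1, 0})
print(x, y, z)  -- 1  -1  0
```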

@daveedvdv Interesting analysis.

I’m not sure about your final statement about add3 and mul3. Surely add3 only makes sense for two 3-vectors and mul3 only for a scalar and a vector. The only ambiguity I can think of is that mul3 might have to allow for its arguments to be either way around.

@Andrew_Stacey: You’re right. I’m not sure why I was thinking just then that those functions had mixed type capabilities… So in the end I really don’t know what causes the discrepancy. Too bad we cannot easily hook up Codea to a debugger :stuck_out_tongue:

@daveedvdv - If you haven’t seen this before, Lua performance tips from its main developer

www.lua.org/gems/sample.pdf

and some optimisation tips

http://lua-users.org/wiki/OptimisationTips

I also found some interesting speed comparisons

http://benchmarksgame.alioth.debian.org/u32/benchmark.php?test=all&lang=lua&lang2=v8&data=u32

http://flux242.blogspot.de/2013/09/javascript-vs-perl-vs-python-vs-lua.html

Awesome, @Ignatz. I’m reading the first one now and will try to work my way through the others. (I’ve also been reading the source of the Lua VM to understand its strengths and weaknesses.)

I was actually thinking of writing up something specific for Codea, but this makes it less compelling. (OTOH, the buffer and vector types of Codea offer a whole other avenue for optimization.)