Optimising vector performance

I recently discovered that a simple three-line function, with one normalize and one dot, could be sped up more than 4x by getting rid of the vectors.

What I mean by that is breaking the vectors up into separate x, y, z (scalar) values, doing the calculations on each separately, and avoiding expensive operations like square roots.

So this

function IsVisible(pos,radius) --pos is vec2, radius a number
  local p=pos+cameraDirection*(radius*tangentAdjust)
  local v=(p-cameraPos):normalize()
  return v:dot(cameraDirection)>cosFOV
end

can be made over 4x faster with this

function IsVisible(pos,radius)
    local px,py=pos.x,pos.y
    local dx,dy=px-camposX,py-camposY
    --early out: the object is behind the camera
    if dx*camdirX+dy*camdirY<0 then return end
    local u=radius*tangentAdjust
    local ptx,pty=px+camdirX*u-camposX,py+camdirY*u-camposY
    local sq=ptx*ptx+pty*pty --squared length, no sqrt needed
    local a=ptx*camdirX+pty*camdirY
    --squared form of a/sqrt(sq)>cosFOV, where cosFOV2=cosFOV*cosFOV
    return a*a>cosFOV2*sq
end
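
The last two lines are the standard trick for avoiding normalize's square root: with sq = |pt|² and a = pt·camdir, the test a/√sq > cosFOV is equivalent to a² > cosFOV²·sq whenever a ≥ 0, which the early-out guarantees. A minimal standalone sketch of the equivalence, using plain numbers rather than the Codea API:

```lua
-- Squared-cosine test: (a/sqrt(sq) > cosFOV) versus (a*a > cosFOV2*sq).
-- The squared form is only valid when a >= 0, which the early
-- back-facing test in the function above ensures.
local function insideFOV_slow(ptx, pty, camdirX, camdirY, cosFOV)
  local len = math.sqrt(ptx*ptx + pty*pty)
  return (ptx*camdirX + pty*camdirY) / len > cosFOV
end

local function insideFOV_fast(ptx, pty, camdirX, camdirY, cosFOV2)
  local sq = ptx*ptx + pty*pty
  local a  = ptx*camdirX + pty*camdirY
  return a >= 0 and a*a > cosFOV2*sq
end

local cosFOV  = math.cos(math.rad(30)) -- 60-degree total field of view
local cosFOV2 = cosFOV*cosFOV
-- a point 10 degrees off the view axis (1,0) should be visible
local x, y = math.cos(math.rad(10)), math.sin(math.rad(10))
print(insideFOV_slow(x, y, 1, 0, cosFOV),
      insideFOV_fast(x, y, 1, 0, cosFOV2))
```

Both versions agree on every point in front of the camera; the fast one just never takes a square root.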

Simeon explains it like this:

The performance difference appears to be due to allocations: every vector mult / sub / add has to allocate a new vector object as Lua userdata to return its result. The overhead of the allocations accounts for all the difference in performance.
I'm going to look into whether we can come up with an alternate memory allocator for lots of small short-lived objects.

Note that this problem shows up because the vectors here are short-lived, created and destroyed constantly. Using vectors in a more long-term scenario should be totally fine (the overhead will not really be noticeable without lots of operations).
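
Simeon's point can be illustrated outside Codea. Codea's vec2 is C userdata, but a table-based vec2 stand-in (assumed here purely for illustration, not Codea's implementation) shows the same pattern: every overloaded + or * allocates a fresh object, while the scalar version allocates nothing inside the loop:

```lua
-- Table-based vec2 stand-in (Codea's real vec2 is C userdata; this
-- just makes the per-operation allocation visible in plain Lua).
local mt = {}
mt.__add = function(a, b) return setmetatable({x = a.x + b.x, y = a.y + b.y}, mt) end
mt.__mul = function(a, s) return setmetatable({x = a.x * s, y = a.y * s}, mt) end
local function vec2(x, y) return setmetatable({x = x, y = y}, mt) end

local N = 100000
local dir, scale = vec2(0.6, 0.8), 2.5

-- vector form: three allocations per iteration (the literal, the *, the +)
local t0 = os.clock()
local accX = 0
for i = 1, N do
  local p = vec2(i, i) + dir * scale
  accX = accX + p.x
end
local tVec = os.clock() - t0

-- scalar form: no allocations inside the loop
t0 = os.clock()
local accX2 = 0
local dx = dir.x
for i = 1, N do
  local px = i + dx * scale
  accX2 = accX2 + px
end
local tScalar = os.clock() - t0

print(("vector %.3fs  scalar %.3fs"):format(tVec, tScalar))
```

The exact ratio will differ from Codea's userdata vectors, but the shape of the result is the same: the allocations, not the arithmetic, dominate.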

You haven’t linked to the blog post!

https://coolcodea.wordpress.com/2015/11/17/optimal-culling/

Thanks for the shout out.

A 4x increase is pretty amazing; I can see I'm going to have to rewrite chunks of code where I'm having performance issues. It's a shame in a way, though, as the three-line version of the function is so much more readable than the fast one.

Just a note that you should be able to do:

local px, py = pos:unpack()

To get the elements out of a vector for the purposes of writing a decomposed version of a function.
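
A pure-Lua stand-in for unpack (Codea defines it on its vec2 userdata; the behaviour assumed here is that it returns the components as multiple values) shows the intent, namely pulling the components out once at the top of the decomposed function:

```lua
-- Stand-in vec2 with an unpack method, mirroring Codea's pos:unpack()
-- (assumed behaviour: returns the components as multiple return values).
local function vec2(x, y)
  return {x = x, y = y, unpack = function(self) return self.x, self.y end}
end

local pos = vec2(3, 4)
local px, py = pos:unpack()
print(px, py)  --> 3  4
```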

I did some more testing, and while functions like normalize seem to be fairly well optimised, vector arithmetic can be slower than the scalar equivalent.

For example, v1*a+v2 can be much slower than vec2(v1.x*a+v2.x,v1.y*a+v2.y)

(Also, for some reason, a^0.5 is way faster than sqrt(a), even if you've localized math.sqrt.)
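
Whether ^0.5 actually beats a localized math.sqrt depends on the Lua build (Codea's interpreter, LuaJIT, and stock Lua can all differ), so it's worth timing on the actual device. A minimal harness:

```lua
-- Compare a^0.5 against a localized math.sqrt. Results vary by Lua
-- implementation, so treat this as a measurement harness, not a rule.
local sqrt = math.sqrt
local N = 1000000

local t0 = os.clock()
local s1 = 0
for i = 1, N do s1 = s1 + i^0.5 end
local tPow = os.clock() - t0

t0 = os.clock()
local s2 = 0
for i = 1, N do s2 = s2 + sqrt(i) end
local tSqrt = os.clock() - t0

print(("^0.5 %.3fs  sqrt %.3fs"):format(tPow, tSqrt))
```

Both loops compute the same sum, so any timing difference is down to the operator versus the function call.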