GLES 3.0, instancing, and other changes to shaders in Codea 2.3.2

I have some disappointing news about instancing. I tried it with a simple OBJ model to see what the performance was like, and instancing was significantly slower than just drawing the mesh at multiple locations. The model has 4572 verts and I drew 60 copies (274320 verts altogether, which is not all that high). Drawing the mesh at 60 locations keeps the framerate at 60, while drawing the mesh once with 60 instances drops it to around 45-47. =((

I’ll post some code if I can clean it up a little.

I guess more testing (and reading) is necessary to work out which situations actually benefit from instancing. Maybe it is better suited to particle-type systems, with 100s of instances of a relatively small amount of geometry, rather than, as here, smallish batches of larger models.

@yojimbo2000 interesting, I wonder if it varies by hardware.

Edit: if you have a sample where instancing is slower, share it here and I’ll put it through the profiler to see what’s going on.

@Simeon here is the code:

https://gist.github.com/Utsira/5dcfd0ad57ceed56d8d2

It will download the model you choose from GitHub. 60 copies of model number 2 (the default) is what I’ve used for testing. If you press “Load Normal”, it will load the model and display the copies by drawing the mesh repeatedly. If you press “LoadInstanced”, it will display the copies with instancing. I even swapped out the specular highlight shader for plain diffuse shading in the instanced version to try to get it to run faster.

I’m on an iPad Air 1.

Ok this is cool. It’s the same example that @Simeon first posted, but instead of supplying an array of transformations, it calculates the positions from the gl_InstanceIDEXT variable (the instance number).

Interestingly, I could only access this variable by turning instanced drawing on with #extension GL_EXT_draw_instanced: enable. @Simeon I’m a bit confused. I’d assumed that the Codea mesh API must have been adding this line to the shaders automatically for instanced drawing to be accessible in GLES 2.0. How are you getting instanced drawing in GLES 2 if you’re not using this extension?

function setup()
    m = mesh()

    m:addRect(0,0,30,30)
    m:setColors(color(255,200,0))
    m.shader = shader(vert, frag)

    numInstances = 100

end

function draw()
    background(40, 40, 50)

    translate(WIDTH/2 - 200, HEIGHT/2 - 200)

    -- mesh:draw can now take a number of instances
    -- and draws that many copies. Here the shader
    -- tells the copies apart with gl_InstanceIDEXT
    -- instead of an instanced buffer
    m:draw(numInstances)
end

vert=[[
#extension GL_EXT_draw_instanced: enable 

uniform mat4 modelViewProjection;

attribute vec4 position;
attribute vec4 color;

varying lowp vec4 vColor;

void main()
{
    vColor = color;
    float xOffset = mod(float(gl_InstanceIDEXT), 10.) * 40. - 2.5;
    float yOffset = float(gl_InstanceIDEXT / 10) * 40. - 2.5;
    vec4 offset = vec4(xOffset, yOffset, 0, 0);
    gl_Position = modelViewProjection * (position + offset);
}
]]

frag=[[

varying lowp vec4 vColor;

void main()
{
    gl_FragColor = vColor;
}
]]
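
If you want to check where each instance should end up without running the shader, here is a minimal sketch of the same offset maths in plain Lua. instanceOffset is a made-up helper name, not part of the Codea API, and the 10 columns, spacing of 40 and bias of -2.5 are just the constants from the vertex shader above.

```lua
-- CPU-side mirror of the offset maths in the vertex shader above.
-- instanceOffset is a hypothetical helper, not part of the Codea API.
function instanceOffset(id, columns, spacing, bias)
    -- shader: mod(float(gl_InstanceIDEXT), 10.) * 40. - 2.5
    local x = (id % columns) * spacing + bias
    -- shader: float(gl_InstanceIDEXT / 10) * 40. - 2.5 (integer division)
    local y = math.floor(id / columns) * spacing + bias
    return x, y
end

-- Print the corners of the 10 x 10 grid of instances
for _, id in ipairs({0, 9, 90, 99}) do
    local x, y = instanceOffset(id, 10, 40, -2.5)
    print(id, x, y)
end
```

The column comes from the remainder and the row from the whole-number part of the division, so instance 15 lands in column 5 of row 1.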

@yojimbo2000 I took your code and added more instances so I could check the average frames per second. Both programs draw 16240 rects. The first program uses instancing and the second just creates a mesh with that many rects. The instanced version runs at an average of 54.5 FPS, while the non-instanced version runs at an average of 59.6 FPS. I think I did everything right.

displayMode(FULLSCREEN)
-- instanced version: averaged 54.5 FPS

function setup()
    m = mesh()
    m:addRect(0,0,4,4)
    m:setColors(color(255,200,0))
    m.shader = shader(vert, frag)
    numInstances = 140*116
    tot,cnt=0,0
end

function draw()
    background(40, 40, 50)
    fill(255)
    tot=tot+DeltaTime
    cnt=cnt+1
    text("Avg FPS  "..string.format("%.2f",cnt/tot),WIDTH/2,HEIGHT-50)
    text("numInstances  "..numInstances,WIDTH/2,HEIGHT-80)
    translate(30,50)
    m:draw(numInstances)
end

vert=[[
    #extension GL_EXT_draw_instanced: enable     
    uniform mat4 modelViewProjection;
    attribute vec4 position;
    attribute vec4 color;    
    varying lowp vec4 vColor;
    void main()
    {   vColor = color;
        float xOffset = mod(float(gl_InstanceIDEXT), 140.) * 5.;
        float yOffset = float(gl_InstanceIDEXT / 140) * 5.;
        vec4 offset = vec4(xOffset, yOffset, 0, 0);
        gl_Position = modelViewProjection * (position + offset);
    }
    ]]

frag=[[    
    varying lowp vec4 vColor;
    void main()
    {   gl_FragColor = vColor;
    }
    ]]

displayMode(FULLSCREEN)
-- non-instanced version: averaged 59.6 FPS

function setup()
    m = mesh()
    xs,ys=140,116
    for x=1,xs do
        for y=1,ys do
            m:addRect(x*5,y*5,4,4)
        end
    end
    m:setColors(color(255,200,0))
    numInstances = xs*ys
    tot,cnt=0,0
end

function draw()
    background(40, 40, 50)
    fill(255)
    tot=tot+DeltaTime
    cnt=cnt+1
    text("Avg FPS  "..string.format("%.2f",cnt/tot),WIDTH/2,HEIGHT-50)
    text("numInstances  "..numInstances,WIDTH/2,HEIGHT-80)
    translate(30,50)
    m:draw()
end
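
Both programs average over every frame since launch, so any slow start-up frames are included in the figure. Below is a minimal sketch of a counter that skips a few frames first. The function names and the warm-up count are my own invention, and whether start-up frames actually skew these particular results is an assumption I have not measured.

```lua
-- Hypothetical helper: average FPS that ignores the first few frames.
-- None of these names are part of the Codea API.
function newFpsCounter(warmupFrames)
    return {skip = warmupFrames or 0, total = 0, count = 0}
end

function addFrame(counter, dt)
    if counter.skip > 0 then
        counter.skip = counter.skip - 1 -- still warming up, ignore this frame
        return
    end
    counter.total = counter.total + dt
    counter.count = counter.count + 1
end

function averageFps(counter)
    if counter.total == 0 then return 0 end -- nothing counted yet
    return counter.count / counter.total
end
```

In the programs above this would replace tot and cnt: create the counter in setup(), call addFrame(counter, DeltaTime) in draw(), and pass averageFps(counter) to string.format.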

@yojimbo2000 I imagine the gl_InstanceID variable might only be available in #version 300 es shaders.
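
For comparison, here is a rough sketch of what the first grid example's shaders might look like written as #version 300 es. The GLSL itself follows the ES 3.00 rules (in/out instead of attribute/varying, a declared output instead of gl_FragColor, and the built-in gl_InstanceID). Whether Codea accepts shader strings that start with #version 300 es is an assumption here, and vert300 and frag300 are made-up names.

```lua
-- Assumption: Codea compiles shader strings that start with "#version 300 es".
-- vert300 and frag300 are hypothetical names for this sketch.
vert300 = [[
#version 300 es

uniform mat4 modelViewProjection;

in vec4 position;
in vec4 color;

out lowp vec4 vColor;

void main()
{
    vColor = color;
    // the instance number is built in to GLSL ES 3.00, so no #extension line
    float xOffset = mod(float(gl_InstanceID), 10.) * 40. - 2.5;
    float yOffset = float(gl_InstanceID / 10) * 40. - 2.5;
    gl_Position = modelViewProjection * (position + vec4(xOffset, yOffset, 0., 0.));
}
]]

frag300 = [[
#version 300 es

in lowp vec4 vColor;
out lowp vec4 fragColor; // GLSL ES 3.00 has no gl_FragColor

void main()
{
    fragColor = vColor;
}
]]
```

The #version line has to be the first thing in the shader source, which works here because Lua drops the newline that directly follows the opening [[. If the assumption holds, m.shader = shader(vert300, frag300) would be used exactly as in the earlier example.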

One nice thing about instancing is that if you’re doing some kind of procedural animation, such as a disintegrating explosion shader, you’d normally have to set up some attribute that indicates which face the vertex belongs to, and where the centre of that face is, whereas that is handled automatically with instancing.

One thing I didn’t take into account with my programs above is that the non-instanced mesh is static, so the same mesh is drawn every frame, which makes it faster. I tried two other programs based on the 2 programs above where I moved all 16240 rects around on every draw cycle. The non-instanced program went from 59 FPS to 4 FPS. The instanced program went from 54 FPS to 47 FPS. So if you want to move a lot of rects around, it looks like instancing works well.

Here are my results on an Air 2

Dave’s code with 120,000 [non moving] rects (2x2 to fit them on the screen)
Non instancing = close to 60
Instancing = 16

Yojimbo’s models - instancing is about 2/3 of the speed of non instancing

I’ve only just started playing around. I will do some more tests and report back.

@Ignatz thank you for the Air 2 results!

My suspicion is that the multi-threaded renderer allows the non-instanced rendering to feed more geometry to the GPU. That is, non-instanced is able to utilise more of the CPU to do geometry uploads to the GPU.

The Air 2 results show a bigger difference because there are 3 CPU cores. The multi-threaded Codea renderer can keep queuing up non-instanced mesh calls.

@Simeon - Thanks for the explanation

Let me know if you want any more tests

Here is an adaptation of @LoopSpace's explosion/disintegration shader. It blows an image up into lots of little fragments. Everything is calculated from the instance ID. It fakes some “noise” by sampling the image texture (so you can see brighter parts of the image fly further when it fragments). You’d get a better result if you uploaded an additional noise texture to the vert shader, but I wanted to keep things simple. My suspicion is that @LoopSpace's original will perform better, because in that version all of the trajectory calculations are performed in advance and then preloaded into buffers. But I do like the simplicity of the Codea side of this version. You only have to define one rect and no buffers, and instancing does the rest.

EDIT: all calculations now done as vec2s

function setup()
    m = mesh()
    local rows = 100 --number of rows and columns
    numInstances = math.tointeger( rows ^ 2) --instanced drawing does not work with, eg 400.0, must be a true integer
    m.texture = "Cargo Bot:Codea Icon"
    local quadSize = 4 --size in pixels of each rect
    m:addRect(0,0,quadSize, quadSize)
    m:setRectTex(1,0,0,1/rows,1/rows)
    m.shader = shader(vert, frag)
    m.shader.rows = rows
    m.shader.quadSize = quadSize
    explode = {animate = 0}
    exploded = 1 --a flag to toggle the explosion
    print("total verts:", numInstances * 6)
    print("tap to explode/ unexplode")
end

function draw()
    background(40, 40, 50)

    translate(WIDTH/2, HEIGHT/2)

    -- mesh:draw can now take a number of instances
    -- and draws that many copies. Here the shader
    -- works everything out from gl_InstanceIDEXT,
    -- so no instanced buffers are needed
    m.shader.time = explode.animate
    m:draw(numInstances)
end

function touched(t)
    if t.state == BEGAN then
        tween.stopAll()
        local target = exploded * 8
        local time = math.abs(target - explode.animate) * 0.5
        tween(time, explode, {animate=target})
        exploded = 1 - exploded
    end
end

vert=[[
#extension GL_EXT_draw_instanced: enable 

uniform mat4 modelViewProjection;
uniform sampler2D texture;
uniform float time;
uniform float quadSize;
uniform float rows;
float texel = 1./rows;
float halfRows = (rows - 1.) * .5;

attribute vec4 position;
attribute vec4 color;
attribute vec2 texCoord;

varying lowp vec4 vColor;
varying mediump vec2 vTexCoord;

const vec2 gravity = vec2(0.,-400. ); //down on the y axis
const float friction = 1. ;

void main()
{
    vColor = color;

    float xOffset = mod(float(gl_InstanceIDEXT), rows) ; //calculate offset based on instance number
    float yOffset = floor((float(gl_InstanceIDEXT) + 0.5) / rows); //whole row index, so quads in the same row share one y
    mediump vec2 texOffset = vec2(xOffset, yOffset) * texel; 
    
    vTexCoord = texCoord + texOffset; //apply offset to texCoord
    xOffset -= halfRows; //make origin the centre
    yOffset -= halfRows;
    vec2 offset = vec2(xOffset , yOffset) * quadSize; //apply offset to position

    vec4 noise = texture2D(texture, texOffset) -vec4(0.5); //sample the texture to add some "noise"
    vec4 noise2 =texture2D(texture, vec2(1.)-texOffset.yx) -vec4(0.5);

    vec2 velocity = (normalize(offset) + ((noise.gr * noise2.rb) * 3. )) * 300.; 
    lowp float angle = time * (noise.b * noise2.g) * 45.;

    highp vec2 A = gravity/(friction*friction) - velocity/friction;
    highp vec2 B = offset - A; 

    float angCos = cos(angle);
    float angSin = sin(angle);
    lowp mat2 rot = mat2(angCos, angSin, -angSin, angCos);
    
    vec2 pos = rot * position.xy;
    pos += exp(-time*friction)*A + B + time * gravity/friction; 

    gl_Position = modelViewProjection * vec4(pos, 0., 1.); 
}
]]

frag=[[
#extension GL_EXT_draw_instanced: enable
uniform sampler2D texture;

varying lowp vec4 vColor;
varying mediump vec2 vTexCoord;

void main()
{
    gl_FragColor = texture2D(texture, vTexCoord) * vColor;
}
]]
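
Reading the vertex shader, the position at a given time is exp(-time*friction)*A + B + time*gravity/friction. As far as I can tell that is the closed-form solution for motion under gravity with linear drag: at time 0 it gives back offset, and the starting velocity is velocity. Here is a minimal plain-Lua sketch of one axis of it for checking values outside the shader. trajectory is a made-up name and the sample numbers are only for illustration.

```lua
-- Hypothetical CPU-side check of the closed-form motion used in the shader.
-- One axis only; the shader applies the same formula to x and y as a vec2.
function trajectory(offset, velocity, gravity, friction, t)
    local A = gravity / (friction * friction) - velocity / friction
    local B = offset - A
    return math.exp(-t * friction) * A + B + t * gravity / friction
end

-- y axis of one fragment: starts 50 units above the centre, launched
-- upwards at 300, with the shader's gravity (-400) and friction (1)
for _, t in ipairs({0, 0.5, 1, 2}) do
    print(t, trajectory(50, 300, -400, 1, t))
end
```

Because the whole path is a function of time, the Codea side only has to tween one uniform, which is what the touched function above does.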

Here is a comment I found on a forum that may help explain why we aren’t seeing better performance from instancing.

"Instancing of this form (that is, sending the same mesh data with different instance data) is generally only useful performance-wise if all of the following are true:

  1. The mesh you want to render instanced is relatively small, in terms of number of vertices, but not too small (at least ~100 vertices, up to around ~5000 or so)

  2. The number of instances of this specific mesh being rendered is large (>1000)"

The OpenGL wiki seems to support this:

“It is often useful to be able to render multiple copies of the same mesh in different locations. If you’re doing this with small numbers, like 5-20 or so, multiple draw commands with shader uniform changes between them (to tell which is in which location) is reasonably fast in performance. However, if you’re doing this with large numbers of meshes, like 5,000+ or so, then it can be a performance problem, and instancing can help.”
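
To see how the tests in this thread sit against the first quote, here is a rough sketch that encodes its two conditions. The thresholds are the approximate figures from that quote, not measurements, the function name is made up, and the quote describes desktop-style guidance that may not carry over to iPad GPUs.

```lua
-- Rough encoding of the rule of thumb quoted above. The thresholds are the
-- quote's approximate figures; instancingLikelyHelps is a hypothetical name.
function instancingLikelyHelps(vertsPerMesh, instanceCount)
    local meshSizeOk = vertsPerMesh >= 100 and vertsPerMesh <= 5000
    local enoughInstances = instanceCount > 1000
    return meshSizeOk and enoughInstances
end

print(instancingLikelyHelps(4572, 60))   -- OBJ model test: few instances
print(instancingLikelyHelps(6, 16240))   -- rect grid test: 6 verts per rect
print(instancingLikelyHelps(500, 5000))  -- a case inside both quoted ranges
```

By this rule neither the 60-copy model test nor the single-rect grid falls in the range where the quote expects instancing to pay off.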

@yojimbo2000 - re your jittery example above, I just reduced the offset size until it became smooth. For me that was 0.00005.

Now that 2.3.2 is out, I removed the beta tag from this thread