Surprisingly slow performance on tight loop with integer operations on SIMD/List get/set #2857

lukehoban · 2024-05-27T19:20:44Z

The following program runs ~2,800 times slower in Mojo than the equivalent Go program (https://github.com/lukehoban/um-go).

import builtin.file
import builtin.io
import time

fn main() raises:
    var program = List[UInt32]()
    var f = file.open("sandmark.umz", "r")
    while True:
        var byts = f.read_bytes(4)
        if len(byts) < 4:
            break
        var i: UInt32 = 0
        for n in range(4):
            i = (i << 8) + byts[n].cast[DType.uint8]().cast[DType.uint32]()
        program.append(i)
    var platters = List[List[UInt32]](program)
    var reg = SIMD[DType.uint32, 8](0,0,0,0,0,0,0,0)
    var finger = 0
    var iteration = 0
    var start = time.now()
    while True:
        var v = platters[0][finger]
        finger = finger+1
        var op = v >> 28
        var a = int((v >> 6) & 0b111)
        var b = int((v >> 3) & 0b111)
        var c = int((v >> 0) & 0b111)
        iteration = iteration + 1
        if iteration % 1000000 == 0:
            print((Float32(iteration) * 1e9) / Float32(time.now()-start), " ops/sec")
        if op == 0:
            if reg[c] != 0:
                reg[a] = reg[b]
        elif op == 1:
            reg[a] = platters[int(reg[b])][int(reg[c])]
        elif op == 2:
            platters[int(reg[a])][int(reg[b])] = reg[c]
        elif op == 3:
            reg[a] = reg[b] + reg[c]
        elif op == 4:
            reg[a] = reg[b] * reg[c]
        elif op == 5:
            reg[a] = reg[b] / reg[c]
        elif op == 6:
            reg[a] = ~(reg[b] & reg[c])
        elif op == 8:
            var newplatter = List[UInt32]()
            newplatter.resize(int(reg[c]), 0)
            platters.append(newplatter)
            reg[b] = len(platters) - 1
        elif op == 9:
            platters[int(reg[c])].resize(0)
        elif op == 10:
            io._put(chr(int(reg[c])))
        elif op == 12:
            if reg[b] != 0:
                platters[0] = platters[int(reg[b])]
            finger = int(reg[c])
        elif op == 13:
            reg[int((v >> 25) & 0b111)] = v & 0b1111111111111111111111111
        else:
            print("unhandled opcode: " + str(op))
            break

The above runs at ~105k ops/sec, while the Go program linked above runs at ~298,040k ops/sec on my M2 Mac.

A few differences affect performance, such as the linear if/else chain versus a switch, but by a factor far smaller than 2,800x. Looking at a trace, it's clear that almost all of the time is being spent in alloc/free.

It does not matter whether reg is List[UInt32] or SIMD[DType.uint32, 8] - performance is ~the same for both.

Ultimately - a few questions:

Is this expected performance for the code as written?
Is there a way that this should be written to get better performance?
Is there a recommended way to profile and/or gain insight on where copies/frees will be generated and why?
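
As a rough aid for the last question, one possible way to see where copies happen is to put an element type with a logging copy constructor into a List and watch what an access prints. This is only a sketch, not an official profiling method: the `Tracked` struct and its field are hypothetical names, and the syntax (`inout`, `owned`, `CollectionElement`) assumes a Mojo nightly from around May 2024 and may need adjusting on other versions.

```mojo
# Hypothetical helper type: logs every time it is copied.
struct Tracked(CollectionElement):
    var value: Int

    fn __init__(inout self, value: Int):
        self.value = value

    fn __copyinit__(inout self, existing: Self):
        # Any line printed here marks a copy made by the compiler or library.
        print("copy of", existing.value)
        self.value = existing.value

    fn __moveinit__(inout self, owned existing: Self):
        # Moves are silent, so only copies show up in the output.
        self.value = existing.value


fn main():
    var items = List[Tracked]()
    items.append(Tracked(1))
    # If subscripting returns a copy on your Mojo version, a "copy of" line
    # appears before the value is printed.
    print(items[0].value)
```

With List[List[UInt32]], each such copy of an inner list would mean an allocation, a memory copy, and a free, which would be consistent with the alloc/free time seen in the trace.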

Answered by rd4com

rd4com · 2024-05-28T06:38:31Z

Hello @lukehoban, using references should make it faster! (Make sure to write some tests.)

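# nightly 2024.5.2705 (a737cd65)
# Imports carried over from the original program (needed for file.open, io._put, time.now)
import builtin.file
import builtin.io
import time

fn main() raises: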
    var program = List[UInt32]()
    var f = file.open("sandmark.umz", "r")
    while True:
        var byts = f.read_bytes(4)
        if len(byts) < 4:
            break
        var i: UInt32 = 0
        for n in range(4):
            i = (i << 8) + byts[n].cast[DType.uint8]().cast[DType.uint32]()
        program.append(i)
    var platters = List[List[UInt32]](program)
    var reg = SIMD[DType.uint32, 8](0,0,0,0,0,0,0,0)
    var finger = 0
    var iteration = 0
    var start = time.now()
    while True:
        var v = platters.__get_ref(0)[].__get_ref(finger)
        finger = finger+1
        var op = v[] >> 28
        var a = int((v[] >> 6) & 0b111)
        var b = int((v[] >> 3) & 0b111)
        var c = int((v[] >> 0) & 0b111)
        iteration = iteration + 1
        if iteration % 1000000 == 0:
            print((Float32(iteration) * 1e9) / Float32(time.now()-start), " ops/sec")
        if op == 0:
            if reg[c] != 0:
                reg[a] = reg[b]
        elif op == 1:
            reg[a] = platters.__get_ref(int(reg[b]))[].__get_ref(int(reg[c]))[]
        elif op == 2:
            platters.__get_ref(int(reg[a]))[].__get_ref(int(reg[b]))[] = reg[c]
        elif op == 3:
            reg[a] = reg[b] + reg[c]
        elif op == 4:
            reg[a] = reg[b] * reg[c]
        elif op == 5:
            reg[a] = reg[b] / reg[c]
        elif op == 6:
            reg[a] = ~(reg[b] & reg[c])
        elif op == 8:
            var newplatter = List[UInt32]()
            newplatter.resize(int(reg[c]), 0)
            platters.append(newplatter)
            reg[b] = len(platters) - 1
        elif op == 9:
            platters[int(reg[c])].resize(0)
        elif op == 10:
            io._put(chr(int(reg[c])))
        elif op == 12:
            if reg[b] != 0:
                platters[0] = platters[int(reg[b])]
            finger = int(reg[c])
        elif op == 13:
            reg[int((v[] >> 25) & 0b111)] = v[] & 0b1111111111111111111111111
        else:
            print("unhandled opcode: " + str(op))
            break
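
To isolate the change, here is a minimal sketch of the two access paths side by side. It assumes the same nightly as above (`__get_ref` comes from the code in this answer and may not exist in other Mojo versions), and it assumes that plain subscripting of the outer list yields a copy of the inner list on that nightly, which is what the alloc/free-heavy trace suggests.

```mojo
# Sketch for nightly 2024.5.2705; not verified against other Mojo versions.
fn main():
    var inner = List[UInt32]()
    inner.resize(4, 0)
    var platters = List[List[UInt32]]()
    platters.append(inner)

    # Subscript path: `platters[0]` is assumed to produce a temporary copy of
    # the inner list (allocate, copy, free) before the element is read.
    var via_subscript = platters[0][2]

    # Reference path: borrow the inner list, then the element, and
    # dereference with [] at the end. No inner list is copied.
    var via_reference = platters.__get_ref(0)[].__get_ref(2)[]

    print(via_subscript, via_reference)
```

Both reads return the same value; the difference is only in how much work happens per access, which matters when the access sits inside the interpreter's hot loop.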

0 replies

lukehoban (Author) · 2024-05-28T14:54:44Z

Thanks @rd4com. That indeed addresses the significant performance difference!

On "Is this expected performance for the code as written?": IIUC, this is the same issue that is tracked in #2432, and it will be fixed via #2847?

1 reply

rd4com Jun 20, 2024

Hello @lukehoban,

Auto-dereference removes the need for explicit dereferencing with [].

People can enjoy the benefits of references before even learning them 🤯 👍

It is a feature: you can actually choose to return either an auto-dereferenced reference or a "classic" one.
