Fast shuffle of 128-bit blocks with 256-bit registers

vincent_fabro · 15 March 2025 06:00

Hello all

Starting from two 256-bit registers, I would like to take, as fast as possible, the highest 128 bits of the first register and the lowest 128 bits of the second, e.g.:

a := [2]u128{1111, 2222}
b := [2]u128{3333, 5555}
result := … // expected: [2222, 3333]

Context: porting simdjson to Odin (see simdjson/include/simdjson/haswell/simd.h at master in the simdjson/simdjson GitHub repository, the call to ‘_mm256_permute2x128_si256’).

A)

There’s the ‘_mm256_permute2x128_si256’/‘vperm2i128’ intrinsic. But I can’t seem to use it from Odin (and maybe from LLVM?):

@(require_results, enable_target_feature="avx2")
_mm256_permute2x128_si256_test :: #force_inline proc "c" (a, b: simd.u8x32, idx: u8) -> simd.u8x32 {
    return vperm2i128(a, b, idx)
}

@(private, default_calling_convention="none")
foreign _ {
    @(link_name = "llvm.x86.avx2.vperm2i128")
    vperm2i128 :: proc(a, b: simd.u8x32, idx: u8) -> simd.u8x32 ---
}

@(enable_target_feature="avx2")
main :: proc() {
    a := [2]u128{1111, 2222}
    b := [2]u128{3333, 5555}
    // miserable failure: undefined reference to `llvm.x86.avx2.vperm2i128'
    result := transmute([2]u128)_mm256_permute2x128_si256_test(transmute(simd.u8x32)a, transmute(simd.u8x32)b, 0x21)
    fmt.printfln("expect [2222, 3333]: %v", result)
}

It fails: the name ‘llvm.x86.avx2.vperm2i128’ was apparently removed a long time ago (D37892, “[X86] Use native shuffle vector for the perm2f128 intrinsics”, line 910) and replaced by the ‘shufflevector’ instruction.
I guess ‘shufflevector’ itself may be lowered to ‘vperm2i128’ under the right conditions? (llvm-project/llvm/test/CodeGen/X86/avx-vperm2x128.ll at b003face11fadc526a6f816243441f486ffc958d in llvm/llvm-project on GitHub)
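
To double-check the 0x21 selector independently of the intrinsic, the lane selection can be modelled in scalar code. This is only a sketch: permute2x128_model is a hypothetical helper, not an Odin or LLVM API, and it assumes the documented imm8 layout of vperm2i128 (bits 1:0 pick the source of the low result lane, bits 5:4 the source of the high result lane, and bits 3 and 7 zero the corresponding lane).

```odin
package main

import "core:fmt"

// Hypothetical scalar model of the vperm2i128 lane selection.
// It is not a library proc; it only mirrors the assumed imm8 layout.
permute2x128_model :: proc(a, b: [2]u128, imm: u8) -> (r: [2]u128) {
    // Source lanes in selector order: a.lo, a.hi, b.lo, b.hi.
    src := [4]u128{a[0], a[1], b[0], b[1]}

    // Low result lane: bits 1:0 select the source, bit 3 zeroes the lane.
    if (imm & 0x08) == 0 {
        r[0] = src[imm & 0x03]
    }
    // High result lane: bits 5:4 select the source, bit 7 zeroes the lane.
    if (imm & 0x80) == 0 {
        r[1] = src[(imm >> 4) & 0x03]
    }
    return
}

main :: proc() {
    a := [2]u128{1111, 2222}
    b := [2]u128{3333, 5555}
    // 0x21: low lane takes selector 1 (a.hi), high lane takes selector 2 (b.lo).
    fmt.println(permute2x128_model(a, b, 0x21))
}
```

Under this model, 0x21 picks the high half of a and then the low half of b, which is the same selection as shuffle indices 2, 3, 4, 5 over the eight 64-bit lanes of the two inputs.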

B)

There’s also ‘simd.shuffle’:

main :: proc() {
    a := transmute(simd.u64x4)[2]u128{1111, 2222}
    b := transmute(simd.u64x4)[2]u128{3333, 5555}
    result1 := simd.shuffle(a, b, 2, 3, 4, 5)
    fmt.printfln("simd.shuffle: %v", transmute([2]u128)result1)
}

It does the job, but according to ‘godbolt.org’ it apparently generates a bunch of ‘movaps’ instructions.
Note: I’m out of my depth here, mistakes are likely.

Thanks in advance for the help!

Barinzaya · 15 March 2025 17:06

When using Godbolt, don’t forget to use the right compiler flags (or proc attributes). You’ll want to check with optimization enabled and any target features you expect it to use. It absolutely will use vperm2f128 for simd.shuffle, but the default compile settings won’t use AVX/AVX2.

Funnily enough, it’s actually the same number of instructions as without AVX2, but there’ll be some overhead from adapting to the proc’s parameter/return conventions. That overhead won’t have the same impact when the shuffle is part of a larger proc.

In my experience with Godbolt, your best bet for reading the assembly is to define a proc that takes any variable inputs and returns the output. If you define your inputs as constants, LLVM will often just perform the operation at compile-time and turn the result into a constant.

Additionally, main tends to get inlined into runtime code that Godbolt won’t show, so if you use main you won’t always see it.
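
A minimal sketch of the kind of proc described above, for pasting into Godbolt. The name take_high_then_low, the "c" calling convention and the @(export) attribute are assumptions for illustration; export is there on the guess that it keeps the proc from being culled as unused, and may not be needed. Compile with optimization enabled (for example -o:speed).

```odin
package main

import "core:simd"

// Hypothetical helper: both registers arrive as parameters, so LLVM cannot
// fold the shuffle into a compile-time constant, and AVX2 is enabled for
// this proc only through the attribute.
@(export, enable_target_feature="avx2")
take_high_then_low :: proc "c" (a, b: simd.u64x4) -> simd.u64x4 {
    // Indices 0..3 address the lanes of a, 4..7 the lanes of b:
    // 2, 3 are the high 128 bits of a; 4, 5 are the low 128 bits of b.
    return simd.shuffle(a, b, 2, 3, 4, 5)
}

// Empty entry point so the file builds as a complete program; the proc of
// interest is deliberately kept out of main.
main :: proc() {}
```

Keeping the shuffle in its own proc means the generated code shows up under its own label instead of being merged into main or the runtime startup code.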

vincent_fabro · 16 March 2025 02:28

Thank you! That is invaluable info.