Examples for loading / parsing file formats?

Hi, I’m looking for examples of how to build a parsing/loading function for an arbitrary/custom/third-party binary file format.

As a specific example, I want to make a loader for Niantic’s SPZ format for Gaussian Splats. Looking at it, I’d start with something like this, as the files are compressed:

import "core:bytes"
import "core:compress/gzip"

Packed_Gaussians_Header :: struct { /*...*/ }

parse :: proc(filepath: string) -> Gaussian_Cloud {
    buf: bytes.Buffer
    gzip.load(filepath, &buf)
    // TODO: how to get header data out of decompressed buf?
    // Like, is there something like:
    // header := read_type(&buf, Packed_Gaussians_Header)
    // TODO: parse packed gaussians and return parsed data
}

How do I best get data out of buf?

  • Do I mess with / track an index into buf.buf directly (using transmute, etc.)?
  • Can I use bytes.peek_data() somehow? Seems like there is no way right now to get at the compress.Context_Memory_Input that is created in gzip.load.

What is the recommended way to go about something like this? Are there any file/format parsing/loading examples?

The decompression will already be done, and buf will contain the decompressed data; you don’t need anything from the decompression context. Don’t forget to call bytes.buffer_destroy on buf when you’re done, as it will have allocated space for the decompressed data.
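Roughly, the load/cleanup part could look like this; it’s just a sketch reusing the types from your snippet (the (cloud, ok) return is my own placeholder for error handling):

import "core:bytes"
import "core:compress/gzip"

parse :: proc(filepath: string) -> (cloud: Gaussian_Cloud, ok: bool) {
    buf: bytes.Buffer
    defer bytes.buffer_destroy(&buf) // frees the space allocated for the decompressed data

    if gzip.load(filepath, &buf) != nil {
        return // decompression failed
    }

    // ... read the header and packed gaussians out of buf here ...
    return cloud, true
}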

For accessing the data, you can just pick at buf.buf directly and track the offset yourself if you want to. core:encoding/hxa works using a buffer and offset-tracking.
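A rough sketch of the offset-tracking approach, treating your Packed_Gaussians_Header as a placeholder (this assumes its declaration matches the on-disk layout, e.g. it is #packed):

import "core:bytes"

read_header :: proc(buf: ^bytes.Buffer) -> (header: Packed_Gaussians_Header, rest: []byte, ok: bool) {
    data := buf.buf[:]
    if len(data) < size_of(Packed_Gaussians_Header) {
        return // not enough data for a header
    }

    // Reinterpret the first bytes as the header struct.
    header = (cast(^Packed_Gaussians_Header)raw_data(data))^

    // Track the offset yourself for whatever follows.
    offset := size_of(Packed_Gaussians_Header)
    rest = data[offset:] // the packed gaussians live here
    return header, rest, true
}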

An alternative to tracking positions would be to use the bytes.buffer_* functions (e.g. buffer_read or buffer_next). The buffer tracks its own offset internally, so you can use it like a stream. Normally I’d expect the buffer’s offset to be at the end where it was writing, in which case you’d seek back to the beginning with bytes.buffer_seek(&buf, 0, .Start) before reading, but it looks like the buffer will already be set up with the cursor at the start, so you can probably just start reading.
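Something like this, again with your header struct as a placeholder that matches the file layout:

import "core:bytes"
import "core:mem"

read_with_buffer :: proc(buf: ^bytes.Buffer) -> (header: Packed_Gaussians_Header, ok: bool) {
    // The read cursor should already be at the start here (see above).

    // Read the raw header bytes straight into the struct.
    n, err := bytes.buffer_read(buf, mem.ptr_to_bytes(&header))
    if err != .None || n != size_of(Packed_Gaussians_Header) {
        return
    }

    // Follow-up reads continue from the buffer's internal cursor,
    // so the packed gaussian data would be read next in the same way.
    return header, true
}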

You can also wrap it in an io.Stream with bytes.buffer_to_stream and then use the read functions from core:io, if you want. That’s just a wrapper around the same buffer_* functions, so the same note about the cursor position applies. core:encoding/cbor works using a stream (io.Reader, technically, but it’s the same thing).
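For example (same placeholder header as before):

import "core:bytes"
import "core:io"
import "core:mem"

read_with_stream :: proc(buf: ^bytes.Buffer) -> (header: Packed_Gaussians_Header, ok: bool) {
    r := bytes.buffer_to_stream(buf)

    // read_full only succeeds if it can fill the whole slice.
    if _, err := io.read_full(r, mem.ptr_to_bytes(&header)); err != .None {
        return
    }
    return header, true
}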

Thanks for the pointers. I’ll look more at those two packages, but I think I’ve already seen a few bits that will help me along.

Maybe another question: is it generally more common / the “new way” to wrap things in an io.Stream when parsing? I think I saw a comment somewhere in core:compress/zlib about implementing a version with streams in the future, which made me think that could be the case. Or is it still very much up to the individual?

It’s mostly a matter of the flexibility you need, I’d say. io.Stream is useful in that it’s generic and can stream directly to/from a variety of targets (memory, a file, the network, etc.), so you don’t need to load everything into memory first. That’s particularly useful for compressed data, since it can be large and requiring all of the compressed data (or worse, the uncompressed data) to sit in memory isn’t always practical. It can also be useful for network data, as it lets you start processing before all of it has arrived.

Using io.Stream also adds a bit of complication, particularly because every operation can fail. In many circumstances, having all the data in memory isn’t really an issue, so it’s a trade-off; use whatever is most appropriate for your use case.
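To illustrate the failure side, a small sketch of a parse helper that just forwards the io.Error with or_return (the header type is again a placeholder):

import "core:io"
import "core:mem"

parse_header :: proc(r: io.Reader) -> (header: Packed_Gaussians_Header, err: io.Error) {
    // Any read can fail (EOF, short read, ...), so propagate the error to the caller.
    io.read_full(r, mem.ptr_to_bytes(&header)) or_return
    return
}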

I’m not too experienced in Odin myself, but here’s how I wrote my simple glTF (GLB, to be precise) loader:

load_glb :: proc(data: []byte) -> (mesh: Mesh, err: B3DM_Error) {
	JSON := [4]byte{'J', 'S', 'O', 'N'}
	 BIN := [4]byte{'B', 'I', 'N', 0}

	GlobalHeader :: struct { magic, version, length: u32 }
	gh := cast(^GlobalHeader)raw_data(data)
	if gh.magic != 1179937895      { return mesh, .Bad_File }            // "glTF" as a little-endian u32
	if gh.version != 2             { return mesh, .Unsupported_Feature }
	if gh.length != u32(len(data)) { return mesh, .Bad_File }

	ChunkHeader :: struct { length: u32, type: [4]byte }
	ch0 := cast(^ChunkHeader)raw_data(data[size_of(GlobalHeader):])
	if mem.compare(ch0.type[:], JSON[:]) != 0 { return mesh, .Bad_File }
	cd0 := data[size_of(GlobalHeader) + size_of(ChunkHeader):][:ch0.length]

	ch1 := cast(^ChunkHeader)raw_data(data[size_of(GlobalHeader) + size_of(ChunkHeader) + ch0.length:])
	if mem.compare(ch1.type[:], BIN[:]) != 0 { return mesh, .Bad_File }
	cd1 := data[size_of(GlobalHeader) + size_of(ChunkHeader) + ch0.length + size_of(ChunkHeader):][:ch1.length]

	glb: GLB
	json.unmarshal(cd0, &glb) or_return

	if len(glb.meshes) != 1               { return mesh, .Unsupported_Feature }
	if len(glb.meshes[0].primitives) != 1 { return mesh, .Unsupported_Feature }
	m0 := glb.meshes[0].primitives[0]
....

I’m defining a struct and then casting parts of the buffer into the struct. The reason the procedure takes a slice is that the GLB is embedded in a b3dm file, which is parsed in a similar way:


load_b3dm :: proc(path: string) -> (mesh: TileMesh, err: B3DM_Error) {
	Header :: struct {
		magic, version, byte_len: u32,
		feature_table_json_len, feature_table_bin_len: u32,
		batch_table_json_len, batch_table_bin_len: u32,
	}
	Feature_Table :: struct { RTC_CENTER: [3]f64 }

	data := os.read_entire_file_or_err(path) or_return

	if len(data) < size_of(Header) { return mesh, .File_Too_Small }
	h := cast(^Header)raw_data(data)
	if h.magic != 1835283298         { return mesh, .Bad_File }            // "b3dm" as a little-endian u32
	if h.version != 1                { return mesh, .Unsupported_Feature }
	if h.byte_len != u32(len(data))  { return mesh, .Bad_File }
	if h.feature_table_json_len == 0 { return mesh, .Unsupported_Feature }
	if h.feature_table_bin_len != 0  { return mesh, .Unsupported_Feature }
	if h.batch_table_json_len != 0   { return mesh, .Unsupported_Feature }
	if h.batch_table_bin_len != 0    { return mesh, .Unsupported_Feature }

	ft: Feature_Table
	json.unmarshal(data[size_of(Header):][:h.feature_table_json_len], &ft) or_return
	mesh.rtc_center = ft.RTC_CENTER

	mesh.mesh = load_glb(data[size_of(Header) + h.feature_table_json_len:]) or_return

	delete(data)
	return
}

Btw, may I ask what you are building? I’m working on many large-scale photogrammetry projects (scene reconstruction, object recognition), and Gaussian splats are something I’ve been meaning to get into, but I never had the time.

Thanks for these examples, they are really helpful.

Often, it’s not that I don’t know how to achieve a certain goal, but rather that I struggle with awareness of the tools already available and with knowing when to choose which. So again, thanks for the examples; they help fill this gap for me.

For now I’m just playing around, trying to learn about splats, and maybe write a simple renderer for them with raylib. Depending on how it goes, I might use that knowledge in my day job (AR/VR, mostly Unity though).

Look here: Odin/core/encoding at master · odin-lang/Odin · GitHub

Also I have made an ELF parser here: https://git.sr.ht/~slendi/elf_dwarf_parser/blob/master/elf/elf.odin