Integer Compression

Traditional Database Management Systems use fixed-width integers of 32 or 64 bits whose width is chosen by the database operator.

MemCP, however, uses a different approach: in a data analysis pass, we detect the bit width actually required by the integers that are to be encoded into our database.

This is done by the following steps (a sketch in Go follows the list):

  • finding the smallest integer in a shard
  • finding the highest integer
  • offset = smallest
  • bitwidth = ⌈log₂(highest - smallest + 1)⌉
  • allocate ⌈(num_items * bitwidth) / 64⌉ 64-bit integers
  • each value (minus the offset) is stored at bit position item_id * bitwidth inside that memory blob

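A minimal Go sketch of this scheme; the packedColumn type and its functions are illustrative names for this page, not MemCP's actual code:

<syntaxhighlight lang="go">
package main

import (
	"fmt"
	"math/bits"
)

// packedColumn is an illustrative bit-packed column: a sketch of the scheme
// described above, not MemCP's actual implementation.
type packedColumn struct {
	offset   int64    // smallest value in the shard
	bitwidth uint     // bits per stored value
	words    []uint64 // ceil(count * bitwidth / 64) 64-bit words
	count    int
}

// compress assumes len(values) > 0.
func compress(values []int64) *packedColumn {
	// find the smallest and the highest integer in the shard
	smallest, highest := values[0], values[0]
	for _, v := range values {
		if v < smallest {
			smallest = v
		}
		if v > highest {
			highest = v
		}
	}
	// bitwidth = ceil(log2(highest - smallest + 1));
	// bits.Len64(x) returns exactly that for x = highest - smallest
	bitwidth := uint(bits.Len64(uint64(highest - smallest)))
	// allocate ceil(num_items * bitwidth / 64) 64-bit integers
	nbits := uint(len(values)) * bitwidth
	col := &packedColumn{
		offset:   smallest,
		bitwidth: bitwidth,
		words:    make([]uint64, (nbits+63)/64),
		count:    len(values),
	}
	// store each value (minus the offset) at bit position item_id * bitwidth
	for i, v := range values {
		col.set(uint(i), uint64(v-smallest))
	}
	return col
}

// set writes a bitwidth-sized value at bit offset id*bitwidth, handling
// values that straddle a 64-bit word boundary.
func (c *packedColumn) set(id uint, v uint64) {
	if c.bitwidth == 0 {
		return // all values are equal; only the offset is stored
	}
	pos := id * c.bitwidth
	word, shift := pos/64, pos%64
	c.words[word] |= v << shift
	if shift+c.bitwidth > 64 {
		c.words[word+1] |= v >> (64 - shift)
	}
}

// get reads the value for item id and re-applies the offset.
func (c *packedColumn) get(id uint) int64 {
	if c.bitwidth == 0 {
		return c.offset
	}
	pos := id * c.bitwidth
	word, shift := pos/64, pos%64
	v := c.words[word] >> shift
	if shift+c.bitwidth > 64 {
		v |= c.words[word+1] << (64 - shift)
	}
	mask := uint64(1)<<c.bitwidth - 1
	return int64(v&mask) + c.offset
}

func main() {
	col := compress([]int64{100, 105, 103, 101})
	fmt.Println(col.bitwidth, col.get(2)) // prints: 3 103
}
</syntaxhighlight>
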
The analysis and compression are done in the ColumnStorage interface.
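
The actual interface is defined in MemCP's Go sources. Purely to illustrate the two-pass design (analysis first, then compression), such an abstraction could look roughly like the following sketch; the package and method names here are assumptions, not MemCP's API:

<syntaxhighlight lang="go">
// Package name and method names are illustrative; this is not MemCP's actual API.
package storage

// ColumnStorage sketches a column-storage abstraction with a separate
// analysis pass (Scan) before the compressed representation is built (Finish).
type ColumnStorage interface {
	// Scan observes one value during the analysis pass so that min/max
	// (and therefore offset and bitwidth) can be determined up front.
	Scan(value int64)
	// Finish allocates the compressed representation for numItems values.
	Finish(numItems int)
	// SetValue and GetValue write and read values by item position.
	SetValue(itemID int, value int64)
	GetValue(itemID int) int64
}
</syntaxhighlight>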

Integer compression can lead to substantial memory savings:

  • Binary values take up 1 bit instead of 8 or 32 bits per value
  • through the offset handling, unix timestamps whose shard-local range spans less than 2^23 seconds (about 97 days) can be encoded in 23 bits instead of 32 bits (see the example after this list)
  • IDs and other foreign keys can be encoded in ⌈log₂(highest_ID + 1)⌉ bits per item
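
As a quick check of the timestamp figure, here is a small self-contained example; the concrete timestamps are made up, only the 2^23-second range matters:

<syntaxhighlight lang="go">
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	// Hypothetical shard: unix timestamps spanning just under 2^23 seconds
	// (about 97 days). The absolute values are invented for the example.
	smallest := int64(1700000000)
	highest := smallest + (1 << 23) - 1
	// With the offset subtracted, bits.Len64 gives ceil(log2(range + 1)).
	bitwidth := bits.Len64(uint64(highest - smallest))
	fmt.Println(bitwidth) // prints: 23
}
</syntaxhighlight>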

Besides the memory savings, this compression can lead to massive speedups in scenarios where memory bandwidth or latency is the bottleneck.

See also

https://github.com/lemire/FastPFor