Integer Compression

Traditional Database Management Systems use fixed-width integers of 32 or 64 bits whose width is chosen by the database operator.

MemCP, however, uses a different approach: in a data analysis pass, we detect the bit width actually required by the integers that are to be encoded into our database.

This is done by the following steps (a sketch in Go follows the list):

  • finding the smallest integer in a shard
  • finding the highest integer
  • offset = smallest
  • bitwidth = ⌈log₂(highest - smallest + 1)⌉
  • allocate ⌈(num_items * bitwidth) / 64⌉ 64-bit integers
  • each value (minus the offset) is stored at bit position item_id * bitwidth inside that memory blob

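A minimal Go sketch of this scheme; the packedColumn type and its functions are illustrative names for this page, not MemCP's actual code:

<syntaxhighlight lang="go">
package main

import (
	"fmt"
	"math/bits"
)

// packedColumn is an illustrative bit-packed column: a sketch of the scheme
// described above, not MemCP's actual implementation.
type packedColumn struct {
	offset   int64    // smallest value in the shard
	bitwidth uint     // bits per stored value
	words    []uint64 // ceil(count * bitwidth / 64) 64-bit words
	count    int
}

// compress assumes len(values) > 0.
func compress(values []int64) *packedColumn {
	// find the smallest and the highest integer in the shard
	smallest, highest := values[0], values[0]
	for _, v := range values {
		if v < smallest {
			smallest = v
		}
		if v > highest {
			highest = v
		}
	}
	// bitwidth = ceil(log2(highest - smallest + 1));
	// bits.Len64(x) returns exactly that for x = highest - smallest
	bitwidth := uint(bits.Len64(uint64(highest - smallest)))
	// allocate ceil(num_items * bitwidth / 64) 64-bit integers
	nbits := uint(len(values)) * bitwidth
	col := &packedColumn{
		offset:   smallest,
		bitwidth: bitwidth,
		words:    make([]uint64, (nbits+63)/64),
		count:    len(values),
	}
	// store each value (minus the offset) at bit position item_id * bitwidth
	for i, v := range values {
		col.set(uint(i), uint64(v-smallest))
	}
	return col
}

// set writes a bitwidth-sized value at bit offset id*bitwidth, handling
// values that straddle a 64-bit word boundary.
func (c *packedColumn) set(id uint, v uint64) {
	if c.bitwidth == 0 {
		return // all values are equal; only the offset is stored
	}
	pos := id * c.bitwidth
	word, shift := pos/64, pos%64
	c.words[word] |= v << shift
	if shift+c.bitwidth > 64 {
		c.words[word+1] |= v >> (64 - shift)
	}
}

// get reads the value for item id and re-applies the offset.
func (c *packedColumn) get(id uint) int64 {
	if c.bitwidth == 0 {
		return c.offset
	}
	pos := id * c.bitwidth
	word, shift := pos/64, pos%64
	v := c.words[word] >> shift
	if shift+c.bitwidth > 64 {
		v |= c.words[word+1] << (64 - shift)
	}
	mask := uint64(1)<<c.bitwidth - 1
	return int64(v&mask) + c.offset
}

func main() {
	col := compress([]int64{100, 105, 103, 101})
	fmt.Println(col.bitwidth, col.get(2)) // prints: 3 103
}
</syntaxhighlight>
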
The analysis and compression are done in the ColumnStorage interface.
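
The actual interface is defined in MemCP's Go sources. Purely to illustrate the two-pass design (analysis first, then compression), such an abstraction could look roughly like the following sketch; the package and method names here are assumptions, not MemCP's API:

<syntaxhighlight lang="go">
// Package name and method names are illustrative; this is not MemCP's actual API.
package storage

// ColumnStorage sketches a column-storage abstraction with a separate
// analysis pass (Scan) before the compressed representation is built (Finish).
type ColumnStorage interface {
	// Scan observes one value during the analysis pass so that min/max
	// (and therefore offset and bitwidth) can be determined up front.
	Scan(value int64)
	// Finish allocates the compressed representation for numItems values.
	Finish(numItems int)
	// SetValue and GetValue write and read values by item position.
	SetValue(itemID int, value int64)
	GetValue(itemID int) int64
}
</syntaxhighlight>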

Integer compression can lead to substantial memory savings:

  • Binary values take up 1 bit instead of 8 or 32 bits per value
  • through the offset handling, unix timestamps whose shard-local range spans less than 2^23 seconds (about 97 days) can be encoded in 23 bits instead of 32 bits (see the example after this list)
  • IDs and other foreign keys can be encoded in ⌈log₂(highest_ID + 1)⌉ bits per item
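
As a quick check of the timestamp figure, here is a small self-contained example; the concrete timestamps are made up, only the 2^23-second range matters:

<syntaxhighlight lang="go">
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	// Hypothetical shard: unix timestamps spanning just under 2^23 seconds
	// (about 97 days). The absolute values are invented for the example.
	smallest := int64(1700000000)
	highest := smallest + (1 << 23) - 1
	// With the offset subtracted, bits.Len64 gives ceil(log2(range + 1)).
	bitwidth := bits.Len64(uint64(highest - smallest))
	fmt.Println(bitwidth) // prints: 23
}
</syntaxhighlight>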

Besides the memory savings, this compression can lead to massive speedups in scenarios where memory bandwidth or latency is the bottleneck.

See also

https://github.com/lemire/FastPFor