Dictionary Compression

From MemCP
Revision as of 21:06, 22 August 2025 by Carli (talk | contribs) (Created page with "Strings are the nightmare of every database. You never know their exact size and you have to either reference them in a separate storage or reserve enough character space to store them in-table. Columnar databases can do better. But that's not guaranteed, you have to put a bit of extra effort into it. Here's how: At first, memcp converts strings into dictionaries. If you have a large list consisting of [Male, Male, Male, Female, Male, Female, Male, Male], you only have...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Strings are the nightmare of every database. You never know their exact size and you have to either reference them in a separate storage or reserve enough character space to store them in-table.

Columnar databases can do better. But that's not guaranteed, you have to put a bit of extra effort into it. Here's how:

At first, memcp converts strings into dictionaries. If you have a large list consisting of [Male, Male, Male, Female, Male, Female, Male, Male], you only have two distinct values. So the dictionary can hold "Male,Female" and then you can reference them by integers (0,0,0,1,0,1,0,0).

Then, these integers can be Integer Compressed. And if that's not enough, you can also add Sequence Compression.

With dictionary compression, you can achieve compression ratios of 10x and higher.

Here's an example from (print (stat schema table))

Shard 6254
---
main count: 61440, delta count: 0, deletions: 0
 mode: string-dict[1 entries; 5 bytes], size = 7.833KiB
 cntin: seq[176x int[20]/int[20]], size = 1.477KiB
 cntout: seq[187x int[6]/int[7]], size = 968B
 duration: int[15], size = 112.6KiB
 filters: seq[161x int[1]/int[2]], size = 680B
 append: seq[176x int[6]/int[7]], size = 928B
 p: string-dict[3 entries; 43 bytes], size = 15.36KiB
 ---
 ---
 + insertions 40B
 + deletions 48B
 ---
= total 140KiB

As you can see, the dictionary can store massive amounts of data (61k items) in 15.4 KiB for a 3-way dictionary. That's 2 bits per item!