<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://www.memcp.org/index.php?action=history&amp;feed=atom&amp;title=Sequence_Compression</id>
	<title>Sequence Compression - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://www.memcp.org/index.php?action=history&amp;feed=atom&amp;title=Sequence_Compression"/>
	<link rel="alternate" type="text/html" href="https://www.memcp.org/index.php?title=Sequence_Compression&amp;action=history"/>
	<updated>2026-04-12T23:49:29Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://www.memcp.org/index.php?title=Sequence_Compression&amp;diff=41&amp;oldid=prev</id>
		<title>Carli: Created page with &quot;One of the most interesting compression techniques on columnar storages is &#039;&#039;&#039;sequence compression&#039;&#039;&#039;. Sequence Compression in In-Memory Database yields 99% memory savings and a total of 13%  A sequence is a column of numbers where each distance between two neighbouring numbers is equal. Example:  * &lt;code&gt;1 2 3 4 5 6 7 8 9&lt;/code&gt; * &lt;code&gt;3 3 3 3 3 3 3 3 3 3&lt;/code&gt; * &lt;code&gt;10 20 30 40&lt;/code&gt; * &lt;code&gt;8 7 6 5&lt;/code&gt;  A sequence can be described by its starting point, its st...&quot;</title>
		<link rel="alternate" type="text/html" href="https://www.memcp.org/index.php?title=Sequence_Compression&amp;diff=41&amp;oldid=prev"/>
		<updated>2024-05-17T13:04:07Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;One of the most interesting compression techniques on columnar storages is &amp;#039;&amp;#039;&amp;#039;sequence compression&amp;#039;&amp;#039;&amp;#039;. Sequence Compression in In-Memory Database yields 99% memory savings and a total of 13%  A sequence is a column of numbers where each distance between two neighbouring numbers is equal. Example:  * &amp;lt;code&amp;gt;1 2 3 4 5 6 7 8 9&amp;lt;/code&amp;gt; * &amp;lt;code&amp;gt;3 3 3 3 3 3 3 3 3 3&amp;lt;/code&amp;gt; * &amp;lt;code&amp;gt;10 20 30 40&amp;lt;/code&amp;gt; * &amp;lt;code&amp;gt;8 7 6 5&amp;lt;/code&amp;gt;  A sequence can be described by its starting point, its st...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;One of the most interesting compression techniques on columnar storages is &amp;#039;&amp;#039;&amp;#039;sequence compression&amp;#039;&amp;#039;&amp;#039;. Sequence Compression in In-Memory Database yields 99% memory savings and a total of 13%&lt;br /&gt;
&lt;br /&gt;
A sequence is a column of numbers where each distance between two neighbouring numbers is equal. Example:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;1 2 3 4 5 6 7 8 9&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;3 3 3 3 3 3 3 3 3 3&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;10 20 30 40&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;8 7 6 5&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A sequence can be described by its starting point, its stride as well as its length:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;(1,1,9)&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;(3,0,10)&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;(10,10,4)&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;(8,-1,4)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The longer the sequence, the higher the compression ratio. For a typical SQL engine workload with AUTO_INCREMENT IDs, this means that you can store 150,000 IDs by just using three numbers.&lt;br /&gt;
&lt;br /&gt;
Whenever the sequence is interrupted, a new sequence has to be described. This means the column&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;1 2 3 4 6 7 8&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will be encoded as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;(1,1,4)(6,1,3)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which is nearly the same storage-wise as the uncompressed column.&lt;br /&gt;
&lt;br /&gt;
So the compression ratio is as bigger as the the column is regular. Since we use bit compression for the stride, we can use a heuristic that less than 30% of the values are allowed to trigger a sequence-restart.&lt;br /&gt;
&lt;br /&gt;
== Tweaking the format for fast random access ==&lt;br /&gt;
To efficiently read a sequence compressed column, we have to alter the format a little bit. Instead of recording &amp;lt;code&amp;gt;(start, stride, count)&amp;lt;/code&amp;gt;, we will record &amp;lt;code&amp;gt;(recordId, start, stride)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The value &amp;lt;code&amp;gt;count&amp;lt;/code&amp;gt; is automatically derived from the difference between its successor recordId. But as you see, we won’t need it anyways. &amp;lt;code&amp;gt;recordId&amp;lt;/code&amp;gt; is a unsigned integer counting seemlessly from 0 to array_size-1. Whenever we want to find a specific value, we can use the following algorithm:&lt;br /&gt;
&lt;br /&gt;
* Given &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; is the recordId we want to read out&lt;br /&gt;
* Do a binary search through the array of sequences such that we get the lowest &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;sequence[x].recordId &amp;lt;= i&amp;lt;/code&amp;gt;&lt;br /&gt;
* Calculate the result by &amp;lt;code&amp;gt;sequence[x].start + (i - sequence[x].recordId) * sequence[x].stride&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Evaluation ==&lt;br /&gt;
In our example with OppelBI data (mostly online shop statistics), we were able to compress some integer storages by 99% which yields an overall saving of 13% of storage (our 55MB MySQL data is compressed to 15MB rather than 17MB)&lt;br /&gt;
&lt;br /&gt;
For the TPC-H benchmark, we achieved a saving of 7MB for the 1.1GB dataset (scale factor 1) which is only 0,8%. We attribute this to the randomness of TPC-H’s data. Real world ERP or shop data is much more regular and less string-heavvy than TPC-H’s benchmark data.&lt;br /&gt;
&lt;br /&gt;
== Optimized for AI workloads ==&lt;br /&gt;
Especially AI workloads like this table:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;AIInferenceMatrix(matrixId integer, column integer, row integer, value double)&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
can be sequence-compressed very efficiently if you keep the row and column values sorted. This also yields a matrix’s &amp;lt;code&amp;gt;value&amp;lt;/code&amp;gt; column memory layout that exactly matches the internal representation of the matrix that can be directly passed to TensorFlow or similar AI libraries.&lt;br /&gt;
&lt;br /&gt;
This implies that AI workloads no longer have to be stored as stupid BLOBs but rather can get a full relational storage which means easier access for partial data.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Sequence Compression is a powerful technique to further compress data in RAM. It reduces cache misses and thus fastens up random access to „boring“ data.&lt;/div&gt;</summary>
		<author><name>Carli</name></author>
	</entry>
</feed>