| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 1 | Thrift Compact protocol encoding |
| 2 | ================================ |
| 3 | |
| Jens Geyer | 5767901 | 2016-09-21 22:18:44 +0200 | [diff] [blame] | 4 | <!-- |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 5 | -------------------------------------------------------------------- |
| 6 | |
| 7 | Licensed to the Apache Software Foundation (ASF) under one |
| 8 | or more contributor license agreements. See the NOTICE file |
| 9 | distributed with this work for additional information |
| 10 | regarding copyright ownership. The ASF licenses this file |
| 11 | to you under the Apache License, Version 2.0 (the |
| 12 | "License"); you may not use this file except in compliance |
| 13 | with the License. You may obtain a copy of the License at |
| 14 | |
| 15 | http://www.apache.org/licenses/LICENSE-2.0 |
| 16 | |
| 17 | Unless required by applicable law or agreed to in writing, |
| 18 | software distributed under the License is distributed on an |
| 19 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 20 | KIND, either express or implied. See the License for the |
| 21 | specific language governing permissions and limitations |
| 22 | under the License. |
| 23 | |
| 24 | -------------------------------------------------------------------- |
| Jens Geyer | 5767901 | 2016-09-21 22:18:44 +0200 | [diff] [blame] | 25 | --> |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 26 | |
| 27 | This documents describes the wire encoding for RPC using the Thrift *compact protocol*. |
| 28 | |
| 29 | The information here is _mostly_ based on the Java implementation in the Apache thrift library (version 0.9.1) and |
| 30 | [THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementation however, |
| 31 | should behave the same. |
| 32 | |
| 33 | For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache.org/static/files/thrift-20070401.pdf). |
| 34 | |
| 35 | # Contents |
| 36 | |
| 37 | * Compact protocol |
| 38 | * Base types |
| 39 | * Message |
| 40 | * Struct |
| 41 | * List and Set |
| 42 | * Map |
| 43 | * BNF notation used in this document |
| 44 | |
| 45 | # Compact protocol |
| 46 | |
| 47 | ## Base types |
| 48 | |
| 49 | ### Integer encoding |
| 50 | |
| 51 | The _compact protocol_ uses multiple encodings for ints: the _zigzag int_, and the _var int_. |
| 52 | |
| 53 | Values of type `int32` and `int64` are first transformed to a *zigzag int*. A zigzag int folds positive and negative |
| 54 | numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from the wire, this is translated to 0, -1, 1, |
| 55 | -2 or 2 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back: |
| 56 | |
| 57 | ```scala |
| 58 | def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31) |
| 59 | def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1) |
| 60 | def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63) |
| 61 | def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1) |
| 62 | ``` |
| 63 | |
| Juan Cruz Viotti | 2e7f39f | 2021-01-20 17:05:19 -0400 | [diff] [blame^] | 64 | The zigzag int is then encoded as a *var int*, also known as *Unsigned LEB128*. Var ints take 1 to 5 bytes (int32) or |
| 65 | 1 to 10 bytes (int64). The process consists in taking a Big Endian unsigned integer, left-padding the bit-string to |
| 66 | make it a multiple of 7 bits, splitting it into 7-bit groups, prefixing the most-significant 7-bit group with the 0 |
| 67 | bit, prefixing the remaining 7-bit groups with the 1 bit and encoding the resulting bit-string in Little Endian. |
| 68 | |
| 69 | For example, the integer 50399 is encoded as follows: |
| 70 | |
| 71 | ``` |
| 72 | 50399 = 1100 0100 1101 1111 (Big Endian representation) |
| 73 | = 00000 1100 0100 1101 1111 (Left-padding) |
| 74 | = 0000011 0001001 1011111 (7-bit groups) |
| 75 | = 00000011 10001001 11011111 (Most-significant bit prefixes) |
| 76 | = 11011111 10001001 00000011 (Little Endian representation) |
| 77 | = 0xDF 0x89 0x03 |
| 78 | ``` |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 79 | |
| 80 | Var ints are sometimes used directly inside the compact protocol to represent positive numbers. |
| 81 | |
| 82 | To encode an `int16` as zigzag int, it is first converted to an `int32` and then encoded as such. The type `int8` simply |
| 83 | uses a single byte as in the binary protocol. |
| 84 | |
| 85 | ### Enum encoding |
| 86 | |
| 87 | The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an int32. |
| 88 | |
| 89 | ### Binary encoding |
| 90 | |
| 91 | Binary is sent as follows: |
| 92 | |
| 93 | ``` |
| 94 | Binary protocol, binary data, 1+ bytes: |
| 95 | +--------+...+--------+--------+...+--------+ |
| 96 | | byte length | bytes | |
| 97 | +--------+...+--------+--------+...+--------+ |
| 98 | ``` |
| 99 | |
| 100 | Where: |
| 101 | |
| 102 | * `byte length` is the length of the byte array, using var int encoding (must be >= 0). |
| 103 | * `bytes` are the bytes of the byte array. |
| 104 | |
| 105 | ### String encoding |
| 106 | |
| Juan Cruz Viotti | 47b3d3b | 2021-01-21 12:22:47 -0400 | [diff] [blame] | 107 | *String*s are first encoded to UTF-8, and then send as binary. They do not |
| 108 | include a NUL delimiter. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 109 | |
| 110 | ### Double encoding |
| 111 | |
| 112 | Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit |
| Jens Geyer | 450bc69 | 2019-12-03 23:28:03 +0100 | [diff] [blame] | 113 | layout. Most run-times provide a library to make this conversion. But while the binary protocol encodes the int64 |
| 114 | in 8 bytes in big endian order, the compact protocol encodes it in little endian order - this is due to an early |
| 115 | implementation bug that finally became the de-facto standard. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 116 | |
| 117 | ### Boolean encoding |
| 118 | |
| 119 | Booleans are encoded differently depending on whether it is a field value (in a struct) or an element value (in a set, |
| 120 | list or map). Field values are encoded directly in the field header. Element values of type `bool` are sent as an int8; |
| 121 | true as `1` and false as `0`. |
| 122 | |
| 123 | ## Message |
| 124 | |
| 125 | A `Message` on the wire looks as follows: |
| 126 | |
| 127 | ``` |
| 128 | Compact protocol Message (4+ bytes): |
| 129 | +--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+ |
| 130 | |pppppppp|mmmvvvvv| seq id | name length | name | |
| 131 | +--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+ |
| 132 | ``` |
| 133 | |
| 134 | Where: |
| 135 | |
| 136 | * `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82. |
| 137 | * `mmm` is the message type, an unsigned 3 bit integer. |
| 138 | * `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`. |
| 139 | * `seq id` is the sequence id, a signed 32 bit integer encoded as a var int. |
| 140 | * `name length` is the byte length of the name field, a signed 32 bit integer encoded as a var int (must be >= 0). |
| 141 | * `name` is the method name to invoke, a UTF-8 encoded string. |
| 142 | |
| 143 | Message types are encoded with the following values: |
| 144 | |
| 145 | * _Call_: 1 |
| 146 | * _Reply_: 2 |
| 147 | * _Exception_: 3 |
| 148 | * _Oneway_: 4 |
| 149 | |
| 150 | ### Struct |
| 151 | |
| 152 | A *Struct* is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and |
| 153 | is followed by the encoded field value. The encoding can be summarized by the following BNF: |
| 154 | |
| 155 | ``` |
| 156 | struct ::= ( field-header field-value )* stop-field |
| 157 | field-header ::= field-type field-id |
| 158 | ``` |
| 159 | |
| 160 | Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any |
| 161 | order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore is also |
| 162 | possible to handle unknown fields while decoding; these are simply ignored. While decoding the field type can be used to |
| 163 | determine how to decode the field value. |
| 164 | |
| 165 | Note that the field name is not encoded so field renames in the IDL do not affect forward and backward compatibility. |
| 166 | |
| 167 | The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has |
| Klaus Trainer | e41e47c | 2017-05-17 11:11:19 +0200 | [diff] [blame] | 168 | another field-type than what is expected. Theoretically this could be detected at the cost of some additional checking. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 169 | Other implementation may perform this check and then either ignore the field, or return a protocol exception. |
| 170 | |
| 171 | A *Union* is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded. |
| 172 | |
| 173 | An *Exception* is encoded exactly the same as a struct. |
| 174 | |
| 175 | ### Struct encoding |
| 176 | |
| 177 | ``` |
| 178 | Compact protocol field header (short form) and field value: |
| 179 | +--------+--------+...+--------+ |
| 180 | |ddddtttt| field value | |
| 181 | +--------+--------+...+--------+ |
| 182 | |
| 183 | Compact protocol field header (1 to 3 bytes, long form) and field value: |
| 184 | +--------+--------+...+--------+--------+...+--------+ |
| 185 | |0000tttt| field id | field value | |
| 186 | +--------+--------+...+--------+--------+...+--------+ |
| 187 | |
| 188 | Compact protocol stop field: |
| 189 | +--------+ |
| 190 | |00000000| |
| 191 | +--------+ |
| 192 | ``` |
| 193 | |
| 194 | Where: |
| 195 | |
| 196 | * `dddd` is the field id delta, an unsigned 4 bits integer, strictly positive. |
| 197 | * `tttt` is field-type id, an unsigned 4 bit integer. |
| 198 | * `field id` the field id, a signed 16 bit integer encoded as zigzag int. |
| 199 | * `field-value` the encoded field value. |
| 200 | |
| 201 | The field id delta can be computed by `current-field-id - previous-field-id`, or just `current-field-id` if this is the |
| 202 | first of the struct. The short form should be used when the field id delta is in the range 1 - 15 (inclusive). |
| 203 | |
| 204 | The following field-types can be encoded: |
| 205 | |
| 206 | * `BOOLEAN_TRUE`, encoded as `1` |
| 207 | * `BOOLEAN_FALSE`, encoded as `2` |
| 208 | * `BYTE`, encoded as `3` |
| 209 | * `I16`, encoded as `4` |
| 210 | * `I32`, encoded as `5` |
| 211 | * `I64`, encoded as `6` |
| 212 | * `DOUBLE`, encoded as `7` |
| 213 | * `BINARY`, used for binary and string fields, encoded as `8` |
| 214 | * `LIST`, encoded as `9` |
| 215 | * `SET`, encoded as `10` |
| 216 | * `MAP`, encoded as `11` |
| 217 | * `STRUCT`, used for both structs and union fields, encoded as `12` |
| 218 | |
| 219 | Note that because there are 2 specific field types for the boolean values, the encoding of a boolean field value has no |
| 220 | length (0 bytes). |
| 221 | |
| 222 | ## List and Set |
| 223 | |
| 224 | List and sets are encoded the same: a header indicating the size and the element-type of the elements, followed by the |
| 225 | encoded elements. |
| 226 | |
| 227 | ``` |
| 228 | Compact protocol list header (1 byte, short form) and elements: |
| 229 | +--------+--------+...+--------+ |
| 230 | |sssstttt| elements | |
| 231 | +--------+--------+...+--------+ |
| 232 | |
| 233 | Compact protocol list header (2+ bytes, long form) and elements: |
| 234 | +--------+--------+...+--------+--------+...+--------+ |
| 235 | |1111tttt| size | elements | |
| 236 | +--------+--------+...+--------+--------+...+--------+ |
| 237 | ``` |
| 238 | |
| 239 | Where: |
| 240 | |
| 241 | * `ssss` is the size, 4 bit unsigned int, values `0` - `14` |
| 242 | * `tttt` is the element-type, a 4 bit unsigned int |
| 243 | * `size` is the size, a var int (int32), positive values `15` or higher |
| 244 | * `elements` are the encoded elements |
| 245 | |
| 246 | The short form should be used when the length is in the range 0 - 14 (inclusive). |
| 247 | |
| 248 | The following element-types are used (note that these are _different_ from the field-types): |
| 249 | |
| 250 | * `BOOL`, encoded as `2` |
| 251 | * `BYTE`, encoded as `3` |
| 252 | * `DOUBLE`, encoded as `4` |
| 253 | * `I16`, encoded as `6` |
| 254 | * `I32`, encoded as `8` |
| 255 | * `I64`, encoded as `10` |
| 256 | * `STRING`, used for binary and string fields, encoded as `11` |
| 257 | * `STRUCT`, used for structs and union fields, encoded as `12` |
| 258 | * `MAP`, encoded as `13` |
| 259 | * `SET`, encoded as `14` |
| 260 | * `LIST`, encoded as `15` |
| 261 | |
| 262 | |
| 263 | The maximum list/set size is configurable. By default there is no limit (meaning the limit is the maximum int32 value: |
| 264 | 2147483647). |
| 265 | |
| 266 | ## Map |
| 267 | |
| 268 | Maps are encoded with a header indicating the size, the type of the keys and the element-type of the elements, followed |
| 269 | by the encoded elements. The encoding follows this BNF: |
| 270 | |
| 271 | ``` |
| 272 | map ::= empty-map | non-empty-map |
| 273 | empty-map ::= `0` |
| 274 | non-empty-map ::= size key-element-type value-element-type (key value)+ |
| 275 | ``` |
| 276 | |
| 277 | ``` |
| 278 | Compact protocol map header (1 byte, empty map): |
| 279 | +--------+ |
| 280 | |00000000| |
| 281 | +--------+ |
| 282 | |
| 283 | Compact protocol map header (2+ bytes, non empty map) and key value pairs: |
| 284 | +--------+...+--------+--------+--------+...+--------+ |
| 285 | | size |kkkkvvvv| key value pairs | |
| 286 | +--------+...+--------+--------+--------+...+--------+ |
| 287 | ``` |
| 288 | |
| 289 | Where: |
| 290 | |
| 291 | * `size` is the size, a var int (int32), strictly positive values |
| 292 | * `kkkk` is the key element-type, a 4 bit unsigned int |
| 293 | * `vvvv` is the value element-type, a 4 bit unsigned int |
| 294 | * `key value pairs` are the encoded keys and values |
| 295 | |
| 296 | The element-types are the same as for lists. The full list is included in the 'List and set' section. |
| 297 | |
| 298 | The maximum map size is configurable. By default there is no limit (meaning the limit is the maximum int32 value: |
| 299 | 2147483647). |
| 300 | |
| 301 | # BNF notation used in this document |
| 302 | |
| 303 | The following BNF notation is used: |
| 304 | |
| 305 | * a plus `+` appended to an item represents repetition; the item is repeated 1 or more times |
| 306 | * a star `*` appended to an item represents optional repetition; the item is repeated 0 or more times |
| 307 | * a pipe `|` between items represents choice, the first matching item is selected |
| 308 | * parenthesis `(` and `)` are used for grouping multiple items |