| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 1 | Thrift Compact protocol encoding |
| 2 | ================================ |
| 3 | |
| Jens Geyer | 5767901 | 2016-09-21 22:18:44 +0200 | [diff] [blame] | 4 | <!-- |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 5 | -------------------------------------------------------------------- |
| 6 | |
| 7 | Licensed to the Apache Software Foundation (ASF) under one |
| 8 | or more contributor license agreements. See the NOTICE file |
| 9 | distributed with this work for additional information |
| 10 | regarding copyright ownership. The ASF licenses this file |
| 11 | to you under the Apache License, Version 2.0 (the |
| 12 | "License"); you may not use this file except in compliance |
| 13 | with the License. You may obtain a copy of the License at |
| 14 | |
| 15 | http://www.apache.org/licenses/LICENSE-2.0 |
| 16 | |
| 17 | Unless required by applicable law or agreed to in writing, |
| 18 | software distributed under the License is distributed on an |
| 19 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 20 | KIND, either express or implied. See the License for the |
| 21 | specific language governing permissions and limitations |
| 22 | under the License. |
| 23 | |
| 24 | -------------------------------------------------------------------- |
| Jens Geyer | 5767901 | 2016-09-21 22:18:44 +0200 | [diff] [blame] | 25 | --> |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 26 | |
| 27 | This documents describes the wire encoding for RPC using the Thrift *compact protocol*. |
| 28 | |
| 29 | The information here is _mostly_ based on the Java implementation in the Apache thrift library (version 0.9.1) and |
| 30 | [THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementation however, |
| 31 | should behave the same. |
| 32 | |
| 33 | For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache.org/static/files/thrift-20070401.pdf). |
| 34 | |
| 35 | # Contents |
| 36 | |
| 37 | * Compact protocol |
| 38 | * Base types |
| 39 | * Message |
| 40 | * Struct |
| 41 | * List and Set |
| 42 | * Map |
| 43 | * BNF notation used in this document |
| 44 | |
| 45 | # Compact protocol |
| 46 | |
| 47 | ## Base types |
| 48 | |
| 49 | ### Integer encoding |
| 50 | |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 51 | The _compact protocol_ uses [ZigZag](https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding)'ed |
| 52 | varint (aka [ULEB128](https://en.wikipedia.org/wiki/LEB128)) encoding. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 53 | |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 54 | Values of type `int8` are encoded as one byte, rest are converted to `int64`, Zigzag'ed then encoded as varint. |
| 55 | Zigzag encoding maps signed integers to another domain, one where the sign bit is encoded in the least significant |
| 56 | bit (LSB). For example 0 maps to 0, -1 to 1, 1 to 2, -2 to 3, etc. Hence the term zigzag. Mapping the sign bit to |
| 57 | the LSB is important for compactness when the absolute value of the value is small, as ULEB encoding is more |
| 58 | efficient for small values. Here are the (Scala) formulas to convert from `int64` to a zigzag `int64` and back: |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 59 | |
| 60 | ```scala |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 61 | def ToZigzag(n: Long): Long = (n << 1) ^ (n >> 63) |
| 62 | def FromZigzag(n: Long): Long = (n >>> 1) ^ - (n & 1) |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 63 | ``` |
| 64 | |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 65 | A ULEB128 is encoded 7-bits at a time, starting from the LSB. Each 7-bits are encoded as 8-bits with the top bit set |
| 66 | if this is not the last byte, unset otherwise. |
| Juan Cruz Viotti | 2e7f39f | 2021-01-20 17:05:19 -0400 | [diff] [blame] | 67 | |
| 68 | For example, the integer 50399 is encoded as follows: |
| 69 | |
| 70 | ``` |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 71 | 50399 = 11000100 11011111 (LSB) |
| 72 | = 0000011 0001001 1011111 (7-bit groups) |
| 73 | = 00000011 10001001 11011111 (add continuation bits) |
| 74 | = 0x03 0x89 0xDF (hex) |
| 75 | → 0xDF 0x89 0x03 (write to ram LSB first) |
| Juan Cruz Viotti | 2e7f39f | 2021-01-20 17:05:19 -0400 | [diff] [blame] | 76 | ``` |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 77 | |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 78 | Varints are sometimes used directly inside the compact protocol to represent positive numbers. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 79 | |
| 80 | ### Enum encoding |
| 81 | |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 82 | The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an `int32`. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 83 | |
| 84 | ### Binary encoding |
| 85 | |
| 86 | Binary is sent as follows: |
| 87 | |
| 88 | ``` |
| 89 | Binary protocol, binary data, 1+ bytes: |
| 90 | +--------+...+--------+--------+...+--------+ |
| 91 | | byte length | bytes | |
| 92 | +--------+...+--------+--------+...+--------+ |
| 93 | ``` |
| 94 | |
| 95 | Where: |
| 96 | |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 97 | * `byte length` is the length of the byte array, using varint encoding (must be >= 0). |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 98 | * `bytes` are the bytes of the byte array. |
| 99 | |
| 100 | ### String encoding |
| 101 | |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 102 | *String*s are first encoded to UTF-8, and then send as binary. They do not include a NUL delimiter. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 103 | |
| 104 | ### Double encoding |
| 105 | |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 106 | Values of type `double` are first converted to an `int64` according to the IEEE 754 floating-point "double format" |
| 107 | bit layout. Most run-times provide a library to make this conversion. But while the binary protocol encodes the |
| 108 | `int64` in 8 bytes in big endian order, the compact protocol encodes it in little endian order - this is due to an |
| 109 | early implementation bug that finally became the de-facto standard. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 110 | |
| 111 | ### Boolean encoding |
| 112 | |
| 113 | Booleans are encoded differently depending on whether it is a field value (in a struct) or an element value (in a set, |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 114 | list or map). Field values are encoded directly in the field header. Element values of type `bool` are sent as an |
| Manuel Pöter | 2c29c56 | 2024-12-13 09:34:44 +0100 | [diff] [blame] | 115 | `int8`; true as `1` and false as `2`. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 116 | |
| Triton Circonflexe | 4959a92 | 2022-06-07 21:40:41 +0200 | [diff] [blame] | 117 | ### Universal unique identifier encoding |
| 118 | |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 119 | Values of `uuid` type are expected as 16-byte binary in big endian order. Byte order conversion might be necessary on |
| 120 | certain platforms, e.g. Windows holds GUIDs in a complex record-like structure whose memory layout differs. |
| Triton Circonflexe | 4959a92 | 2022-06-07 21:40:41 +0200 | [diff] [blame] | 121 | |
| 122 | *Note*: Since the length is fixed, no `byte length` prefix is necessary and the field is always 16 bytes long. |
| 123 | |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 124 | ## Message |
| 125 | |
| 126 | A `Message` on the wire looks as follows: |
| 127 | |
| 128 | ``` |
| 129 | Compact protocol Message (4+ bytes): |
| 130 | +--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+ |
| 131 | |pppppppp|mmmvvvvv| seq id | name length | name | |
| 132 | +--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+ |
| 133 | ``` |
| 134 | |
| 135 | Where: |
| 136 | |
| 137 | * `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82. |
| 138 | * `mmm` is the message type, an unsigned 3 bit integer. |
| 139 | * `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`. |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 140 | * `seq id` is the sequence id, a signed 32 bit integer encoded as a varint. |
| 141 | * `name length` is the byte length of the name field, a signed 32 bit integer encoded as a varint (must be >= 0). |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 142 | * `name` is the method name to invoke, a UTF-8 encoded string. |
| 143 | |
| 144 | Message types are encoded with the following values: |
| 145 | |
| 146 | * _Call_: 1 |
| 147 | * _Reply_: 2 |
| 148 | * _Exception_: 3 |
| 149 | * _Oneway_: 4 |
| 150 | |
| 151 | ### Struct |
| 152 | |
| 153 | A *Struct* is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and |
| 154 | is followed by the encoded field value. The encoding can be summarized by the following BNF: |
| 155 | |
| 156 | ``` |
| 157 | struct ::= ( field-header field-value )* stop-field |
| 158 | field-header ::= field-type field-id |
| 159 | ``` |
| 160 | |
| 161 | Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any |
| 162 | order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore is also |
| 163 | possible to handle unknown fields while decoding; these are simply ignored. While decoding the field type can be used to |
| 164 | determine how to decode the field value. |
| 165 | |
| 166 | Note that the field name is not encoded so field renames in the IDL do not affect forward and backward compatibility. |
| 167 | |
| 168 | The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has |
| Klaus Trainer | e41e47c | 2017-05-17 11:11:19 +0200 | [diff] [blame] | 169 | another field-type than what is expected. Theoretically this could be detected at the cost of some additional checking. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 170 | Other implementation may perform this check and then either ignore the field, or return a protocol exception. |
| 171 | |
| 172 | A *Union* is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded. |
| 173 | |
| 174 | An *Exception* is encoded exactly the same as a struct. |
| 175 | |
| 176 | ### Struct encoding |
| 177 | |
| 178 | ``` |
| 179 | Compact protocol field header (short form) and field value: |
| 180 | +--------+--------+...+--------+ |
| 181 | |ddddtttt| field value | |
| 182 | +--------+--------+...+--------+ |
| 183 | |
| 184 | Compact protocol field header (1 to 3 bytes, long form) and field value: |
| 185 | +--------+--------+...+--------+--------+...+--------+ |
| 186 | |0000tttt| field id | field value | |
| 187 | +--------+--------+...+--------+--------+...+--------+ |
| 188 | |
| 189 | Compact protocol stop field: |
| 190 | +--------+ |
| 191 | |00000000| |
| 192 | +--------+ |
| 193 | ``` |
| 194 | |
| 195 | Where: |
| 196 | |
| 197 | * `dddd` is the field id delta, an unsigned 4 bits integer, strictly positive. |
| 198 | * `tttt` is field-type id, an unsigned 4 bit integer. |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 199 | * `field id` the field id, a varint (int16). Max field id is 32767. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 200 | * `field-value` the encoded field value. |
| 201 | |
| 202 | The field id delta can be computed by `current-field-id - previous-field-id`, or just `current-field-id` if this is the |
| 203 | first of the struct. The short form should be used when the field id delta is in the range 1 - 15 (inclusive). |
| 204 | |
| 205 | The following field-types can be encoded: |
| 206 | |
| 207 | * `BOOLEAN_TRUE`, encoded as `1` |
| 208 | * `BOOLEAN_FALSE`, encoded as `2` |
| Triton | ace8613 | 2021-07-20 08:01:19 +0200 | [diff] [blame] | 209 | * `I8`, encoded as `3` |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 210 | * `I16`, encoded as `4` |
| 211 | * `I32`, encoded as `5` |
| 212 | * `I64`, encoded as `6` |
| 213 | * `DOUBLE`, encoded as `7` |
| 214 | * `BINARY`, used for binary and string fields, encoded as `8` |
| 215 | * `LIST`, encoded as `9` |
| 216 | * `SET`, encoded as `10` |
| 217 | * `MAP`, encoded as `11` |
| 218 | * `STRUCT`, used for both structs and union fields, encoded as `12` |
| Triton Circonflexe | 4959a92 | 2022-06-07 21:40:41 +0200 | [diff] [blame] | 219 | * `UUID`, encoded as `13` |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 220 | |
| 221 | Note that because there are 2 specific field types for the boolean values, the encoding of a boolean field value has no |
| 222 | length (0 bytes). |
| 223 | |
| 224 | ## List and Set |
| 225 | |
| 226 | List and sets are encoded the same: a header indicating the size and the element-type of the elements, followed by the |
| 227 | encoded elements. |
| 228 | |
| 229 | ``` |
| 230 | Compact protocol list header (1 byte, short form) and elements: |
| 231 | +--------+--------+...+--------+ |
| 232 | |sssstttt| elements | |
| 233 | +--------+--------+...+--------+ |
| 234 | |
| 235 | Compact protocol list header (2+ bytes, long form) and elements: |
| 236 | +--------+--------+...+--------+--------+...+--------+ |
| 237 | |1111tttt| size | elements | |
| 238 | +--------+--------+...+--------+--------+...+--------+ |
| 239 | ``` |
| 240 | |
| 241 | Where: |
| 242 | |
| 243 | * `ssss` is the size, 4 bit unsigned int, values `0` - `14` |
| 244 | * `tttt` is the element-type, a 4 bit unsigned int |
| Alkis Evlogimenos | 2bbaaa8 | 2024-05-27 16:16:02 +0000 | [diff] [blame] | 245 | * `size` is the size, a varint (int32), positive values `15` or higher |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 246 | * `elements` are the encoded elements |
| 247 | |
| 248 | The short form should be used when the length is in the range 0 - 14 (inclusive). |
| 249 | |
| Manuel Pöter | 2c29c56 | 2024-12-13 09:34:44 +0100 | [diff] [blame] | 250 | The following element-types are used (see note 1 below): |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 251 | |
| Jens Geyer | 645467e | 2024-12-13 23:58:01 +0100 | [diff] [blame] | 252 | * `BOOL`, encoded as `1` or `2` (see note 2 below) |
| Triton | ace8613 | 2021-07-20 08:01:19 +0200 | [diff] [blame] | 253 | * `I8`, encoded as `3` |
| 254 | * `I16`, encoded as `4` |
| 255 | * `I32`, encoded as `5` |
| 256 | * `I64`, encoded as `6` |
| 257 | * `DOUBLE`, encoded as `7` |
| 258 | * `BINARY`, used for binary and string fields, encoded as `8` |
| 259 | * `LIST`, encoded as `9` |
| 260 | * `SET`, encoded as `10` |
| 261 | * `MAP`, encoded as `11` |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 262 | * `STRUCT`, used for structs and union fields, encoded as `12` |
| Triton Circonflexe | 4959a92 | 2022-06-07 21:40:41 +0200 | [diff] [blame] | 263 | * `UUID`, encoded as `13` |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 264 | |
| Manuel Pöter | 2c29c56 | 2024-12-13 09:34:44 +0100 | [diff] [blame] | 265 | *Note*: |
| 266 | 1. Although field-types and element-types lists are currently very similar, there is _no guarantee_ that this will |
| Triton | ace8613 | 2021-07-20 08:01:19 +0200 | [diff] [blame] | 267 | remain true after new types are added. |
| Manuel Pöter | 2c29c56 | 2024-12-13 09:34:44 +0100 | [diff] [blame] | 268 | 2. For historical and compatibility reasons, a reader should be capable to deal with *both* cases. |
| Jens Geyer | 645467e | 2024-12-13 23:58:01 +0100 | [diff] [blame] | 269 | The only valid value in the original spec was `2`, but due to an widespread implementation bug the defacto |
| 270 | standard across large parts of the library became `1` instead. As a result, both values are now allowed. |
| Erik van Oosten | 3f5fa5f | 2016-06-29 13:24:00 +0200 | [diff] [blame] | 271 | |
| 272 | The maximum list/set size is configurable. By default there is no limit (meaning the limit is the maximum int32 value: |
| 273 | 2147483647). |
| 274 | |
| 275 | ## Map |
| 276 | |
| 277 | Maps are encoded with a header indicating the size, the type of the keys and the element-type of the elements, followed |
| 278 | by the encoded elements. The encoding follows this BNF: |
| 279 | |
| 280 | ``` |
| 281 | map ::= empty-map | non-empty-map |
| 282 | empty-map ::= `0` |
| 283 | non-empty-map ::= size key-element-type value-element-type (key value)+ |
| 284 | ``` |
| 285 | |
| 286 | ``` |
| 287 | Compact protocol map header (1 byte, empty map): |
| 288 | +--------+ |
| 289 | |00000000| |
| 290 | +--------+ |
| 291 | |
| 292 | Compact protocol map header (2+ bytes, non empty map) and key value pairs: |
| 293 | +--------+...+--------+--------+--------+...+--------+ |
| 294 | | size |kkkkvvvv| key value pairs | |
| 295 | +--------+...+--------+--------+--------+...+--------+ |
| 296 | ``` |
| 297 | |
| 298 | Where: |
| 299 | |
| 300 | * `size` is the size, a var int (int32), strictly positive values |
| 301 | * `kkkk` is the key element-type, a 4 bit unsigned int |
| 302 | * `vvvv` is the value element-type, a 4 bit unsigned int |
| 303 | * `key value pairs` are the encoded keys and values |
| 304 | |
| 305 | The element-types are the same as for lists. The full list is included in the 'List and set' section. |
| 306 | |
| 307 | The maximum map size is configurable. By default there is no limit (meaning the limit is the maximum int32 value: |
| 308 | 2147483647). |
| 309 | |
| 310 | # BNF notation used in this document |
| 311 | |
| 312 | The following BNF notation is used: |
| 313 | |
| 314 | * a plus `+` appended to an item represents repetition; the item is repeated 1 or more times |
| 315 | * a star `*` appended to an item represents optional repetition; the item is repeated 0 or more times |
| 316 | * a pipe `|` between items represents choice, the first matching item is selected |
| 317 | * parenthesis `(` and `)` are used for grouping multiple items |