Thrift Binary protocol encoding
===============================

<!--
--------------------------------------------------------------------

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

--------------------------------------------------------------------
-->

This document describes the wire encoding for RPC using the older Thrift *binary protocol*.

The information here is _mostly_ based on the Java implementation in the Apache Thrift library (versions 0.9.1 and
0.9.3). Other implementations, however, should behave the same.

For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache.org/static/files/thrift-20070401.pdf).

# Contents

* Binary protocol
  * Base types
  * Message
  * Struct
  * List and Set
  * Map
* BNF notation used in this document

# Binary protocol

## Base types

### Integer encoding

In the _binary protocol_ integers are encoded with the most significant byte first (big endian byte order, aka network
order). An `int8` needs 1 byte, an `int16` 2, an `int32` 4 and an `int64` needs 8 bytes.

The C++ implementation has the option to use the binary protocol with little endian order. Little endian gives a small but
noticeable performance boost because most contemporary CPUs use little endian when storing integers to RAM.
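
As an illustration of the default big endian integer encoding, here is a minimal Python sketch built on the standard `struct` module. The function names are hypothetical and are not part of any Thrift library.

```python
import struct

# Big endian ("network order") signed integers.
# struct format codes: b = int8, h = int16, i = int32, q = int64.
# The ">" prefix selects big endian with no padding.

def encode_i8(value: int) -> bytes:
    return struct.pack(">b", value)

def encode_i16(value: int) -> bytes:
    return struct.pack(">h", value)

def encode_i32(value: int) -> bytes:
    return struct.pack(">i", value)

def encode_i64(value: int) -> bytes:
    return struct.pack(">q", value)
```

With these helpers, `encode_i16(1)` yields the two bytes `00 01`: the most significant byte comes first.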

### Enum encoding

The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an int32.

### Binary encoding

Binary is sent as follows:

```
Binary protocol, binary data, 4+ bytes:
+--------+--------+--------+--------+--------+...+--------+
| byte length                       | bytes                |
+--------+--------+--------+--------+--------+...+--------+
```

Where:

* `byte length` is the length of the byte array, a signed 32 bit integer encoded in network (big endian) order (must be >= 0).
* `bytes` are the bytes of the byte array.

### String encoding

*String*s are first encoded to UTF-8, and then sent as binary.
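
The binary and string encodings can be sketched in Python as follows. This is an illustrative example, not code from a Thrift library; `encode_binary`, `encode_string` and `decode_binary` are hypothetical names.

```python
import struct
from typing import Tuple

def encode_binary(data: bytes) -> bytes:
    # 4-byte big endian signed length, followed by the raw bytes.
    return struct.pack(">i", len(data)) + data

def encode_string(text: str) -> bytes:
    # Strings are UTF-8 encoded first, then sent as binary.
    return encode_binary(text.encode("utf-8"))

def decode_binary(buf: bytes, offset: int = 0) -> Tuple[bytes, int]:
    # Returns the decoded bytes and the offset just past them.
    (length,) = struct.unpack_from(">i", buf, offset)
    if length < 0:
        raise ValueError("byte length must be >= 0")
    start = offset + 4
    end = start + length
    if end > len(buf):
        raise ValueError("truncated binary value")
    return buf[start:end], end
```

Note that `byte length` counts bytes, not characters: a string with non-ASCII characters has a byte length larger than its character count.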

### Double encoding

Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit
layout. Most run-times provide a library to make this conversion. Both the binary protocol and the compact protocol then
encode the int64 in 8 bytes in big endian order.
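
The two-step conversion described above can be sketched in Python, using `struct` to reinterpret the IEEE 754 bits. The function name `encode_double` is a hypothetical helper, and the explicit intermediate int64 is kept only to mirror the description; packing with `">d"` directly produces the same bytes.

```python
import struct

def encode_double(value: float) -> bytes:
    # Step 1: reinterpret the IEEE 754 double bit layout as a signed int64.
    (bits,) = struct.unpack(">q", struct.pack(">d", value))
    # Step 2: encode that int64 in 8 bytes, big endian.
    return struct.pack(">q", bits)
```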

### Boolean encoding

Values of `bool` type are first converted to an int8. True is converted to `1`, false to `0`.

### Universal unique identifier encoding

Values of `uuid` type are expected as 16-byte binary in big endian (or "network") order. Byte order conversion
might be necessary on certain platforms, e.g. Windows holds GUIDs in a complex record-like structure whose
memory layout differs.

*Note*: Since the length is fixed, no `byte length` prefix is necessary and the field is always 16 bytes long.
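
A short Python sketch of the boolean and UUID encodings follows. The helper names are hypothetical. Python's standard `uuid.UUID.bytes` already gives the big endian form, while `uuid.UUID.bytes_le` gives the mixed-endian layout used by Windows GUIDs, which is the kind of representation that would need conversion before being sent.

```python
import uuid

def encode_bool(value: bool) -> bytes:
    # A bool is sent as a single int8: 1 for true, 0 for false.
    return b"\x01" if value else b"\x00"

def encode_uuid(value: uuid.UUID) -> bytes:
    # 16 bytes in big endian (network) order, with no length prefix.
    return value.bytes

def uuid_from_windows_guid(guid_bytes: bytes) -> uuid.UUID:
    # Convert a 16-byte Windows-style GUID memory layout to a UUID.
    return uuid.UUID(bytes_le=guid_bytes)
```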

## Message

A `Message` can be encoded in two different ways:

```
Binary protocol Message, strict encoding, 12+ bytes:
+--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
|1vvvvvvv|vvvvvvvv|unused  |00000mmm| name length                       | name                | seq id                            |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
```

Where:

* `vvvvvvvvvvvvvvv` is the version, an unsigned 15 bit number fixed to `1` (in binary: `000 0000 0000 0001`).
  The leading bit is `1`.
* `unused` is an ignored byte.
* `mmm` is the message type, an unsigned 3 bit integer. The 5 leading bits must be `0` as some clients (checked for
  Java in 0.9.1) take the whole byte.
* `name length` is the byte length of the name field, a signed 32 bit integer encoded in network (big endian) order (must be >= 0).
* `name` is the method name, a UTF-8 encoded string.
* `seq id` is the sequence id, a signed 32 bit integer encoded in network (big endian) order.

The second, older encoding (aka non-strict) is:

```
Binary protocol Message, old encoding, 9+ bytes:
+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
| name length                       | name                |00000mmm| seq id                            |
+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
```

Where `name length`, `name`, `mmm`, `seq id` are as above.

Because `name length` must be non-negative (therefore its first bit is always `0`), the first bit allows the receiver to
see whether the strict format or the old format is used. Therefore a server and client using the different variants of
the binary protocol can transparently talk with each other. However, when strict mode is enforced, the old format is
rejected.

Message types are encoded with the following values:

* _Call_: 1
* _Reply_: 2
* _Exception_: 3
* _Oneway_: 4
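
To make the two message layouts concrete, here is a minimal Python sketch that writes both encodings and reads either one by inspecting the first bit. All names (`encode_message_strict`, `encode_message_old`, `decode_message`, the constants) are hypothetical, and the decoder omits the bounds checks and strict-mode enforcement a real implementation would need.

```python
import struct
from typing import Tuple

CALL, REPLY, EXCEPTION, ONEWAY = 1, 2, 3, 4
VERSION_1 = 0x8001  # leading bit 1, followed by the 15 bit version number 1

def encode_message_strict(name: str, msg_type: int, seq_id: int) -> bytes:
    raw = name.encode("utf-8")
    # version (2 bytes), unused byte, message type, name length, name, seq id
    header = struct.pack(">HBBi", VERSION_1, 0, msg_type, len(raw))
    return header + raw + struct.pack(">i", seq_id)

def encode_message_old(name: str, msg_type: int, seq_id: int) -> bytes:
    raw = name.encode("utf-8")
    # name length, name, message type, seq id
    return struct.pack(">i", len(raw)) + raw + struct.pack(">bi", msg_type, seq_id)

def decode_message(buf: bytes) -> Tuple[str, int, int]:
    # Read the first 4 bytes as a signed int32; a negative value means
    # the first bit is 1, which signals the strict encoding.
    (first,) = struct.unpack_from(">i", buf, 0)
    if first < 0:
        if (first >> 16) & 0x7FFF != 1:
            raise ValueError("unsupported version")
        msg_type = first & 0x07
        (length,) = struct.unpack_from(">i", buf, 4)
        if length < 0:
            raise ValueError("name length must be >= 0")
        name = buf[8:8 + length].decode("utf-8")
        (seq_id,) = struct.unpack_from(">i", buf, 8 + length)
    else:
        length = first
        name = buf[4:4 + length].decode("utf-8")
        msg_type, seq_id = struct.unpack_from(">bi", buf, 4 + length)
    return name, msg_type, seq_id
```

For example, a strict _Call_ to a method named `ping` with sequence id 7 starts with the bytes `80 01 00 01`, while the old encoding of the same message starts with the name length `00 00 00 04`.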

## Struct

A *Struct* is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and
is followed by the encoded field value. The encoding can be summarized by the following BNF:

```
struct        ::= ( field-header field-value )* stop-field
field-header  ::= field-type field-id
```

Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any
order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore it is also
possible to handle unknown fields while decoding; these are simply ignored. While decoding, the field type can be used to
determine how to decode the field value.

Note that the field name is not encoded, so field renames in the IDL do not affect forward and backward compatibility.

The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has
another field-type than what is expected. Theoretically, this could be detected at the cost of some additional checking.
Other implementations may perform this check and then either ignore the field or return a protocol exception.

A *Union* is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded.

An *Exception* is encoded exactly the same as a struct.

### Struct encoding

In the binary protocol field headers and the stop field are encoded as follows:

```
Binary protocol field header and field value:
+--------+--------+--------+--------+...+--------+
|tttttttt| field id        | field value         |
+--------+--------+--------+--------+...+--------+

Binary protocol stop field:
+--------+
|00000000|
+--------+
```

Where:

* `tttttttt` is the field-type, a signed 8 bit integer.
* `field id` is the field-id, a signed 16 bit integer in big endian order.
* `field value` is the encoded field value.

The following field-types are used:

* `BOOL`, encoded as `2`
* `I8`, encoded as `3`
* `DOUBLE`, encoded as `4`
* `I16`, encoded as `6`
* `I32`, encoded as `8`
* `I64`, encoded as `10`
* `BINARY`, used for binary and string fields, encoded as `11`
* `STRUCT`, used for structs and union fields, encoded as `12`
* `MAP`, encoded as `13`
* `SET`, encoded as `14`
* `LIST`, encoded as `15`
* `UUID`, encoded as `16`
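
As a worked example, the following Python sketch encodes a hypothetical struct with two fields: field 1 of type `I32` and field 2 of type string. The struct shape and all function names are assumptions made for illustration; the field-type constants are taken from the list above.

```python
import struct

T_STOP = 0
T_I32 = 8
T_BINARY = 11

def field_header(field_type: int, field_id: int) -> bytes:
    # 1-byte field-type followed by a 2-byte big endian field-id.
    return struct.pack(">bh", field_type, field_id)

def encode_example_struct(count: int, label: str) -> bytes:
    raw = label.encode("utf-8")
    out = field_header(T_I32, 1) + struct.pack(">i", count)
    out += field_header(T_BINARY, 2) + struct.pack(">i", len(raw)) + raw
    # The stop field is a single zero byte and carries no field-id.
    out += bytes([T_STOP])
    return out
```

Since every field carries its own id, emitting field 2 before field 1 would produce an equally valid encoding of the same struct.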

## List and Set

Lists and sets are encoded the same: a header indicating the size and the element-type of the elements, followed by the
encoded elements.

```
Binary protocol list (5+ bytes) and elements:
+--------+--------+--------+--------+--------+--------+...+--------+
|tttttttt| size                              | elements            |
+--------+--------+--------+--------+--------+--------+...+--------+
```

Where:

* `tttttttt` is the element-type, encoded as an int8
* `size` is the size, encoded as an int32, non-negative values only
* `elements` are the element values

The element-type values are the same as field-types. The full list is included in the struct section above.

The maximum list/set size is configurable. By default, there is no limit (meaning the limit is the maximum int32 value:
2147483647).
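
A minimal Python sketch of the list header and elements, for a list of `I32` values. The function name is hypothetical; a set of `I32` values would be encoded the same way.

```python
import struct
from typing import List

T_I32 = 8  # element-type value, same as the I32 field-type

def encode_i32_list(values: List[int]) -> bytes:
    # Header: 1-byte element-type and 4-byte big endian size.
    header = struct.pack(">bi", T_I32, len(values))
    # Elements: each value encoded as a big endian int32.
    return header + b"".join(struct.pack(">i", v) for v in values)
```

An empty list is just the 5-byte header with a size of zero.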

## Map

Maps are encoded with a header indicating the size, the element-type of the keys and the element-type of the elements,
followed by the encoded elements. The encoding follows this BNF:

```
map ::= key-element-type value-element-type size ( key value )*
```

```
Binary protocol map (6+ bytes) and key value pairs:
+--------+--------+--------+--------+--------+--------+--------+...+--------+
|kkkkkkkk|vvvvvvvv| size                              | key value pairs     |
+--------+--------+--------+--------+--------+--------+--------+...+--------+
```

Where:

* `kkkkkkkk` is the key element-type, encoded as an int8
* `vvvvvvvv` is the value element-type, encoded as an int8
* `size` is the size of the map, encoded as an int32, non-negative values only
* `key value pairs` are the encoded keys and values

The element-type values are the same as field-types. The full list is included in the struct section above.

The maximum map size is configurable. By default, there is no limit (meaning the limit is the maximum int32 value:
2147483647).
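
The map layout can be sketched in Python for a map from string keys to `I32` values. The function name and the choice of key and value types are assumptions made for this example.

```python
import struct
from typing import Dict

T_I32 = 8
T_BINARY = 11  # used for string keys

def encode_string_i32_map(entries: Dict[str, int]) -> bytes:
    # Header: key element-type, value element-type, 4-byte big endian size.
    out = struct.pack(">bbi", T_BINARY, T_I32, len(entries))
    for key, value in entries.items():
        raw = key.encode("utf-8")
        # Each pair is the encoded key directly followed by the encoded value.
        out += struct.pack(">i", len(raw)) + raw + struct.pack(">i", value)
    return out
```

An empty map therefore takes exactly 6 bytes: the two element-types and a size of zero.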

# BNF notation used in this document

The following BNF notation is used:

* a plus `+` appended to an item represents repetition; the item is repeated 1 or more times
* a star `*` appended to an item represents optional repetition; the item is repeated 0 or more times
* a pipe `|` between items represents choice; the first matching item is selected
* parentheses `(` and `)` are used for grouping multiple items