blob: 3e1c4b4548fb4ca142318786c8e8b03ed038211e [file] [log] [blame] [view]
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +02001Thrift Compact protocol encoding
2================================
3
Jens Geyer57679012016-09-21 22:18:44 +02004<!--
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +02005--------------------------------------------------------------------
6
7Licensed to the Apache Software Foundation (ASF) under one
8or more contributor license agreements. See the NOTICE file
9distributed with this work for additional information
10regarding copyright ownership. The ASF licenses this file
11to you under the Apache License, Version 2.0 (the
12"License"); you may not use this file except in compliance
13with the License. You may obtain a copy of the License at
14
15 http://www.apache.org/licenses/LICENSE-2.0
16
17Unless required by applicable law or agreed to in writing,
18software distributed under the License is distributed on an
19"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
20KIND, either express or implied. See the License for the
21specific language governing permissions and limitations
22under the License.
23
24--------------------------------------------------------------------
Jens Geyer57679012016-09-21 22:18:44 +020025-->
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +020026
27This documents describes the wire encoding for RPC using the Thrift *compact protocol*.
28
29The information here is _mostly_ based on the Java implementation in the Apache thrift library (version 0.9.1) and
30[THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementation however,
31should behave the same.
32
33For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache.org/static/files/thrift-20070401.pdf).
34
35# Contents
36
37* Compact protocol
38 * Base types
39 * Message
40 * Struct
41 * List and Set
42 * Map
43* BNF notation used in this document
44
45# Compact protocol
46
47## Base types
48
49### Integer encoding
50
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +000051The _compact protocol_ uses [ZigZag](https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding)'ed
52varint (aka [ULEB128](https://en.wikipedia.org/wiki/LEB128)) encoding.
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +020053
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +000054Values of type `int8` are encoded as one byte, rest are converted to `int64`, Zigzag'ed then encoded as varint.
55Zigzag encoding maps signed integers to another domain, one where the sign bit is encoded in the least significant
56bit (LSB). For example 0 maps to 0, -1 to 1, 1 to 2, -2 to 3, etc. Hence the term zigzag. Mapping the sign bit to
57the LSB is important for compactness when the absolute value of the value is small, as ULEB encoding is more
58efficient for small values. Here are the (Scala) formulas to convert from `int64` to a zigzag `int64` and back:
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +020059
60```scala
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +000061def ToZigzag(n: Long): Long = (n << 1) ^ (n >> 63)
62def FromZigzag(n: Long): Long = (n >>> 1) ^ - (n & 1)
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +020063```
64
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +000065A ULEB128 is encoded 7-bits at a time, starting from the LSB. Each 7-bits are encoded as 8-bits with the top bit set
66if this is not the last byte, unset otherwise.
Juan Cruz Viotti2e7f39f2021-01-20 17:05:19 -040067
68For example, the integer 50399 is encoded as follows:
69
70```
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +00007150399 = 11000100 11011111 (LSB)
72 = 0000011 0001001 1011111 (7-bit groups)
73 = 00000011 10001001 11011111 (add continuation bits)
74 = 0x03 0x89 0xDF (hex)
75→ 0xDF 0x89 0x03 (write to ram LSB first)
Juan Cruz Viotti2e7f39f2021-01-20 17:05:19 -040076```
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +020077
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +000078Varints are sometimes used directly inside the compact protocol to represent positive numbers.
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +020079
80### Enum encoding
81
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +000082The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an `int32`.
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +020083
84### Binary encoding
85
86Binary is sent as follows:
87
88```
89Binary protocol, binary data, 1+ bytes:
90+--------+...+--------+--------+...+--------+
91| byte length | bytes |
92+--------+...+--------+--------+...+--------+
93```
94
95Where:
96
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +000097* `byte length` is the length of the byte array, using varint encoding (must be >= 0).
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +020098* `bytes` are the bytes of the byte array.
99
100### String encoding
101
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +0000102*String*s are first encoded to UTF-8, and then send as binary. They do not include a NUL delimiter.
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200103
104### Double encoding
105
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +0000106Values of type `double` are first converted to an `int64` according to the IEEE 754 floating-point "double format"
107bit layout. Most run-times provide a library to make this conversion. But while the binary protocol encodes the
108`int64` in 8 bytes in big endian order, the compact protocol encodes it in little endian order - this is due to an
109early implementation bug that finally became the de-facto standard.
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200110
111### Boolean encoding
112
113Booleans are encoded differently depending on whether it is a field value (in a struct) or an element value (in a set,
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +0000114list or map). Field values are encoded directly in the field header. Element values of type `bool` are sent as an
Manuel Pöter2c29c562024-12-13 09:34:44 +0100115`int8`; true as `1` and false as `2`.
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200116
Triton Circonflexe4959a922022-06-07 21:40:41 +0200117### Universal unique identifier encoding
118
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +0000119Values of `uuid` type are expected as 16-byte binary in big endian order. Byte order conversion might be necessary on
120certain platforms, e.g. Windows holds GUIDs in a complex record-like structure whose memory layout differs.
Triton Circonflexe4959a922022-06-07 21:40:41 +0200121
122*Note*: Since the length is fixed, no `byte length` prefix is necessary and the field is always 16 bytes long.
123
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200124## Message
125
126A `Message` on the wire looks as follows:
127
128```
129Compact protocol Message (4+ bytes):
130+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
131|pppppppp|mmmvvvvv| seq id | name length | name |
132+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
133```
134
135Where:
136
137* `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82.
138* `mmm` is the message type, an unsigned 3 bit integer.
139* `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`.
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +0000140* `seq id` is the sequence id, a signed 32 bit integer encoded as a varint.
141* `name length` is the byte length of the name field, a signed 32 bit integer encoded as a varint (must be >= 0).
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200142* `name` is the method name to invoke, a UTF-8 encoded string.
143
144Message types are encoded with the following values:
145
146* _Call_: 1
147* _Reply_: 2
148* _Exception_: 3
149* _Oneway_: 4
150
151### Struct
152
153A *Struct* is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and
154is followed by the encoded field value. The encoding can be summarized by the following BNF:
155
156```
157struct ::= ( field-header field-value )* stop-field
158field-header ::= field-type field-id
159```
160
161Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any
162order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore is also
163possible to handle unknown fields while decoding; these are simply ignored. While decoding the field type can be used to
164determine how to decode the field value.
165
166Note that the field name is not encoded so field renames in the IDL do not affect forward and backward compatibility.
167
168The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has
Klaus Trainere41e47c2017-05-17 11:11:19 +0200169another field-type than what is expected. Theoretically this could be detected at the cost of some additional checking.
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200170Other implementation may perform this check and then either ignore the field, or return a protocol exception.
171
172A *Union* is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded.
173
174An *Exception* is encoded exactly the same as a struct.
175
176### Struct encoding
177
178```
179Compact protocol field header (short form) and field value:
180+--------+--------+...+--------+
181|ddddtttt| field value |
182+--------+--------+...+--------+
183
184Compact protocol field header (1 to 3 bytes, long form) and field value:
185+--------+--------+...+--------+--------+...+--------+
186|0000tttt| field id | field value |
187+--------+--------+...+--------+--------+...+--------+
188
189Compact protocol stop field:
190+--------+
191|00000000|
192+--------+
193```
194
195Where:
196
197* `dddd` is the field id delta, an unsigned 4 bits integer, strictly positive.
198* `tttt` is field-type id, an unsigned 4 bit integer.
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +0000199* `field id` the field id, a varint (int16). Max field id is 32767.
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200200* `field-value` the encoded field value.
201
202The field id delta can be computed by `current-field-id - previous-field-id`, or just `current-field-id` if this is the
203first of the struct. The short form should be used when the field id delta is in the range 1 - 15 (inclusive).
204
205The following field-types can be encoded:
206
207* `BOOLEAN_TRUE`, encoded as `1`
208* `BOOLEAN_FALSE`, encoded as `2`
Tritonace86132021-07-20 08:01:19 +0200209* `I8`, encoded as `3`
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200210* `I16`, encoded as `4`
211* `I32`, encoded as `5`
212* `I64`, encoded as `6`
213* `DOUBLE`, encoded as `7`
214* `BINARY`, used for binary and string fields, encoded as `8`
215* `LIST`, encoded as `9`
216* `SET`, encoded as `10`
217* `MAP`, encoded as `11`
218* `STRUCT`, used for both structs and union fields, encoded as `12`
Triton Circonflexe4959a922022-06-07 21:40:41 +0200219* `UUID`, encoded as `13`
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200220
221Note that because there are 2 specific field types for the boolean values, the encoding of a boolean field value has no
222length (0 bytes).
223
224## List and Set
225
226List and sets are encoded the same: a header indicating the size and the element-type of the elements, followed by the
227encoded elements.
228
229```
230Compact protocol list header (1 byte, short form) and elements:
231+--------+--------+...+--------+
232|sssstttt| elements |
233+--------+--------+...+--------+
234
235Compact protocol list header (2+ bytes, long form) and elements:
236+--------+--------+...+--------+--------+...+--------+
237|1111tttt| size | elements |
238+--------+--------+...+--------+--------+...+--------+
239```
240
241Where:
242
243* `ssss` is the size, 4 bit unsigned int, values `0` - `14`
244* `tttt` is the element-type, a 4 bit unsigned int
Alkis Evlogimenos2bbaaa82024-05-27 16:16:02 +0000245* `size` is the size, a varint (int32), positive values `15` or higher
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200246* `elements` are the encoded elements
247
248The short form should be used when the length is in the range 0 - 14 (inclusive).
249
Manuel Pöter2c29c562024-12-13 09:34:44 +0100250The following element-types are used (see note 1 below):
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200251
Jens Geyer645467e2024-12-13 23:58:01 +0100252* `BOOL`, encoded as `1` or `2` (see note 2 below)
Tritonace86132021-07-20 08:01:19 +0200253* `I8`, encoded as `3`
254* `I16`, encoded as `4`
255* `I32`, encoded as `5`
256* `I64`, encoded as `6`
257* `DOUBLE`, encoded as `7`
258* `BINARY`, used for binary and string fields, encoded as `8`
259* `LIST`, encoded as `9`
260* `SET`, encoded as `10`
261* `MAP`, encoded as `11`
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200262* `STRUCT`, used for structs and union fields, encoded as `12`
Triton Circonflexe4959a922022-06-07 21:40:41 +0200263* `UUID`, encoded as `13`
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200264
Manuel Pöter2c29c562024-12-13 09:34:44 +0100265*Note*:
2661. Although field-types and element-types lists are currently very similar, there is _no guarantee_ that this will
Tritonace86132021-07-20 08:01:19 +0200267remain true after new types are added.
Manuel Pöter2c29c562024-12-13 09:34:44 +01002682. For historical and compatibility reasons, a reader should be capable to deal with *both* cases.
Jens Geyer645467e2024-12-13 23:58:01 +0100269The only valid value in the original spec was `2`, but due to an widespread implementation bug the defacto
270standard across large parts of the library became `1` instead. As a result, both values are now allowed.
Erik van Oosten3f5fa5f2016-06-29 13:24:00 +0200271
272The maximum list/set size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2732147483647).
274
275## Map
276
277Maps are encoded with a header indicating the size, the type of the keys and the element-type of the elements, followed
278by the encoded elements. The encoding follows this BNF:
279
280```
281map ::= empty-map | non-empty-map
282empty-map ::= `0`
283non-empty-map ::= size key-element-type value-element-type (key value)+
284```
285
286```
287Compact protocol map header (1 byte, empty map):
288+--------+
289|00000000|
290+--------+
291
292Compact protocol map header (2+ bytes, non empty map) and key value pairs:
293+--------+...+--------+--------+--------+...+--------+
294| size |kkkkvvvv| key value pairs |
295+--------+...+--------+--------+--------+...+--------+
296```
297
298Where:
299
300* `size` is the size, a var int (int32), strictly positive values
301* `kkkk` is the key element-type, a 4 bit unsigned int
302* `vvvv` is the value element-type, a 4 bit unsigned int
303* `key value pairs` are the encoded keys and values
304
305The element-types are the same as for lists. The full list is included in the 'List and set' section.
306
307The maximum map size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
3082147483647).
309
310# BNF notation used in this document
311
312The following BNF notation is used:
313
314* a plus `+` appended to an item represents repetition; the item is repeated 1 or more times
315* a star `*` appended to an item represents optional repetition; the item is repeated 0 or more times
316* a pipe `|` between items represents choice, the first matching item is selected
317* parenthesis `(` and `)` are used for grouping multiple items