Thrift Compact protocol encoding
================================

<!--
--------------------------------------------------------------------

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

--------------------------------------------------------------------
-->

This document describes the wire encoding for RPC using the Thrift *compact protocol*.

The information here is _mostly_ based on the Java implementation in the Apache Thrift library (version 0.9.1) and
[THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementations, however,
should behave the same.

For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache.org/static/files/thrift-20070401.pdf).

# Contents

* Compact protocol
  * Base types
  * Message
  * Struct
  * List and Set
  * Map
* BNF notation used in this document

# Compact protocol

## Base types

### Integer encoding

The _compact protocol_ uses two encodings for ints: the _zigzag int_ and the _var int_.

Values of type `int32` and `int64` are first transformed to a *zigzag int*. A zigzag int folds positive and negative
numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from the wire, this is translated to 0, -1, 1,
-2, 2 or -3 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back:

```scala
def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1)
def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1)
```

The zigzag int is then encoded as a *var int*. Var ints take 1 to 5 bytes (int32) or 1 to 10 bytes (int64). The most
significant bit of each byte indicates if more bytes follow. The concatenation of the least significant 7 bits from each
byte forms the number, where the first byte has the least significant bits (so the 7-bit groups are in little endian
order).

Var ints are sometimes used directly inside the compact protocol to represent positive numbers.

To encode an `int16` as a zigzag int, it is first converted to an `int32` and then encoded as such. The type `int8`
simply uses a single byte as in the binary protocol.

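The var int rules above can be sketched as follows. This is a minimal illustration, not the library's code: the object and method names (`VarIntSketch`, `encode`, `decode`) are made up for this example, and it assumes the least significant 7-bit group is written first, which matches what the Java implementation puts on the wire.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, for illustration only.
object VarIntSketch {
  // Encodes a value that is already non-negative or already zigzag-transformed.
  def encode(value: Long): Array[Byte] = {
    val out = ArrayBuffer.empty[Byte]
    var n = value
    // While more than 7 significant bits remain, emit 7 bits with the "more bytes follow" bit set.
    while ((n & ~0x7FL) != 0L) {
      out += ((n & 0x7F) | 0x80).toByte
      n = n >>> 7
    }
    out += n.toByte // last byte: most significant bit is clear
    out.toArray
  }

  // Decodes a var int starting at `offset`; returns (value, number of bytes consumed).
  def decode(bytes: Array[Byte], offset: Int = 0): (Long, Int) = {
    var result = 0L
    var shift = 0
    var pos = offset
    var more = true
    while (more) {
      val b = bytes(pos)
      result = result | ((b & 0x7FL) << shift)
      shift += 7
      pos += 1
      more = (b & 0x80) != 0
    }
    (result, pos - offset)
  }
}
```

For example, `VarIntSketch.encode(300L)` yields the two bytes `0xAC 0x02`, and decoding those bytes gives back `(300, 2)`.
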
### Enum encoding

The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an int32.

### Binary encoding

Binary is sent as follows:

```
Compact protocol, binary data, 1+ bytes:
+--------+...+--------+--------+...+--------+
| byte length         | bytes               |
+--------+...+--------+--------+...+--------+
```

Where:

* `byte length` is the length of the byte array, using var int encoding (must be >= 0).
* `bytes` are the bytes of the byte array.

### String encoding

*String*s are first encoded to UTF-8, and then sent as binary. They do not
include a NUL delimiter.

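As a rough sketch of the binary and string rules above: the length goes out as a var int, followed by the raw bytes. The names `BinarySketch`, `varInt`, `encodeBinary` and `encodeString` are hypothetical, and the var int helper assumes the least significant 7-bit group comes first.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper, for illustration only.
object BinarySketch {
  // Var int encoding of an int32, least significant 7-bit group first (assumption).
  def varInt(value: Int): Array[Byte] =
    if ((value & ~0x7F) == 0) Array(value.toByte)
    else ((value & 0x7F) | 0x80).toByte +: varInt(value >>> 7)

  // Binary: var int byte length followed by the bytes themselves.
  def encodeBinary(data: Array[Byte]): Array[Byte] =
    varInt(data.length) ++ data

  // String: UTF-8 bytes sent as binary, without a NUL delimiter.
  def encodeString(s: String): Array[Byte] =
    encodeBinary(s.getBytes(StandardCharsets.UTF_8))
}
```

Note that the length prefix counts bytes, not characters: `BinarySketch.encodeString("\u00e9")` produces `0x02 0xC3 0xA9`.
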
### Double encoding

Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit
layout. Most run-times provide a library to make this conversion. But while the binary protocol encodes the int64
in 8 bytes in big endian order, the compact protocol encodes it in little endian order. This is due to an early
implementation bug that eventually became the de-facto standard.

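A minimal sketch of the double encoding on the JVM, using `java.lang.Double.doubleToLongBits` for the IEEE 754 conversion and a little endian `ByteBuffer` for the byte order. The object and method names are illustrative only.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical helper, for illustration only.
object DoubleSketch {
  // IEEE 754 bits as an int64, written as 8 bytes in little endian order.
  def encodeDouble(d: Double): Array[Byte] =
    ByteBuffer.allocate(8)
      .order(ByteOrder.LITTLE_ENDIAN)
      .putLong(java.lang.Double.doubleToLongBits(d))
      .array()

  // Reverse direction: read 8 little endian bytes and reinterpret them as a double.
  def decodeDouble(bytes: Array[Byte]): Double =
    java.lang.Double.longBitsToDouble(
      ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong())
}
```

The value `1.0` has the bit pattern `0x3FF0000000000000`, so it is sent as `00 00 00 00 00 00 F0 3F`.
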
### Boolean encoding

A boolean is encoded differently depending on whether it is a field value (in a struct) or an element value (in a set,
list or map). Field values are encoded directly in the field header. Element values of type `bool` are sent as an int8;
true as `1` and false as `0`.

## Message

A `Message` on the wire looks as follows:

```
Compact protocol Message (4+ bytes):
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
|pppppppp|mmmvvvvv| seq id              | name length         | name                |
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
```

Where:

* `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82.
* `mmm` is the message type, an unsigned 3 bit integer.
* `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`.
* `seq id` is the sequence id, a signed 32 bit integer encoded as a var int.
* `name length` is the byte length of the name field, a signed 32 bit integer encoded as a var int (must be >= 0).
* `name` is the method name to invoke, a UTF-8 encoded string.

Message types are encoded with the following values:

* _Call_: 1
* _Reply_: 2
* _Exception_: 3
* _Oneway_: 4

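The message layout above could be assembled as in the following sketch. `MessageSketch`, `varInt` and `messageHeader` are hypothetical names, and the var int helper assumes the least significant 7-bit group comes first.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper, for illustration only.
object MessageSketch {
  val ProtocolId = 0x82 // pppppppp
  val Version = 1       // vvvvv

  // Var int encoding of an int32, least significant 7-bit group first (assumption).
  def varInt(value: Int): Array[Byte] =
    if ((value & ~0x7F) == 0) Array(value.toByte)
    else ((value & 0x7F) | 0x80).toByte +: varInt(value >>> 7)

  // messageType: 1 = Call, 2 = Reply, 3 = Exception, 4 = Oneway.
  def messageHeader(messageType: Int, seqId: Int, name: String): Array[Byte] = {
    val nameBytes = name.getBytes(StandardCharsets.UTF_8)
    // Second byte packs the 3 bit message type above the 5 bit version.
    Array(ProtocolId.toByte, ((messageType << 5) | Version).toByte) ++
      varInt(seqId) ++
      varInt(nameBytes.length) ++
      nameBytes
  }
}
```

A _Call_ to a method named `ping` with sequence id 1 would then start with the bytes `82 21 01 04 70 69 6E 67`.
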
## Struct

A *Struct* is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and
is followed by the encoded field value. The encoding can be summarized by the following BNF:

```
struct        ::= ( field-header field-value )* stop-field
field-header  ::= field-type field-id
```

Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any
order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore it is also
possible to handle unknown fields while decoding; these are simply ignored. While decoding, the field type can be used
to determine how to decode the field value.

Note that the field name is not encoded so field renames in the IDL do not affect forward and backward compatibility.

The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has
a different field-type than expected. Theoretically this could be detected at the cost of some additional checking.
Other implementations may perform this check and then either ignore the field, or return a protocol exception.

A *Union* is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded.

An *Exception* is encoded exactly the same as a struct.

### Struct encoding

```
Compact protocol field header (short form) and field value:
+--------+--------+...+--------+
|ddddtttt| field value         |
+--------+--------+...+--------+

Compact protocol field header (2 to 4 bytes, long form) and field value:
+--------+--------+...+--------+--------+...+--------+
|0000tttt| field id            | field value         |
+--------+--------+...+--------+--------+...+--------+

Compact protocol stop field:
+--------+
|00000000|
+--------+
```

Where:

* `dddd` is the field id delta, an unsigned 4 bit integer, strictly positive.
* `tttt` is the field-type id, an unsigned 4 bit integer.
* `field id` is the field id, a signed 16 bit integer encoded as zigzag int.
* `field value` is the encoded field value.

The field id delta can be computed by `current-field-id - previous-field-id`, or just `current-field-id` if this is the
first field of the struct. The short form should be used when the field id delta is in the range 1 - 15 (inclusive).

The following field-types can be encoded:

* `BOOLEAN_TRUE`, encoded as `1`
* `BOOLEAN_FALSE`, encoded as `2`
* `BYTE`, encoded as `3`
* `I16`, encoded as `4`
* `I32`, encoded as `5`
* `I64`, encoded as `6`
* `DOUBLE`, encoded as `7`
* `BINARY`, used for binary and string fields, encoded as `8`
* `LIST`, encoded as `9`
* `SET`, encoded as `10`
* `MAP`, encoded as `11`
* `STRUCT`, used for both structs and union fields, encoded as `12`

Note that because there are 2 specific field types for the boolean values, the encoding of a boolean field value has no
length (0 bytes).

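The choice between the short and long field header forms can be sketched as below. The names are hypothetical, the previous field id is assumed to be 0 for the first field of a struct, and the var int helper assumes the least significant 7-bit group comes first.

```scala
// Hypothetical helper, for illustration only.
object FieldHeaderSketch {
  val StopField: Byte = 0

  // Var int encoding of an int32, least significant 7-bit group first (assumption).
  def varInt(value: Int): Array[Byte] =
    if ((value & ~0x7F) == 0) Array(value.toByte)
    else ((value & 0x7F) | 0x80).toByte +: varInt(value >>> 7)

  def zigZag(n: Int): Int = (n << 1) ^ (n >> 31)

  // fieldType is one of the field-type ids listed above (e.g. 5 for I32).
  def fieldHeader(fieldType: Int, fieldId: Int, previousFieldId: Int): Array[Byte] = {
    val delta = fieldId - previousFieldId
    if (delta >= 1 && delta <= 15)
      Array(((delta << 4) | fieldType).toByte)      // short form: ddddtttt
    else
      fieldType.toByte +: varInt(zigZag(fieldId))   // long form: 0000tttt + zigzag field id
  }
}
```

For instance, an `I32` field with id 1 at the start of a struct gets the single header byte `0x15`, while an `I32` field with id 20 following field 1 needs the long form `0x05 0x28`. A boolean field with value true and id 2 following field 1 is just `0x11`, with no field value bytes after it.
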
## List and Set

Lists and sets are encoded the same way: a header indicating the size and the element-type of the elements, followed by
the encoded elements.

```
Compact protocol list header (1 byte, short form) and elements:
+--------+--------+...+--------+
|sssstttt| elements            |
+--------+--------+...+--------+

Compact protocol list header (2+ bytes, long form) and elements:
+--------+--------+...+--------+--------+...+--------+
|1111tttt| size                | elements            |
+--------+--------+...+--------+--------+...+--------+
```

Where:

* `ssss` is the size, 4 bit unsigned int, values `0` - `14`
* `tttt` is the element-type, a 4 bit unsigned int
* `size` is the size, a var int (int32), positive values `15` or higher
* `elements` are the encoded elements

The short form should be used when the length is in the range 0 - 14 (inclusive).

The following element-types are used (note that these are _different_ from the field-types):

* `BOOL`, encoded as `2`
* `BYTE`, encoded as `3`
* `DOUBLE`, encoded as `4`
* `I16`, encoded as `6`
* `I32`, encoded as `8`
* `I64`, encoded as `10`
* `STRING`, used for binary and string fields, encoded as `11`
* `STRUCT`, used for structs and union fields, encoded as `12`
* `MAP`, encoded as `13`
* `SET`, encoded as `14`
* `LIST`, encoded as `15`

The maximum list/set size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).

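A small sketch of the list and set header rule: sizes up to 14 share a byte with the element-type, larger sizes use the marker `1111` followed by a var int. The names are hypothetical, the element-type id is passed in by the caller, and the var int helper assumes the least significant 7-bit group comes first.

```scala
// Hypothetical helper, for illustration only.
object ListHeaderSketch {
  // Var int encoding of an int32, least significant 7-bit group first (assumption).
  def varInt(value: Int): Array[Byte] =
    if ((value & ~0x7F) == 0) Array(value.toByte)
    else ((value & 0x7F) | 0x80).toByte +: varInt(value >>> 7)

  def listHeader(elementType: Int, size: Int): Array[Byte] = {
    require(size >= 0, "size must not be negative")
    if (size <= 14)
      Array(((size << 4) | elementType).toByte)     // short form: sssstttt
    else
      (0xF0 | elementType).toByte +: varInt(size)   // long form: 1111tttt + size
  }
}
```

With element-type `3` (`BYTE`), a list of 3 elements has the header `0x33`; with element-type `12` (`STRUCT`), a list of 20 elements has the header `0xFC 0x14`.
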
## Map

Maps are encoded with a header indicating the size, the type of the keys and the element-type of the elements, followed
by the encoded elements. The encoding follows this BNF:

```
map           ::= empty-map | non-empty-map
empty-map     ::= `0`
non-empty-map ::= size key-element-type value-element-type (key value)+
```

```
Compact protocol map header (1 byte, empty map):
+--------+
|00000000|
+--------+

Compact protocol map header (2+ bytes, non empty map) and key value pairs:
+--------+...+--------+--------+--------+...+--------+
| size                |kkkkvvvv| key value pairs     |
+--------+...+--------+--------+--------+...+--------+
```

Where:

* `size` is the size, a var int (int32), strictly positive values
* `kkkk` is the key element-type, a 4 bit unsigned int
* `vvvv` is the value element-type, a 4 bit unsigned int
* `key value pairs` are the encoded keys and values

The element-types are the same as for lists. The full list is included in the 'List and Set' section.

The maximum map size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).

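The map header can be sketched in the same style. An empty map is a single zero byte; otherwise the var int size is followed by one byte holding both element-types. The names are hypothetical, the element-type ids are supplied by the caller, and the var int helper assumes the least significant 7-bit group comes first.

```scala
// Hypothetical helper, for illustration only.
object MapHeaderSketch {
  // Var int encoding of an int32, least significant 7-bit group first (assumption).
  def varInt(value: Int): Array[Byte] =
    if ((value & ~0x7F) == 0) Array(value.toByte)
    else ((value & 0x7F) | 0x80).toByte +: varInt(value >>> 7)

  def mapHeader(keyType: Int, valueType: Int, size: Int): Array[Byte] = {
    require(size >= 0, "size must not be negative")
    if (size == 0)
      Array(0.toByte)                                    // empty map: no type byte at all
    else
      varInt(size) :+ ((keyType << 4) | valueType).toByte // size, then kkkkvvvv
  }
}
```

A map with 2 entries, key element-type `3` (`BYTE`) and value element-type `12` (`STRUCT`), starts with `0x02 0x3C`. The empty map is `0x00` regardless of the types.
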
# BNF notation used in this document

The following BNF notation is used:

* a plus `+` appended to an item represents repetition; the item is repeated 1 or more times
* a star `*` appended to an item represents optional repetition; the item is repeated 0 or more times
* a pipe `|` between items represents choice; the first matching item is selected
* parentheses `(` and `)` are used for grouping multiple items