Thrift Compact protocol encoding
================================

--------------------------------------------------------------------

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

--------------------------------------------------------------------

This document describes the wire encoding for RPC using the Thrift *compact protocol*.

The information here is _mostly_ based on the Java implementation in the Apache thrift library (version 0.9.1) and
[THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementations, however,
should behave the same.

For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache.org/static/files/thrift-20070401.pdf).

# Contents

* Compact protocol
  * Base types
  * Message
  * Struct
  * List and Set
  * Map
* BNF notation used in this document

# Compact protocol

## Base types

### Integer encoding

The _compact protocol_ uses two encodings for ints: the _zigzag int_ and the _var int_.

Values of type `int32` and `int64` are first transformed to a *zigzag int*. A zigzag int folds positive and negative
numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from the wire, this is translated to 0, -1, 1,
-2, 2 or -3 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back:

```scala
// (n >> 31) is all ones for negative n, so the XOR maps -1 to 1, 1 to 2, -2 to 3, ...
def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
// The lowest bit carries the sign; -(n & 1) is all ones when that bit is set.
def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1)
def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1)
```

The zigzag int is then encoded as a *var int*. Var ints take 1 to 5 bytes (int32) or 1 to 10 bytes (int64). The most
significant bit of each byte indicates if more bytes follow. The concatenation of the least significant 7 bits from each
byte forms the number, where the first byte has the least significant bits (so the 7-bit groups are in little endian
order, not network order).

Var ints are sometimes used directly inside the compact protocol to represent positive numbers.

To encode an `int16` as zigzag int, it is first converted to an `int32` and then encoded as such. The type `int8` simply
uses a single byte as in the binary protocol.
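The var int step described in this section can be sketched as follows. This is a minimal illustration, not the library's code: the `VarInt` object and its method names are hypothetical, and it assumes the byte order used by the Java implementation, where the least significant 7-bit group is written first.

```scala
import scala.collection.mutable.ArrayBuffer

object VarInt {
  /** Encodes `value`, interpreted as an unsigned 64 bit number, as a var int. */
  def encode(value: Long): Array[Byte] = {
    val out = ArrayBuffer.empty[Byte]
    var n = value
    // While more than 7 bits remain, emit the low 7 bits with the continuation bit set.
    while ((n & ~0x7FL) != 0L) {
      out += ((n & 0x7F) | 0x80).toByte
      n >>>= 7
    }
    out += n.toByte // last byte, continuation bit clear
    out.toArray
  }

  /** Decodes a var int starting at `offset`; returns the value and the number of bytes read. */
  def decode(bytes: Array[Byte], offset: Int = 0): (Long, Int) = {
    var result = 0L
    var shift = 0
    var pos = offset
    var more = true
    while (more) {
      val b = bytes(pos)
      result |= (b & 0x7FL) << shift // later bytes carry more significant bits
      shift += 7
      pos += 1
      more = (b & 0x80) != 0
    }
    (result, pos - offset)
  }
}
```

For an `int32`, the zigzag result would be widened with `& 0xFFFFFFFFL` before calling `encode`, so that large zigzag values are not sign extended. Encoding 300 this way gives the two bytes `0xAC 0x02`.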

### Enum encoding

The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an int32.

### Binary encoding

Binary is sent as follows:

```
Compact protocol, binary data, 1+ bytes:
+--------+...+--------+--------+...+--------+
|     byte length     |        bytes        |
+--------+...+--------+--------+...+--------+
```

Where:

* `byte length` is the length of the byte array, using var int encoding (must be >= 0).
* `bytes` are the bytes of the byte array.

### String encoding

*String*s are first encoded to UTF-8, and then sent as binary.
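As an illustration of the binary and string layout, here is a small sketch. The `BinaryEncoding` object and its helpers are hypothetical names, and the var int helper assumes the least-significant-group-first order of the Java implementation.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

object BinaryEncoding {
  // Var int helper, repeated here to keep the sketch self-contained.
  private def varInt(value: Int): Array[Byte] = {
    require(value >= 0, "length must be >= 0")
    val out = ArrayBuffer.empty[Byte]
    var n = value
    while ((n & ~0x7F) != 0) {
      out += ((n & 0x7F) | 0x80).toByte
      n >>>= 7
    }
    out += n.toByte
    out.toArray
  }

  /** Length as var int, followed by the raw bytes. */
  def encodeBinary(bytes: Array[Byte]): Array[Byte] = varInt(bytes.length) ++ bytes

  /** Strings are converted to UTF-8 first and then sent as binary. */
  def encodeString(s: String): Array[Byte] = encodeBinary(s.getBytes(StandardCharsets.UTF_8))
}
```

Note that the length prefix counts bytes, not characters: a string with one two-byte UTF-8 character gets a length of 2.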

### Double encoding

Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit
layout. Most run-times provide a library to make this conversion. While the binary protocol encodes the int64 in 8 bytes
in big endian order, the compact protocol (at least in the Java implementation) encodes it in 8 bytes in little endian
order.
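The conversion can be sketched with the JDK's `Double.doubleToLongBits` and `ByteBuffer`. The `DoubleEncoding` object is a hypothetical name, and the sketch assumes the byte order that the Java compact protocol implementation writes, which is low byte first.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object DoubleEncoding {
  /** IEEE 754 bit layout as an int64, written in 8 bytes, low byte first. */
  def encode(d: Double): Array[Byte] =
    ByteBuffer.allocate(8)
      .order(ByteOrder.LITTLE_ENDIAN)
      .putLong(java.lang.Double.doubleToLongBits(d))
      .array()

  /** Reads the 8 bytes back into an int64 and reinterprets it as a double. */
  def decode(bytes: Array[Byte]): Double =
    java.lang.Double.longBitsToDouble(
      ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong())
}
```

The value `1.0` has the bit pattern `0x3FF0000000000000`, so under this assumption it is written as `00 00 00 00 00 00 F0 3F`.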

### Boolean encoding

Booleans are encoded differently depending on whether they are a field value (in a struct) or an element value (in a
set, list or map). Field values are encoded directly in the field header. Element values of type `bool` are sent as an
int8; true as `1` and false as `0`.

## Message

A `Message` on the wire looks as follows:

```
Compact protocol Message (4+ bytes):
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
|pppppppp|mmmvvvvv|       seq id        |     name length     |        name         |
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
```

Where:

* `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82.
* `mmm` is the message type, an unsigned 3 bit integer.
* `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`.
* `seq id` is the sequence id, a signed 32 bit integer encoded as a var int.
* `name length` is the byte length of the name field, a signed 32 bit integer encoded as a var int (must be >= 0).
* `name` is the method name to invoke, a UTF-8 encoded string.

Message types are encoded with the following values:

* _Call_: 1
* _Reply_: 2
* _Exception_: 3
* _Oneway_: 4
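The message header layout can be sketched as below. The `MessageHeader` object and its member names are hypothetical, and the sketch assumes that the sequence id is written as a plain var int without a zigzag step, as the field description above states.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

object MessageHeader {
  val ProtocolId = 0x82
  val Version = 1

  // Message type values from the list above.
  val Call = 1
  val Reply = 2
  val Exc = 3
  val Oneway = 4

  // Var int helper (least significant 7-bit group first); the value is treated as unsigned.
  private def writeVarInt(out: ArrayBuffer[Byte], value: Long): Unit = {
    var n = value
    while ((n & ~0x7FL) != 0L) {
      out += ((n & 0x7F) | 0x80).toByte
      n >>>= 7
    }
    out += n.toByte
  }

  def encode(messageType: Int, seqId: Int, name: String): Array[Byte] = {
    val nameBytes = name.getBytes(StandardCharsets.UTF_8)
    val out = ArrayBuffer.empty[Byte]
    out += ProtocolId.toByte                     // pppppppp
    out += ((messageType << 5) | Version).toByte // mmmvvvvv
    writeVarInt(out, seqId & 0xFFFFFFFFL)        // seq id, widened without sign extension
    writeVarInt(out, nameBytes.length.toLong)    // name length in bytes
    out ++= nameBytes                            // name as UTF-8
    out.toArray
  }
}
```

A _Call_ to a method named `ping` with sequence id 1 would then start with `0x82 0x21 0x01 0x04`, followed by the four bytes of the name.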

## Struct

A *Struct* is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and
is followed by the encoded field value. The encoding can be summarized by the following BNF:

```
struct        ::= ( field-header field-value )* stop-field
field-header  ::= field-type field-id
```

Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any
order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore it is also
possible to handle unknown fields while decoding; these are simply ignored. While decoding, the field type can be used
to determine how to decode the field value.

Note that the field name is not encoded, so field renames in the IDL do not affect forward and backward compatibility.

The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has
another field-type than what is expected. Theoretically this could be detected at the cost of some additional checking.
Other implementations may perform this check and then either ignore the field, or return a protocol exception.

A *Union* is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded.

An *Exception* is encoded exactly the same as a struct.

### Struct encoding

```
Compact protocol field header (short form) and field value:
+--------+--------+...+--------+
|ddddtttt|     field value     |
+--------+--------+...+--------+

Compact protocol field header (2 to 4 bytes, long form) and field value:
+--------+--------+...+--------+--------+...+--------+
|0000tttt|      field id       |     field value     |
+--------+--------+...+--------+--------+...+--------+

Compact protocol stop field:
+--------+
|00000000|
+--------+
```

Where:

* `dddd` is the field id delta, an unsigned 4 bit integer, strictly positive.
* `tttt` is the field-type id, an unsigned 4 bit integer.
* `field id` is the field id, a signed 16 bit integer encoded as zigzag int.
* `field value` is the encoded field value.

The field id delta can be computed by `current-field-id - previous-field-id`, or just `current-field-id` if this is the
first field of the struct. The short form should be used when the field id delta is in the range 1 - 15 (inclusive).
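The choice between the short and the long form can be sketched as follows. The `FieldHeader` object is a hypothetical name; the sketch assumes that `previousFieldId` is passed as 0 for the first field of a struct, and that the long form is used for every delta outside the range 1 - 15, including negative deltas.

```scala
import scala.collection.mutable.ArrayBuffer

object FieldHeader {
  private def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)

  // Var int helper (least significant 7-bit group first) for non-negative values.
  private def varInt(value: Int): Array[Byte] = {
    val out = ArrayBuffer.empty[Byte]
    var n = value
    while ((n & ~0x7F) != 0) {
      out += ((n & 0x7F) | 0x80).toByte
      n >>>= 7
    }
    out += n.toByte
    out.toArray
  }

  /** Header bytes for one field; the field value itself is not included. */
  def encode(fieldType: Int, fieldId: Int, previousFieldId: Int): Array[Byte] = {
    val delta = fieldId - previousFieldId
    if (delta >= 1 && delta <= 15)
      Array(((delta << 4) | fieldType).toByte)          // short form: ddddtttt
    else
      fieldType.toByte +: varInt(intToZigZag(fieldId))  // long form: 0000tttt, then zigzag field id
  }
}
```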

The following field-types can be encoded:

* `BOOLEAN_TRUE`, encoded as `1`
* `BOOLEAN_FALSE`, encoded as `2`
* `BYTE`, encoded as `3`
* `I16`, encoded as `4`
* `I32`, encoded as `5`
* `I64`, encoded as `6`
* `DOUBLE`, encoded as `7`
* `BINARY`, used for binary and string fields, encoded as `8`
* `LIST`, encoded as `9`
* `SET`, encoded as `10`
* `MAP`, encoded as `11`
* `STRUCT`, used for both structs and union fields, encoded as `12`

Note that because there are 2 specific field types for the boolean values, the encoding of a boolean field value has no
length (0 bytes).
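To show how field headers, values and the stop field fit together, here is a deliberately restricted sketch for a hypothetical struct `{ 1: bool flag, 2: i32 count }`. The `StructExample` object is an illustrative name, and the sketch only covers short form headers and values whose zigzag int fits in a single var int byte.

```scala
object StructExample {
  private def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)

  // Short form field header (ddddtttt); field id deltas outside 1 to 15 are not covered here.
  private def shortHeader(delta: Int, fieldType: Int): Byte = {
    require(delta >= 1 && delta <= 15, "long form not covered by this sketch")
    ((delta << 4) | fieldType).toByte
  }

  def encode(flag: Boolean, count: Int): Array[Byte] = {
    val zigzag = intToZigZag(count)
    require(zigzag >= 0 && zigzag < 128, "multi-byte var ints not covered by this sketch")
    Array(
      shortHeader(1, if (flag) 1 else 2), // field 1: BOOLEAN_TRUE or BOOLEAN_FALSE, no value bytes
      shortHeader(1, 5),                  // field 2: I32, delta 2 - 1 = 1
      zigzag.toByte,                      // value of field 2 as a one byte var int
      0.toByte                            // stop field
    )
  }
}
```

With `flag = true` and `count = -1` this yields the four bytes `0x11 0x15 0x01 0x00`.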

## List and Set

Lists and sets are encoded the same: a header indicating the size and the element-type of the elements, followed by the
encoded elements.

```
Compact protocol list header (1 byte, short form) and elements:
+--------+--------+...+--------+
|sssstttt|      elements       |
+--------+--------+...+--------+

Compact protocol list header (2+ bytes, long form) and elements:
+--------+--------+...+--------+--------+...+--------+
|1111tttt|        size         |      elements       |
+--------+--------+...+--------+--------+...+--------+
```

Where:

* `ssss` is the size, 4 bit unsigned int, values `0` - `14`
* `tttt` is the element-type, a 4 bit unsigned int
* `size` is the size, a var int (int32), positive values `15` or higher
* `elements` are the encoded elements

The short form should be used when the length is in the range 0 - 14 (inclusive).
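A sketch of the list and set header follows. The `ListHeader` object is a hypothetical name, and the element-type is passed in as an already known 4 bit id.

```scala
import scala.collection.mutable.ArrayBuffer

object ListHeader {
  // Var int helper (least significant 7-bit group first) for non-negative values.
  private def varInt(value: Int): Array[Byte] = {
    val out = ArrayBuffer.empty[Byte]
    var n = value
    while ((n & ~0x7F) != 0) {
      out += ((n & 0x7F) | 0x80).toByte
      n >>>= 7
    }
    out += n.toByte
    out.toArray
  }

  /** Header bytes for a list or set; the encoded elements follow the header. */
  def encode(size: Int, elementType: Int): Array[Byte] = {
    require(size >= 0, "size must be >= 0")
    if (size <= 14)
      Array(((size << 4) | elementType).toByte)     // short form: sssstttt
    else
      (0xF0 | elementType).toByte +: varInt(size)   // long form: 1111tttt, then size
  }
}
```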

The following element-types are used (these are the same ids as the field-types, except that a boolean element uses a
single `BOOL` id; the ids of the binary protocol, such as `8` for `I32`, are _not_ used here):

* `BOOL`, encoded as `2` (the Java implementation writes `1` and accepts both when reading)
* `BYTE`, encoded as `3`
* `I16`, encoded as `4`
* `I32`, encoded as `5`
* `I64`, encoded as `6`
* `DOUBLE`, encoded as `7`
* `BINARY`, used for binary and string fields, encoded as `8`
* `LIST`, encoded as `9`
* `SET`, encoded as `10`
* `MAP`, encoded as `11`
* `STRUCT`, used for structs and union fields, encoded as `12`

The maximum list/set size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).

## Map

Maps are encoded with a header indicating the size, the element-type of the keys and the element-type of the values,
followed by the encoded key value pairs. The encoding follows this BNF:

```
map           ::= empty-map | non-empty-map
empty-map     ::= `0`
non-empty-map ::= size key-element-type value-element-type (key value)+
```

```
Compact protocol map header (1 byte, empty map):
+--------+
|00000000|
+--------+

Compact protocol map header (2+ bytes, non empty map) and key value pairs:
+--------+...+--------+--------+--------+...+--------+
|        size         |kkkkvvvv|   key value pairs   |
+--------+...+--------+--------+--------+...+--------+
```

Where:

* `size` is the size, a var int (int32), strictly positive values
* `kkkk` is the key element-type, a 4 bit unsigned int
* `vvvv` is the value element-type, a 4 bit unsigned int
* `key value pairs` are the encoded keys and values

The element-types are the same as for lists. The full list is included in the 'List and Set' section.

The maximum map size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).
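The map header described in this section can be sketched as below. The `MapHeader` object is a hypothetical name, and both element-types are passed in as already known 4 bit ids.

```scala
import scala.collection.mutable.ArrayBuffer

object MapHeader {
  // Var int helper (least significant 7-bit group first) for non-negative values.
  private def varInt(value: Int): Array[Byte] = {
    val out = ArrayBuffer.empty[Byte]
    var n = value
    while ((n & ~0x7F) != 0) {
      out += ((n & 0x7F) | 0x80).toByte
      n >>>= 7
    }
    out += n.toByte
    out.toArray
  }

  /** Header bytes for a map; the encoded key value pairs follow the header. */
  def encode(size: Int, keyType: Int, valueType: Int): Array[Byte] = {
    require(size >= 0, "size must be >= 0")
    if (size == 0)
      Array(0.toByte)                                     // empty map: a single zero byte, no types
    else
      varInt(size) :+ ((keyType << 4) | valueType).toByte // size, then kkkkvvvv
  }
}
```

Because an empty map is a single zero byte, its key and value element-types are not present on the wire at all.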

# BNF notation used in this document

The following BNF notation is used:

* a plus `+` appended to an item represents repetition; the item is repeated 1 or more times
* a star `*` appended to an item represents optional repetition; the item is repeated 0 or more times
* a pipe `|` between items represents choice; the first matching item is selected
* parentheses `(` and `)` are used for grouping multiple items