Thrift Binary protocol encoding
===============================

--------------------------------------------------------------------

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

--------------------------------------------------------------------

This document describes the wire encoding for RPC using the older Thrift *binary protocol*.

The information here is _mostly_ based on the Java implementation in the Apache thrift library (version 0.9.1 and
0.9.3). Other implementations, however, should behave the same.

For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache.org/static/files/thrift-20070401.pdf).

# Contents

* Binary protocol
  * Base types
  * Message
  * Struct
  * List and Set
  * Map
* BNF notation used in this document

# Binary protocol

## Base types

### Integer encoding

In the _binary protocol_ integers are encoded with the most significant byte first (big endian byte order, aka network
order). An `int8` needs 1 byte, an `int16` 2, an `int32` 4 and an `int64` needs 8 bytes.

The CPP version has the option to use the binary protocol with little endian order. Little endian gives a small but
noticeable performance boost because contemporary CPUs use little endian when storing integers to RAM.
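The big endian integer encoding described in this section can be sketched with Python's `struct` module. This is only an illustration, not code from any Thrift library; the helper names `encode_int` and `decode_int` are hypothetical.

```python
import struct

# Big endian ("network order") struct format codes for the four integer widths.
INT_FORMATS = {"int8": ">b", "int16": ">h", "int32": ">i", "int64": ">q"}


def encode_int(kind: str, value: int) -> bytes:
    """Encode a signed integer of the given width, most significant byte first.

    Hypothetical helper for illustration only.
    """
    return struct.pack(INT_FORMATS[kind], value)


def decode_int(kind: str, data: bytes) -> int:
    """Decode a signed big endian integer of the given width."""
    return struct.unpack(INT_FORMATS[kind], data)[0]


print(encode_int("int32", 1).hex())   # 00000001
print(encode_int("int16", -2).hex())  # fffe
```

Negative values use two's complement, which is what the signed format codes of `struct` produce.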

### Enum encoding

The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an int32.

### Binary encoding

Binary is sent as follows:

```
Binary protocol, binary data, 4+ bytes:
+--------+--------+--------+--------+--------+...+--------+
| byte length                       | bytes                |
+--------+--------+--------+--------+--------+...+--------+
```

Where:

* `byte length` is the length of the byte array, a signed 32 bit integer encoded in network (big endian) order (must be >= 0).
* `bytes` are the bytes of the byte array.
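A minimal sketch of the length-prefixed layout above, assuming the whole value is already in memory. The names `encode_binary` and `decode_binary` are hypothetical and do not come from a Thrift library.

```python
import struct
from typing import Tuple


def encode_binary(payload: bytes) -> bytes:
    """Prefix the payload with its length as a signed big endian int32."""
    return struct.pack(">i", len(payload)) + payload


def decode_binary(data: bytes, offset: int = 0) -> Tuple[bytes, int]:
    """Return (payload, next_offset); reject negative or truncated lengths."""
    (length,) = struct.unpack_from(">i", data, offset)
    if length < 0:
        raise ValueError("byte length must be >= 0")
    start = offset + 4
    end = start + length
    if end > len(data):
        raise ValueError("truncated binary value")
    return data[start:end], end


print(encode_binary(b"abc").hex())  # 00000003616263
```

Returning the next offset lets a caller continue decoding whatever follows the binary value in the same buffer.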

### String encoding

*String*s are first encoded to UTF-8, and then sent as binary.
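One consequence worth illustrating is that the length prefix counts UTF-8 bytes, not characters. The sketch below uses hypothetical helper names (`encode_string`, `decode_string`) and assumes the string is the only value in the buffer.

```python
import struct


def encode_string(text: str) -> bytes:
    """Encode to UTF-8, then send as binary: int32 byte length plus bytes."""
    utf8 = text.encode("utf-8")
    return struct.pack(">i", len(utf8)) + utf8


def decode_string(data: bytes) -> str:
    """Decode a string that starts at the beginning of the buffer."""
    (length,) = struct.unpack_from(">i", data, 0)
    if length < 0:
        raise ValueError("byte length must be >= 0")
    return data[4:4 + length].decode("utf-8")


# U+00E9 is one character but two UTF-8 bytes, so the length prefix is 2.
print(encode_string("\u00e9").hex())  # 00000002c3a9
```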

### Double encoding

Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit
layout. Most run-times provide a library to make this conversion. Both the binary protocol and the compact protocol then
encode the int64 in 8 bytes in big endian order.
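The two-step conversion can be sketched as follows. Python has no direct equivalent of a "double to int64 bits" call, so this illustration reinterprets the bytes with `struct`; the helper names are hypothetical.

```python
import struct


def double_to_int64_bits(value: float) -> int:
    """Reinterpret the IEEE 754 double bit layout as a signed 64 bit integer."""
    return struct.unpack(">q", struct.pack(">d", value))[0]


def encode_double(value: float) -> bytes:
    """Encode the int64 bit pattern in 8 bytes, big endian."""
    return struct.pack(">q", double_to_int64_bits(value))


print(encode_double(1.0).hex())  # 3ff0000000000000
```

In Python the result is identical to `struct.pack(">d", value)`; the intermediate int64 is shown only to mirror the description above.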

### Boolean encoding

Values of `bool` type are first converted to an int8. True is converted to `1`, false to `0`.

## Message

A `Message` can be encoded in two different ways:

```
Binary protocol Message, strict encoding, 12+ bytes:
+--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
|1vvvvvvv|vvvvvvvv|unused  |00000mmm| name length                       | name                | seq id                            |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
```

Where:

* `vvvvvvvvvvvvvvv` is the version, an unsigned 15 bit number fixed to `1` (in binary: `000 0000 0000 0001`).
  The leading bit is `1`.
* `unused` is an ignored byte.
* `mmm` is the message type, an unsigned 3 bit integer. The 5 leading bits must be `0` as some clients (checked for
  Java in 0.9.1) take the whole byte.
* `name length` is the byte length of the name field, a signed 32 bit integer encoded in network (big endian) order (must be >= 0).
* `name` is the method name, a UTF-8 encoded string.
* `seq id` is the sequence id, a signed 32 bit integer encoded in network (big endian) order.

The second, older encoding (aka non-strict) is:

```
Binary protocol Message, old encoding, 9+ bytes:
+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
| name length                       | name                |00000mmm| seq id                            |
+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
```

Where `name length`, `name`, `mmm`, `seq id` are as above.

Because `name length` must be non-negative (therefore its first bit is always `0`), while the strict format always
starts with a `1` bit, the first bit allows the receiver to see whether the strict format or the old format is used.
Therefore a server and client using the different variants of the binary protocol can transparently talk with each
other. However, when strict mode is enforced, the old format is rejected.

Message types are encoded with the following values:

* _Call_: 1
* _Reply_: 2
* _Exception_: 3
* _Oneway_: 4
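The two message layouts, and the first-bit check that tells them apart, can be sketched as below. This is an illustration under the assumptions stated in the comments, not the Java implementation; all function and constant names are hypothetical.

```python
import struct

CALL, REPLY, EXCEPTION, ONEWAY = 1, 2, 3, 4

# Leading bit 1, 15 bit version 1, unused byte 0, message type in the low byte.
VERSION_1 = 0x80010000


def encode_message_strict(name: str, msg_type: int, seq_id: int) -> bytes:
    raw = name.encode("utf-8")
    return (struct.pack(">I", VERSION_1 | msg_type)
            + struct.pack(">i", len(raw)) + raw
            + struct.pack(">i", seq_id))


def encode_message_old(name: str, msg_type: int, seq_id: int) -> bytes:
    raw = name.encode("utf-8")
    return (struct.pack(">i", len(raw)) + raw
            + struct.pack(">b", msg_type)
            + struct.pack(">i", seq_id))


def decode_message(data: bytes, strict_only: bool = False):
    """Return (name, msg_type, seq_id, next_offset) for either encoding."""
    (first,) = struct.unpack_from(">I", data, 0)
    if first & 0x80000000:  # first bit is 1: strict encoding
        if (first >> 16) & 0x7FFF != 1:
            raise ValueError("unsupported version")
        msg_type = first & 0xFF  # whole byte; the 5 leading bits should be 0
        (name_len,) = struct.unpack_from(">i", data, 4)
        if name_len < 0:
            raise ValueError("name length must be >= 0")
        name_end = 8 + name_len
        name = data[8:name_end].decode("utf-8")
        (seq_id,) = struct.unpack_from(">i", data, name_end)
        return name, msg_type, seq_id, name_end + 4
    if strict_only:
        raise ValueError("old encoding rejected in strict mode")
    name_end = 4 + first  # first bit is 0: the word is the name length
    name = data[4:name_end].decode("utf-8")
    msg_type = data[name_end]
    (seq_id,) = struct.unpack_from(">i", data, name_end + 1)
    return name, msg_type, seq_id, name_end + 5


print(encode_message_strict("ping", CALL, 7).hex())
# 80010001000000047070696e6700000007 would be wrong; the actual bytes are:
# 80010001 00000004 70696e67 00000007
```

The returned offset marks where the message body (an encoded struct) would start.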

## Struct

A *Struct* is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and
is followed by the encoded field value. The encoding can be summarized by the following BNF:

```
struct        ::= ( field-header field-value )* stop-field
field-header  ::= field-type field-id
```

Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any
order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore it is also
possible to handle unknown fields while decoding; these are simply ignored. While decoding, the field type can be used to
determine how to decode the field value.

Note that the field name is not encoded so field renames in the IDL do not affect forward and backward compatibility.

The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has
another field-type than what is expected. Theoretically this could be detected at the cost of some additional checking.
Other implementations may perform this check and then either ignore the field, or return a protocol exception.

A *Union* is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded.

An *Exception* is encoded exactly the same as a struct.

### Struct encoding

In the binary protocol field headers and the stop field are encoded as follows:

```
Binary protocol field header and field value:
+--------+--------+--------+--------+...+--------+
|tttttttt| field id        | field value         |
+--------+--------+--------+--------+...+--------+

Binary protocol stop field:
+--------+
|00000000|
+--------+
```

Where:

* `tttttttt` is the field-type, a signed 8 bit integer.
* `field id` is the field-id, a signed 16 bit integer in big endian order.
* `field value` is the encoded field value.

The following field-types are used:

* `BOOL`, encoded as `2`
* `BYTE`, encoded as `3`
* `DOUBLE`, encoded as `4`
* `I16`, encoded as `6`
* `I32`, encoded as `8`
* `I64`, encoded as `10`
* `STRING`, used for binary and string fields, encoded as `11`
* `STRUCT`, used for structs and union fields, encoded as `12`
* `MAP`, encoded as `13`
* `SET`, encoded as `14`
* `LIST`, encoded as `15`
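As a worked illustration of field headers and the stop field, the sketch below encodes and decodes a made-up struct with an `i32` field (id 1) and a `string` field (id 2). The struct, the helper names and the choice to raise on other field-types are assumptions for this example; a fuller decoder would skip unknown fields instead.

```python
import struct

T_STOP, T_I32, T_STRING = 0, 8, 11


def encode_field(field_type: int, field_id: int, value: bytes) -> bytes:
    """Field header (int8 field-type, big endian int16 field-id) plus value."""
    return struct.pack(">bh", field_type, field_id) + value


def encode_example_struct(user_id: int, name: str) -> bytes:
    """Encode the hypothetical struct { 1: i32 id, 2: string name }."""
    raw = name.encode("utf-8")
    return (encode_field(T_I32, 1, struct.pack(">i", user_id))
            + encode_field(T_STRING, 2, struct.pack(">i", len(raw)) + raw)
            + bytes([T_STOP]))


def decode_example_struct(data: bytes):
    """Return ({field_id: value}, next_offset); fields may come in any order."""
    fields = {}
    offset = 0
    while True:
        (field_type,) = struct.unpack_from(">b", data, offset)
        offset += 1
        if field_type == T_STOP:
            return fields, offset
        (field_id,) = struct.unpack_from(">h", data, offset)
        offset += 2
        if field_type == T_I32:
            (value,) = struct.unpack_from(">i", data, offset)
            offset += 4
        elif field_type == T_STRING:
            (length,) = struct.unpack_from(">i", data, offset)
            value = data[offset + 4:offset + 4 + length]
            offset += 4 + length
        else:
            # Simplification: this sketch only understands two field-types.
            raise ValueError("field-type %d not handled in this sketch" % field_type)
        fields[field_id] = value


print(encode_example_struct(42, "ab").hex())
# 0800010000002a0b000200000002616200
```

Because the decoder keys on the field-id from each header, swapping the two encoded fields yields the same result.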

## List and Set

Lists and sets are encoded the same: a header indicating the size and the element-type of the elements, followed by the
encoded elements.

```
Binary protocol list (5+ bytes) and elements:
+--------+--------+--------+--------+--------+--------+...+--------+
|tttttttt| size                              | elements            |
+--------+--------+--------+--------+--------+--------+...+--------+
```

Where:

* `tttttttt` is the element-type, encoded as an int8
* `size` is the size, encoded as an int32, non-negative values only
* `elements` are the element values

The element-type values are the same as field-types. The full list is included in the struct section above.

The maximum list/set size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).
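A minimal sketch of the list header and elements for a list of `i32` values, including an optional size limit. The helper names and the `max_size` parameter are hypothetical and only stand in for whatever configuration an implementation offers.

```python
import struct

T_I32 = 8  # element-type value, same as the field-type for I32


def encode_i32_list(values) -> bytes:
    """Header (int8 element-type, int32 size) followed by the elements."""
    header = struct.pack(">bi", T_I32, len(values))
    return header + b"".join(struct.pack(">i", v) for v in values)


def decode_i32_list(data: bytes, max_size=None):
    """Decode a list of i32 that starts at the beginning of the buffer."""
    elem_type, size = struct.unpack_from(">bi", data, 0)
    if elem_type != T_I32:
        raise ValueError("unexpected element-type %d" % elem_type)
    if size < 0 or (max_size is not None and size > max_size):
        raise ValueError("list size %d not allowed" % size)
    return list(struct.unpack_from(">%di" % size, data, 5))


print(encode_i32_list([1, 2]).hex())  # 08000000020000000100000002
print(encode_i32_list([]).hex())      # 0800000000
```

Checking the size before reading elements keeps a corrupt or hostile header from triggering a huge allocation.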

## Map

Maps are encoded with a header indicating the size, the element-type of the keys and the element-type of the elements,
followed by the encoded elements. The encoding follows this BNF:

```
map ::= key-element-type value-element-type size ( key value )*
```

```
Binary protocol map (6+ bytes) and key value pairs:
+--------+--------+--------+--------+--------+--------+--------+...+--------+
|kkkkkkkk|vvvvvvvv| size                              | key value pairs     |
+--------+--------+--------+--------+--------+--------+--------+...+--------+
```

Where:

* `kkkkkkkk` is the key element-type, encoded as an int8
* `vvvvvvvv` is the value element-type, encoded as an int8
* `size` is the size of the map, encoded as an int32, non-negative values only
* `key value pairs` are the encoded keys and values

The element-type values are the same as field-types. The full list is included in the struct section above.

The maximum map size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).
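The map layout can be sketched for a map from `string` keys to `i32` values. The helper names are hypothetical, and the decoder assumes the map starts at the beginning of the buffer.

```python
import struct

T_I32, T_STRING = 8, 11  # element-type values, same as the field-types


def encode_string_i32_map(mapping) -> bytes:
    """Header (key type, value type, int32 size) followed by key value pairs."""
    parts = [struct.pack(">bbi", T_STRING, T_I32, len(mapping))]
    for key, value in mapping.items():
        raw = key.encode("utf-8")
        parts.append(struct.pack(">i", len(raw)) + raw)  # key, sent as binary
        parts.append(struct.pack(">i", value))           # value, an int32
    return b"".join(parts)


def decode_string_i32_map(data: bytes):
    key_type, value_type, size = struct.unpack_from(">bbi", data, 0)
    if (key_type, value_type) != (T_STRING, T_I32):
        raise ValueError("unexpected key or value element-type")
    if size < 0:
        raise ValueError("map size must be >= 0")
    offset = 6
    result = {}
    for _ in range(size):
        (key_len,) = struct.unpack_from(">i", data, offset)
        key = data[offset + 4:offset + 4 + key_len].decode("utf-8")
        offset += 4 + key_len
        (value,) = struct.unpack_from(">i", data, offset)
        offset += 4
        result[key] = value
    return result


print(encode_string_i32_map({"a": 1}).hex())  # 0b0800000001000000016100000001
```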

# BNF notation used in this document

The following BNF notation is used:

* a plus `+` appended to an item represents repetition; the item is repeated 1 or more times
* a star `*` appended to an item represents optional repetition; the item is repeated 0 or more times
* a pipe `|` between items represents choice, the first matching item is selected
* parentheses `(` and `)` are used for grouping multiple items