Thrift Compact protocol encoding
================================

<!--
--------------------------------------------------------------------

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

--------------------------------------------------------------------
-->

This document describes the wire encoding for RPC using the Thrift compact protocol.

The information here is _mostly_ based on the Java implementation in the Apache Thrift library (version 0.9.1) and
[THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementations, however,
should behave the same.

For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache.org/static/files/thrift-20070401.pdf).

# Contents

* Compact protocol
* Base types
* Message
* Struct
* List and Set
* Map
* BNF notation used in this document

# Compact protocol

## Base types

### Integer encoding

The _compact protocol_ uses two encodings for ints: the _zigzag int_ and the _var int_.

Values of type `int32` and `int64` are first transformed to a zigzag int. A zigzag int folds positive and negative
numbers into the positive number space. When we read 0, 1, 2, 3 or 4 from the wire, this is translated to 0, -1, 1,
-2 or 2 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back:

```scala
// 32 bit: the arithmetic shift (>>) spreads the sign bit over the whole word
def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
def zigzagToInt(n: Int): Int = (n >>> 1) ^ -(n & 1)
// 64 bit: same formulas on Long
def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
def zigzagToLong(n: Long): Long = (n >>> 1) ^ -(n & 1)
```

The zigzag int is then encoded as a var int. Var ints take 1 to 5 bytes (int32) or 1 to 10 bytes (int64). The most
significant bit of each byte indicates whether more bytes follow. The least significant 7 bits of each byte are
concatenated to form the number, where the first byte holds the least significant bits (so the 7-bit groups are in
little endian order, the same layout as unsigned LEB128).

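The two steps can be sketched in Scala as follows. This is a minimal illustration, not the library's code: the object and function names are hypothetical, and it assumes the 7-bit groups are written least significant first (the unsigned LEB128 layout).

```scala
object VarIntSketch {
  // ZigZag formulas, as given in the text.
  def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
  def zigzagToInt(n: Int): Int = (n >>> 1) ^ -(n & 1)

  // Encode the 32 bits of n (treated as unsigned) as a var int:
  // 7 payload bits per byte, least significant group first,
  // high bit set on every byte except the last one.
  def encodeVarInt32(n: Int): Array[Byte] =
    if ((n & ~0x7F) == 0) Array(n.toByte)
    else ((n & 0x7F) | 0x80).toByte +: encodeVarInt32(n >>> 7)

  // Decode a var int that starts at index 0 of bytes.
  def decodeVarInt32(bytes: Array[Byte]): Int = {
    var result = 0
    var shift = 0
    var i = 0
    while ((bytes(i) & 0x80) != 0) { // continuation bit set: more bytes follow
      result |= (bytes(i) & 0x7F) << shift
      shift += 7
      i += 1
    }
    result | ((bytes(i) & 0x7F) << shift)
  }

  // An int32 value on the wire: zigzag first, then var int.
  def encodeI32(n: Int): Array[Byte] = encodeVarInt32(intToZigZag(n))
}
```

Under these assumptions, `encodeVarInt32(300)` yields the two bytes `0xAC 0x02`, and `encodeI32(-300)` yields `0xD7 0x04` (the zigzag value is 599).
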
Var ints are sometimes used directly inside the compact protocol to represent positive numbers.

To encode an `int16` as zigzag int, it is first converted to an `int32` and then encoded as such. The type `int8` simply
uses a single byte as in the binary protocol.

### Enum encoding

The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an int32.

### Binary encoding

Binary is sent as follows:

```
Compact protocol, binary data, 1+ bytes:
+--------+...+--------+--------+...+--------+
| byte length         | bytes               |
+--------+...+--------+--------+...+--------+
```

Where:

* `byte length` is the length of the byte array, using var int encoding (must be >= 0).
* `bytes` are the bytes of the byte array.

### String encoding

Strings are first encoded to UTF-8, and then sent as binary.

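As an illustration of the binary and string layouts, the sketch below prefixes the payload with its var int length. The names are hypothetical, and the var int helper assumes least-significant-group-first order.

```scala
import java.nio.charset.StandardCharsets

object BinarySketch {
  // Var int helper (assumption: least significant 7-bit group first).
  def encodeVarInt32(n: Int): Array[Byte] =
    if ((n & ~0x7F) == 0) Array(n.toByte)
    else ((n & 0x7F) | 0x80).toByte +: encodeVarInt32(n >>> 7)

  // Binary: the byte length as a plain var int (no zigzag), then the bytes.
  def encodeBinary(data: Array[Byte]): Array[Byte] =
    encodeVarInt32(data.length) ++ data

  // String: UTF-8 bytes, sent as binary.
  def encodeString(s: String): Array[Byte] =
    encodeBinary(s.getBytes(StandardCharsets.UTF_8))
}
```

Note that the length counts bytes, not characters: `encodeString("é")` carries a length of 2.
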
### Double encoding

Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit
layout. Most run-times provide a library to make this conversion. However, while the binary protocol encodes the int64
in 8 bytes in big endian order, the compact protocol encodes it in little endian order. This is due to an early
implementation bug that eventually became the de-facto standard.

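A minimal sketch of this conversion on the JVM, using `java.lang.Double.doubleToLongBits` for the IEEE 754 bit layout and a little endian `ByteBuffer` for the byte order. The object and function names are hypothetical.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object DoubleSketch {
  // double -> IEEE 754 bits as int64 -> 8 bytes, little endian
  def encodeDouble(d: Double): Array[Byte] =
    ByteBuffer.allocate(8)
      .order(ByteOrder.LITTLE_ENDIAN)
      .putLong(java.lang.Double.doubleToLongBits(d))
      .array()

  // 8 bytes, little endian -> int64 -> double
  def decodeDouble(bytes: Array[Byte]): Double =
    java.lang.Double.longBitsToDouble(
      ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong())
}
```

For example, `1.0` has the bit pattern `0x3FF0000000000000`, so it is sent as `00 00 00 00 00 00 F0 3F`.
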
### Boolean encoding

A boolean is encoded differently depending on whether it is a field value (in a struct) or an element value (in a set,
list or map). Field values are encoded directly in the field header. Element values of type `bool` are sent as an int8;
true as `1` and false as `0`.

## Message

A `Message` on the wire looks as follows:

```
Compact protocol Message (4+ bytes):
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
|pppppppp|mmmvvvvv| seq id              | name length         | name                |
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
```

Where:

* `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82.
* `mmm` is the message type, an unsigned 3 bit integer.
* `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`.
* `seq id` is the sequence id, a signed 32 bit integer encoded as a var int.
* `name length` is the byte length of the name field, a signed 32 bit integer encoded as a var int (must be >= 0).
* `name` is the method name to invoke, a UTF-8 encoded string.

Message types are encoded with the following values:

* _Call_: 1
* _Reply_: 2
* _Exception_: 3
* _Oneway_: 4

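The message layout can be sketched as below. The names are hypothetical, the var int helper assumes least-significant-group-first order, and the sequence id is assumed to be non-negative for simplicity.

```scala
import java.nio.charset.StandardCharsets

object MessageSketch {
  val ProtocolId: Byte = 0x82.toByte
  val Version: Int = 1

  // Var int helper (assumption: least significant 7-bit group first).
  def encodeVarInt32(n: Int): Array[Byte] =
    if ((n & ~0x7F) == 0) Array(n.toByte)
    else ((n & 0x7F) | 0x80).toByte +: encodeVarInt32(n >>> 7)

  // messageType: 1 = Call, 2 = Reply, 3 = Exception, 4 = Oneway
  def encodeMessageHeader(messageType: Int, seqId: Int, name: String): Array[Byte] = {
    val nameBytes = name.getBytes(StandardCharsets.UTF_8)
    // mmmvvvvv: message type in the top 3 bits, version in the low 5 bits
    val typeAndVersion = ((messageType << 5) | Version).toByte
    Array(ProtocolId, typeAndVersion) ++
      encodeVarInt32(seqId) ++
      encodeVarInt32(nameBytes.length) ++
      nameBytes
  }
}
```

A _Call_ to a method named `ping` with sequence id 1 would then start with the bytes `82 21 01 04`, followed by the four bytes of the name.
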
## Struct

A Struct is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and
is followed by the encoded field value. The encoding can be summarized by the following BNF:

```
struct ::= ( field-header field-value )* stop-field
field-header ::= field-type field-id
```

Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any
order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore it is also
possible to handle unknown fields while decoding; these are simply ignored. While decoding, the field type can be used to
determine how to decode the field value.

Note that the field name is not encoded, so field renames in the IDL do not affect forward and backward compatibility.

The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has
a different field-type than expected. Theoretically this could be detected at the cost of some additional checking.
Other implementations may perform this check and then either ignore the field or return a protocol exception.

A Union is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded.

An Exception is encoded exactly the same as a struct.

### Struct encoding

```
Compact protocol field header (1 byte, short form) and field value:
+--------+--------+...+--------+
|ddddtttt| field value         |
+--------+--------+...+--------+

Compact protocol field header (2 to 4 bytes, long form) and field value:
+--------+--------+...+--------+--------+...+--------+
|0000tttt| field id            | field value         |
+--------+--------+...+--------+--------+...+--------+

Compact protocol stop field:
+--------+
|00000000|
+--------+
```

Where:

* `dddd` is the field id delta, an unsigned 4 bit integer, strictly positive.
* `tttt` is the field-type id, an unsigned 4 bit integer.
* `field id` is the field id, a signed 16 bit integer encoded as zigzag int.
* `field value` is the encoded field value.

The field id delta can be computed by `current-field-id - previous-field-id`, or just `current-field-id` if this is the
first field of the struct. The short form should be used when the field id delta is in the range 1 - 15 (inclusive).

The following field-types can be encoded:

* `BOOLEAN_TRUE`, encoded as `1`
* `BOOLEAN_FALSE`, encoded as `2`
* `BYTE`, encoded as `3`
* `I16`, encoded as `4`
* `I32`, encoded as `5`
* `I64`, encoded as `6`
* `DOUBLE`, encoded as `7`
* `BINARY`, used for binary and string fields, encoded as `8`
* `LIST`, encoded as `9`
* `SET`, encoded as `10`
* `MAP`, encoded as `11`
* `STRUCT`, used for both structs and union fields, encoded as `12`

Note that because there are 2 specific field types for the boolean values, the encoding of a boolean field value has no
length (0 bytes).

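The choice between the short and long header forms can be sketched as follows. The names are hypothetical; the sketch treats "first field of the struct" as a previous field id of 0 and assumes least-significant-group-first var ints.

```scala
object FieldHeaderSketch {
  def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)

  // Var int helper (assumption: least significant 7-bit group first).
  def encodeVarInt32(n: Int): Array[Byte] =
    if ((n & ~0x7F) == 0) Array(n.toByte)
    else ((n & 0x7F) | 0x80).toByte +: encodeVarInt32(n >>> 7)

  // fieldType is one of the 4 bit field-type ids listed in the text.
  // Pass previousFieldId = 0 for the first field of a struct.
  def encodeFieldHeader(fieldType: Int, fieldId: Short, previousFieldId: Short): Array[Byte] = {
    val delta = fieldId - previousFieldId
    if (delta >= 1 && delta <= 15)
      Array(((delta << 4) | fieldType).toByte) // short form: ddddtttt
    else
      // long form: 0000tttt, then the field id as zigzag var int
      Array(fieldType.toByte) ++ encodeVarInt32(intToZigZag(fieldId.toInt))
  }

  // The stop field that terminates every struct.
  val StopField: Array[Byte] = Array(0.toByte)
}
```

For a boolean field the header is the whole encoding: a `true` value in field 2 directly after field 1 is the single byte `0x11` (delta 1, type `BOOLEAN_TRUE`).
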
## List and Set

Lists and sets are encoded the same: a header indicating the size and the element-type of the elements, followed by the
encoded elements.

```
Compact protocol list header (1 byte, short form) and elements:
+--------+--------+...+--------+
|sssstttt| elements            |
+--------+--------+...+--------+

Compact protocol list header (2+ bytes, long form) and elements:
+--------+--------+...+--------+--------+...+--------+
|1111tttt| size                | elements            |
+--------+--------+...+--------+--------+...+--------+
```

Where:

* `ssss` is the size, 4 bit unsigned int, values `0` - `14`
* `tttt` is the element-type, a 4 bit unsigned int
* `size` is the size, a var int (int32), positive values `15` or higher
* `elements` are the encoded elements

The short form should be used when the size is in the range 0 - 14 (inclusive).

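A minimal sketch of the list/set header, with hypothetical names. The element-type is passed in as an opaque 4 bit id, and the var int helper assumes least-significant-group-first order.

```scala
object ListHeaderSketch {
  // Var int helper (assumption: least significant 7-bit group first).
  def encodeVarInt32(n: Int): Array[Byte] =
    if ((n & ~0x7F) == 0) Array(n.toByte)
    else ((n & 0x7F) | 0x80).toByte +: encodeVarInt32(n >>> 7)

  def encodeListHeader(elementType: Int, size: Int): Array[Byte] = {
    require(size >= 0, "size must not be negative")
    if (size <= 14)
      Array(((size << 4) | elementType).toByte) // short form: sssstttt
    else
      // long form: 1111tttt, then the size as var int
      Array((0xF0 | elementType).toByte) ++ encodeVarInt32(size)
  }
}
```

With an element-type id of 8, a list of 3 elements gets the header `0x38`, while a list of 20 elements gets `0xF8 0x14`.
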
The following element-types are used (note that these are _different_ from the field-types):

* `BOOL`, encoded as `2`
* `BYTE`, encoded as `3`
* `DOUBLE`, encoded as `4`
* `I16`, encoded as `6`
* `I32`, encoded as `8`
* `I64`, encoded as `10`
* `STRING`, used for binary and string fields, encoded as `11`
* `STRUCT`, used for structs and union fields, encoded as `12`
* `MAP`, encoded as `13`
* `SET`, encoded as `14`
* `LIST`, encoded as `15`

The maximum list/set size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).

## Map

Maps are encoded with a header indicating the size, the element-type of the keys and the element-type of the values,
followed by the encoded key value pairs. The encoding follows this BNF:

```
map ::= empty-map | non-empty-map
empty-map ::= `0`
non-empty-map ::= size key-element-type value-element-type (key value)+
```

```
Compact protocol map header (1 byte, empty map):
+--------+
|00000000|
+--------+

Compact protocol map header (2+ bytes, non-empty map) and key value pairs:
+--------+...+--------+--------+--------+...+--------+
| size                |kkkkvvvv| key value pairs     |
+--------+...+--------+--------+--------+...+--------+
```

Where:

* `size` is the size, a var int (int32), strictly positive values
* `kkkk` is the key element-type, a 4 bit unsigned int
* `vvvv` is the value element-type, a 4 bit unsigned int
* `key value pairs` are the encoded keys and values

The element-types are the same as for lists. The full list is included in the 'List and Set' section.

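The two map header forms can be sketched as below, again with hypothetical names, opaque 4 bit element-type ids, and a var int helper that assumes least-significant-group-first order.

```scala
object MapHeaderSketch {
  // Var int helper (assumption: least significant 7-bit group first).
  def encodeVarInt32(n: Int): Array[Byte] =
    if ((n & ~0x7F) == 0) Array(n.toByte)
    else ((n & 0x7F) | 0x80).toByte +: encodeVarInt32(n >>> 7)

  def encodeMapHeader(keyType: Int, valueType: Int, size: Int): Array[Byte] = {
    require(size >= 0, "size must not be negative")
    if (size == 0)
      Array(0.toByte) // empty map: a single zero byte, no type byte
    else
      // non-empty map: size as var int, then kkkkvvvv
      encodeVarInt32(size) ++ Array(((keyType << 4) | valueType).toByte)
  }
}
```

Unlike the list header, the map header has no packed short form: the size always comes first as a var int, and the type byte is omitted entirely for an empty map.
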
The maximum map size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).

# BNF notation used in this document

The following BNF notation is used:

* a plus `+` appended to an item represents repetition; the item is repeated 1 or more times
* a star `*` appended to an item represents optional repetition; the item is repeated 0 or more times
* a pipe `|` between items represents choice; the first matching item is selected
* parentheses `(` and `)` are used for grouping multiple items