============
Ceph formula
============

Ceph provides extraordinary data storage scalability. Thousands of client
hosts or KVMs can access petabytes to exabytes of data. Each one of your
applications can use the object, block or file system interfaces to the same
RADOS cluster simultaneously, which means your Ceph storage system serves as a
flexible foundation for all of your data storage needs.

Use salt-formula-linux for initial disk partitioning.


Daemons
-------

Ceph uses several daemons to handle data and cluster state. Each daemon type requires different computing capacity and hardware optimization.

These daemons are currently supported by the formula:

* MON (`ceph.mon`)
* OSD (`ceph.osd`)
* RGW (`ceph.radosgw`)


Architecture decisions
----------------------

Please refer to the upstream architecture documents before designing your cluster. A solid understanding of Ceph principles is essential for making the architecture decisions described below.
http://docs.ceph.com/docs/master/architecture/

* Ceph version

There are 3 or 4 stable releases every year and many nightly/dev releases. You should decide which version will be used, since only stable releases are recommended for production. Some releases are marked LTS (Long Term Stable) and receive bugfixes for a longer period, usually until the next LTS version is released.

* Number of MON daemons

Use 1 MON daemon for testing, 3 MONs for smaller production clusters and 5 MONs for very large production clusters. There is no need to have more than 5 MONs in a normal environment, because there isn't any significant benefit in running more. Ceph requires MONs to form a quorum, so more than 50% of the MONs must be up and running for the cluster to be fully operational. Every I/O operation will stop once less than 50% of the MONs are available, because they can't form a quorum. For example, with 5 MONs the cluster stays operational as long as at least 3 of them are up.

* Number of PGs

Placement groups provide the mapping between stored data and OSDs. It is necessary to calculate the number of PGs, because a decent number of PGs should be stored on each OSD. Please keep in mind that *decreasing the number of PGs* isn't possible and *increasing* it can affect cluster performance.

http://docs.ceph.com/docs/master/rados/operations/placement-groups/
http://ceph.com/pgcalc/
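
As a rough illustration of the rule of thumb used by the pgcalc tool linked above, the total PG count is commonly derived from the number of OSDs and the pool replica size, rounded up to a power of two. The numbers below are only an example; always check pgcalc for your actual pool layout.

.. code-block:: bash

    # total PGs ~= (number of OSDs * 100) / pool size, rounded up to a power of two
    OSDS=6
    SIZE=3
    PGS=$(( OSDS * 100 / SIZE ))
    TARGET=1
    while [ "$TARGET" -lt "$PGS" ]; do TARGET=$(( TARGET * 2 )); done
    echo "$TARGET"    # -> 256 for 6 OSDs and a replica size of 3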

* Daemon colocation

It is recommended to dedicate nodes for MONs and RGW, since colocation can have an influence on cluster operations. However, small clusters can run MONs on the OSD nodes, but it is critical to have enough resources for the MON daemons, because they are the most important part of the cluster.

Installing RGW on a node with other daemons isn't recommended, because the RGW daemon usually requires a lot of bandwidth and it can harm cluster health.

* Journal location

There are two ways to set up the journal:
 * **Colocated** journal is located (usually at the beginning) on the same disk as the partition for the data. This setup is easier to install and doesn't require any other disk to be used. However, a colocated setup is significantly slower than a dedicated one.
 * **Dedicated** journal is placed on a different disk than the data. This setup can deliver much higher performance than a colocated one, but it requires more disks in the servers. Journal drives should be carefully selected, because high I/O and durability are required.

* Store type (Bluestore/Filestore)

Recent versions of Ceph support Bluestore as a storage backend, and it should be used if available.

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

* Cluster and public network

The Ceph cluster is accessed over the network, so you need decent capacity to handle all the clients. There are two networks required for the cluster: the **public** network and the cluster network. The public network is used for client connections; MONs and OSDs are listening on this network. The second network is called the **cluster** network and is used for communication between OSDs.

Both networks should have dedicated interfaces; bonding interfaces and dedicating VLANs on bonded interfaces isn't allowed. A good practice is to dedicate more throughput to the cluster network, because cluster traffic is more important than client traffic.

* Pool parameters (size, min_size, type)

You should set up each pool according to its expected usage; at least `min_size`, `size` and the pool type should be considered.
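
For reference, these parameters can also be set at runtime with the standard ceph CLI; the pool name :code:`mypool` below is only an illustrative example:

.. code-block:: bash

    # create a replicated pool and adjust its redundancy parameters
    ceph osd pool create mypool 256 256 replicated
    ceph osd pool set mypool size 3        # number of replicas
    ceph osd pool set mypool min_size 2    # minimum replicas required to serve I/O
    ceph osd pool get mypool size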

* Cluster monitoring

See the *Ceph monitoring* section below for the pillar that enables metric collection.

* Hardware

Please refer to the upstream hardware recommendation guide for general information about hardware.

Ceph servers are required to fulfil special requirements, because the load generated by Ceph can be diametrically opposed to common loads.

http://docs.ceph.com/docs/master/start/hardware-recommendations/


Basic management commands
-------------------------

Cluster
*******

- :code:`ceph health` - check if cluster is healthy (:code:`ceph health detail` can provide more information)

.. code-block:: bash

    root@c-01:~# ceph health
    HEALTH_OK

- :code:`ceph status` - shows basic information about cluster

.. code-block:: bash

    root@c-01:~# ceph status
        cluster e2dc51ae-c5e4-48f0-afc1-9e9e97dfd650
         health HEALTH_OK
         monmap e1: 3 mons at {1=192.168.31.201:6789/0,2=192.168.31.202:6789/0,3=192.168.31.203:6789/0}
                election epoch 38, quorum 0,1,2 1,2,3
         osdmap e226: 6 osds: 6 up, 6 in
          pgmap v27916: 400 pgs, 2 pools, 21233 MB data, 5315 objects
                121 GB used, 10924 GB / 11058 GB avail
                     400 active+clean
      client io 481 kB/s rd, 132 kB/s wr, 185 op/

MON
***

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/

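A quick check of the monitor map and quorum state before diving into the troubleshooting guide (standard ceph commands, not part of the original examples):

.. code-block:: bash

    # summary of the monitor map and current quorum
    ceph mon stat
    # detailed quorum information
    ceph quorum_status --format json-pretty
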
OSD
***

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

- :code:`ceph osd tree` - show all OSDs and their state

.. code-block:: bash

    root@c-01:~# ceph osd tree
    ID WEIGHT   TYPE NAME     UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -4        0 host c-04
    -1 10.79993 root default
    -2  3.59998     host c-01
     0  1.79999         osd.0      up  1.00000          1.00000
     1  1.79999         osd.1      up  1.00000          1.00000
    -3  3.59998     host c-02
     2  1.79999         osd.2      up  1.00000          1.00000
     3  1.79999         osd.3      up  1.00000          1.00000
    -5  3.59998     host c-03
     4  1.79999         osd.4      up  1.00000          1.00000
     5  1.79999         osd.5      up  1.00000          1.00000

- :code:`ceph osd lspools` - list pools

.. code-block:: bash

    root@c-01:~# ceph osd lspools
    0 rbd,1 test

PG
**

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg

- :code:`ceph pg ls` - list placement groups

.. code-block:: bash

    root@c-01:~# ceph pg ls | head -n 4
    pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
    0.0 11 0 0 0 0 46137344 3044 3044 active+clean 2015-07-02 10:12:40.603692 226'10652 226:1798 [4,2,0] 4 [4,2,0] 4 0'0 2015-07-01 18:38:33.126953 0'0 2015-07-01 18:17:01.904194
    0.1 7 0 0 0 0 25165936 3026 3026 active+clean 2015-07-02 10:12:40.585833 226'5808 226:1070 [2,4,1] 2 [2,4,1] 2 0'0 2015-07-01 18:38:32.352721 0'0 2015-07-01 18:17:01.904198
    0.2 18 0 0 0 0 75497472 3039 3039 active+clean 2015-07-02 10:12:39.569630 226'17447 226:3213 [3,1,5] 3 [3,1,5] 3 0'0 2015-07-01 18:38:34.308228 0'0 2015-07-01 18:17:01.904199

- :code:`ceph pg map 1.1` - show mapping between PG and OSD

.. code-block:: bash

    root@c-01:~# ceph pg map 1.1
    osdmap e226 pg 1.1 (1.1) -> up [5,1,2] acting [5,1,2]


Sample pillars
==============

Common metadata for all nodes/roles

.. code-block:: yaml

    ceph:
      common:
        version: luminous
        config:
          global:
            param1: value1
            param2: value1
            param3: value1
          pool_section:
            param1: value2
            param2: value2
            param3: value2
        fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
        members:
        - name: cmn01
          host: 10.0.0.1
        - name: cmn02
          host: 10.0.0.2
        - name: cmn03
          host: 10.0.0.3
        keyring:
          admin:
            key: AQBHPYhZv5mYDBAAvisaSzCTQkC5gywGUp/voA==
            caps:
              mds: "allow *"
              mgr: "allow *"
              mon: "allow *"
              osd: "allow *"
          bootstrap-osd:
            key: BQBHPYhZv5mYDBAAvisaSzCTQkC5gywGUp/voA==
            caps:
              mon: "allow profile bootstrap-osd"


Optional definition for cluster and public networks. Cluster network is used
for replication. Public network for front-end communication.

.. code-block:: yaml

    ceph:
      common:
        version: luminous
        fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
        ....
        public_network: 10.0.0.0/24, 10.1.0.0/24
        cluster_network: 10.10.0.0/24, 10.11.0.0/24


Ceph mon (control) roles
------------------------

Monitors: A Ceph Monitor maintains maps of the cluster state, including the
monitor map, the OSD map, the Placement Group (PG) map, and the CRUSH map.
Ceph maintains a history (called an “epoch”) of each state change in the Ceph
Monitors, Ceph OSD Daemons, and PGs.

.. code-block:: yaml

    ceph:
      common:
        config:
          mon:
            key: value
      mon:
        enabled: true
        keyring:
          mon:
            key: AQAnQIhZ6in5KxAAdf467upoRMWFcVg5pbh1yg==
            caps:
              mon: "allow *"
          admin:
            key: AQBHPYhZv5mYDBAAvisaSzCTQkC5gywGUp/voA==
            caps:
              mds: "allow *"
              mgr: "allow *"
              mon: "allow *"
              osd: "allow *"

Ceph mgr roles
--------------

The Ceph Manager daemon (ceph-mgr) runs alongside monitor daemons, to provide additional monitoring and interfaces to external monitoring and management systems. Since the 12.x (luminous) Ceph release, the ceph-mgr daemon is required for normal operations. The ceph-mgr daemon is an optional component in the 11.x (kraken) Ceph release.

By default, the manager daemon requires no additional configuration, beyond ensuring it is running. If there is no mgr daemon running, you will see a health warning to that effect, and some of the other information in the output of :code:`ceph status` will be missing or stale until a mgr is started.


.. code-block:: yaml

    ceph:
      mgr:
        enabled: true
        dashboard:
          enabled: true
          host: 10.103.255.252
          port: 7000

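To verify that an active mgr has registered and that the expected modules (such as the dashboard configured above) are enabled, the standard luminous commands below can be used; this is just a quick check, not part of the formula itself:

.. code-block:: bash

    # show the active mgr and whether one is available
    ceph mgr dump | grep -E 'active_name|available'
    # list enabled mgr modules (the dashboard should appear here)
    ceph mgr module ls
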

Ceph OSD (storage) roles
------------------------

.. code-block:: yaml

    ceph:
      common:
        config:
          osd:
            key: value
      osd:
        enabled: true
        ceph_host_id: '39'
        journal_size: 20480
        bluestore_block_db_size: 1073741824    # 1G
        bluestore_block_wal_size: 1073741824   # 1G
        bluestore_block_size: 807374182400     # 800G
        backend:
          filestore:
            disks:
            - dev: /dev/sdm
              enabled: false
              rule: hdd
              journal: /dev/ssd
              fs_type: xfs
              class: bestssd
              weight: 1.5
            - dev: /dev/sdl
              rule: hdd
              journal: /dev/ssd
              fs_type: xfs
              class: bestssd
              weight: 1.5
          bluestore:
            disks:
            - dev: /dev/sdb
            - dev: /dev/sdc
              block_db: /dev/ssd
              block_wal: /dev/ssd
            - dev: /dev/sdd
              enabled: false


Ceph client roles
-----------------

Simple ceph client service

.. code-block:: yaml

    ceph:
      client:
        config:
          global:
            mon initial members: ceph1,ceph2,ceph3
            mon host: 10.103.255.252:6789,10.103.255.253:6789,10.103.255.254:6789
        keyring:
          monitoring:
            key: 00000000000000000000000000000000000000==

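Assuming the keyring above ends up at the conventional ceph path, the client configuration can be verified with the ceph CLI; the client id and keyring path below are illustrative and may differ in your deployment:

.. code-block:: bash

    ceph --id monitoring \
         --keyring /etc/ceph/ceph.client.monitoring.keyring \
         health
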
On OpenStack control nodes these settings are usually located in the cinder-volume
or glance-registry services.

.. code-block:: yaml

    ceph:
      client:
        config:
          global:
            fsid: 00000000-0000-0000-0000-000000000000
            mon initial members: ceph1,ceph2,ceph3
            mon host: 10.103.255.252:6789,10.103.255.253:6789,10.103.255.254:6789
            osd_fs_mkfs_arguments_xfs:
            osd_fs_mount_options_xfs: rw,noatime
            network public: 10.0.0.0/24
            network cluster: 10.0.0.0/24
            osd_fs_type: xfs
          osd:
            osd journal size: 7500
            filestore xattr use omap: true
          mon:
            mon debug dump transactions: false
        keyring:
          cinder:
            key: 00000000000000000000000000000000000000==
          glance:
            key: 00000000000000000000000000000000000000==


Ceph gateway
------------

Rados gateway with keystone v2 auth backend

.. code-block:: yaml

    ceph:
      radosgw:
        enabled: true
        hostname: gw.ceph.lab
        bind:
          address: 10.10.10.1
          port: 8080
        identity:
          engine: keystone
          api_version: 2
          host: 10.10.10.100
          port: 5000
          user: admin
          password: password
          tenant: admin

Rados gateway with keystone v3 auth backend

.. code-block:: yaml

    ceph:
      radosgw:
        enabled: true
        hostname: gw.ceph.lab
        bind:
          address: 10.10.10.1
          port: 8080
        identity:
          engine: keystone
          api_version: 3
          host: 10.10.10.100
          port: 5000
          user: admin
          password: password
          project: admin
          domain: default

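Once the gateway is running, a plain HTTP request against the bind address (taken from the example above) is a simple liveness check; an anonymous request should return an S3-style XML response rather than a connection error:

.. code-block:: bash

    curl -i http://10.10.10.1:8080/

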
Ceph setup role
---------------

Replicated ceph storage pool

.. code-block:: yaml

    ceph:
      setup:
        pool:
          replicated_pool:
            pg_num: 256
            pgp_num: 256
            type: replicated
            crush_ruleset_name: 0

Erasure ceph storage pool

.. code-block:: yaml

    ceph:
      setup:
        pool:
          erasure_pool:
            pg_num: 256
            pgp_num: 256
            type: erasure
            crush_ruleset_name: 0

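After the setup state has been applied, the created pools and their parameters can be reviewed with:

.. code-block:: bash

    ceph osd pool ls detail    # pg_num, size and crush rule per pool
    ceph df                    # usage per pool
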
Generate CRUSH map
------------------

It is required to define the `type` for crush buckets and these types must start with `root` (top) and end with `host`. OSD daemons will be assigned to hosts according to their hostnames. The weight of a bucket will be calculated according to the weight of its children.

.. code-block:: yaml

    ceph:
      setup:
        crush:
          enabled: True
          tunables:
            choose_total_tries: 50
          type:
            - root
            - region
            - rack
            - host
          root:
            - name: root1
            - name: root2
          region:
            - name: eu-1
              parent: root1
            - name: eu-2
              parent: root1
            - name: us-1
              parent: root2
          rack:
            - name: rack01
              parent: eu-1
            - name: rack02
              parent: eu-2
            - name: rack03
              parent: us-1
          rule:
            sata:
              ruleset: 0
              type: replicated
              min_size: 1
              max_size: 10
              steps:
                - take crushroot.performanceblock.satahss.1
                - choseleaf firstn 0 type failure_domain
                - emit

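After applying the CRUSH setup, the resulting hierarchy and rules can be inspected with the ceph CLI (:code:`sata` is the rule name from the example above; :code:`ceph osd crush tree` is available from luminous on):

.. code-block:: bash

    ceph osd crush tree
    ceph osd crush rule ls
    ceph osd crush rule dump sata
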
Ceph monitoring
---------------

Collect general cluster metrics

.. code-block:: yaml

    ceph:
      monitoring:
        cluster_stats:
          enabled: true
          ceph_user: monitoring

Collect metrics from monitor and OSD services

.. code-block:: yaml

    ceph:
      monitoring:
        node_stats:
          enabled: true


More information
================

* https://github.com/cloud-ee/ceph-salt-formula
* http://ceph.com/ceph-storage/
* http://ceph.com/docs/master/start/intro/


Documentation and bugs
======================

To learn how to install and update salt-formulas, consult the documentation
available online at:

    http://salt-formulas.readthedocs.io/

In the unfortunate event that bugs are discovered, they should be reported to
the appropriate issue tracker. Use the Github issue tracker for the specific
salt formula:

    https://github.com/salt-formulas/salt-formula-ceph/issues

For feature requests, bug reports or blueprints affecting the entire ecosystem,
use the Launchpad salt-formulas project:

    https://launchpad.net/salt-formulas

You can also join the salt-formulas-users team and subscribe to the mailing list:

    https://launchpad.net/~salt-formulas-users

Developers wishing to work on the salt-formulas projects should always base
their work on the master branch and submit pull requests against the specific
formula.

    https://github.com/salt-formulas/salt-formula-ceph

Any questions or feedback are always welcome, so feel free to join our IRC
channel:

    #salt-formulas @ irc.freenode.net