============
Ceph formula
============

Ceph provides extraordinary data storage scalability. Thousands of client
hosts or KVMs can access petabytes to exabytes of data. Each of your
applications can use the object, block or file system interfaces to the same
RADOS cluster simultaneously, which means your Ceph storage system serves as a
flexible foundation for all of your data storage needs.

Use salt-formula-linux for initial disk partitioning.


Daemons
-------

Ceph uses several daemons to handle data and cluster state. Each daemon type requires different computing capacity and hardware optimization.

These daemons are currently supported by the formula:

* MON (`ceph.mon`)
* OSD (`ceph.osd`)
* RGW (`ceph.radosgw`)


Architecture decisions
----------------------

Please refer to upstream architecture documents before designing your cluster. A solid understanding of Ceph principles is essential for making the architecture decisions described below.

http://docs.ceph.com/docs/master/architecture/

* Ceph version

There are 3 or 4 stable releases every year and many nightly/dev releases. You should decide which version will be used, since only stable releases are recommended for production. Some releases are marked LTS (Long Term Stable) and receive bug fixes for a longer period, usually until the next LTS version is released.

* Number of MON daemons

Use 1 MON daemon for testing, 3 MONs for smaller production clusters and 5 MONs for very large production clusters. There is no need to have more than 5 MONs in a normal environment because there isn't any significant benefit in running more. Ceph requires the MONs to form a quorum, so you need to have more than 50% of the MONs up and running for a fully operational cluster; for example, with 3 MONs at least 2 must be up, with 5 MONs at least 3. Every I/O operation will stop once fewer than 50% of the MONs are available, because they can't form a quorum.

* Number of PGs

Placement groups provide the mapping between stored data and OSDs. It is necessary to calculate the number of PGs because each OSD should hold a reasonable number of PGs. Please keep in mind that *decreasing* the number of PGs isn't possible and *increasing* it can affect cluster performance.

http://docs.ceph.com/docs/master/rados/operations/placement-groups/
http://ceph.com/pgcalc/

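As an illustration only (the OSD count and replica size below are assumptions, not a recommendation), the rule of thumb used by the pgcalc tool can be computed like this:

.. code-block:: bash

    # total PGs for one pool ~= (OSD count * 100) / replica size,
    # rounded up to the nearest power of two
    osds=6
    size=3
    python3 -c "import math; print(2 ** math.ceil(math.log(($osds * 100.0) / $size, 2)))"
    # -> 256
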
* Daemon colocation

It is recommended to dedicate nodes for MONs and RGW, since colocation can influence cluster operations. However, small clusters can run MONs on OSD nodes, but it is critical to have enough resources for the MON daemons because they are the most important part of the cluster.

Installing RGW on a node with other daemons isn't recommended because the RGW daemon usually requires a lot of bandwidth and it can harm cluster health.

* Journal location

There are two ways to set up the journal:

 * **Colocated** - the journal is located (usually at the beginning) on the same disk as the data partition. This setup is easier to install and doesn't require any additional disk. However, a colocated setup is significantly slower than a dedicated one.
 * **Dedicated** - the journal is placed on a different disk than the data. This setup can deliver much higher performance than a colocated one, but it requires more disks in the servers. Journal drives should be carefully selected because high I/O and durability are required.

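A minimal pillar sketch of both layouts for the filestore backend (device paths are illustrative; the complete OSD pillar is documented later in this README):

.. code-block:: yaml

    ceph:
      osd:
        backend:
          filestore:
            disks:
            # colocated: no journal device given, the journal stays on the data disk
            - dev: /dev/sdb
              fs_type: xfs
            # dedicated: the journal is placed on a separate (SSD) device
            - dev: /dev/sdc
              journal: /dev/sdd
              fs_type: xfs
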
* Store type (BlueStore/FileStore)

Recent versions of Ceph support BlueStore as the storage backend, and it should be used if available.

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

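A minimal sketch of a BlueStore disk entry in this formula's pillar (device names are illustrative; the complete OSD pillar is shown later in this README):

.. code-block:: yaml

    ceph:
      osd:
        backend:
          bluestore:
            disks:
            - dev: /dev/sdc
              block_db: /dev/ssd
              block_wal: /dev/ssd
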
* Cluster and public network

A Ceph cluster is accessed over the network, so you need decent capacity to handle all the clients. There are two networks required for the cluster: the **public** network and the **cluster** network. The public network is used for client connections; MONs and OSDs listen on this network. The second network, the cluster network, is used for communication between OSDs. See the sample pillar below for the corresponding `public_network` and `cluster_network` settings.

Both networks should have dedicated interfaces; bonding interfaces and dedicating VLANs on bonded interfaces isn't allowed. Good practice is to dedicate more throughput to the cluster network, because cluster traffic is more important than client traffic.

* Pool parameters (size, min_size, type)

You should set up each pool according to its expected usage; at least `min_size`, `size` and the pool type should be considered.

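For illustration with the plain Ceph CLI (the pool name and values are assumptions; the `ceph.setup` role described later in this README can manage pools declaratively):

.. code-block:: bash

    # create a replicated pool with 256 placement groups and tune its replication
    ceph osd pool create mypool 256 256 replicated
    ceph osd pool set mypool size 3
    ceph osd pool set mypool min_size 2
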
* Cluster monitoring (see the Ceph monitoring section below)

* Hardware

Please refer to the upstream hardware recommendation guide for general information about hardware.

Ceph servers must fulfil special requirements, because the load generated by Ceph can be diametrically opposed to common workloads.

http://docs.ceph.com/docs/master/start/hardware-recommendations/


Basic management commands
-------------------------

Cluster
*******

- :code:`ceph health` - check if the cluster is healthy (:code:`ceph health detail` can provide more information)

.. code-block:: bash

    root@c-01:~# ceph health
    HEALTH_OK

- :code:`ceph status` - shows basic information about the cluster

.. code-block:: bash

    root@c-01:~# ceph status
        cluster e2dc51ae-c5e4-48f0-afc1-9e9e97dfd650
         health HEALTH_OK
         monmap e1: 3 mons at {1=192.168.31.201:6789/0,2=192.168.31.202:6789/0,3=192.168.31.203:6789/0}
                election epoch 38, quorum 0,1,2 1,2,3
         osdmap e226: 6 osds: 6 up, 6 in
          pgmap v27916: 400 pgs, 2 pools, 21233 MB data, 5315 objects
                121 GB used, 10924 GB / 11058 GB avail
                     400 active+clean
      client io 481 kB/s rd, 132 kB/s wr, 185 op/s

MON
***

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/

OSD
***

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

- :code:`ceph osd tree` - show all OSDs and their state

.. code-block:: bash

    root@c-01:~# ceph osd tree
    ID WEIGHT   TYPE NAME     UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -4        0 host c-04
    -1 10.79993 root default
    -2  3.59998     host c-01
     0  1.79999         osd.0      up  1.00000          1.00000
     1  1.79999         osd.1      up  1.00000          1.00000
    -3  3.59998     host c-02
     2  1.79999         osd.2      up  1.00000          1.00000
     3  1.79999         osd.3      up  1.00000          1.00000
    -5  3.59998     host c-03
     4  1.79999         osd.4      up  1.00000          1.00000
     5  1.79999         osd.5      up  1.00000          1.00000

- :code:`ceph osd lspools` - list pools

.. code-block:: bash

    root@c-01:~# ceph osd lspools
    0 rbd,1 test

PG
***

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg

- :code:`ceph pg ls` - list placement groups

.. code-block:: bash

    root@c-01:~# ceph pg ls | head -n 4
    pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
    0.0 11 0 0 0 0 46137344 3044 3044 active+clean 2015-07-02 10:12:40.603692 226'10652 226:1798 [4,2,0] 4 [4,2,0] 4 0'0 2015-07-01 18:38:33.126953 0'0 2015-07-01 18:17:01.904194
    0.1 7 0 0 0 0 25165936 3026 3026 active+clean 2015-07-02 10:12:40.585833 226'5808 226:1070 [2,4,1] 2 [2,4,1] 2 0'0 2015-07-01 18:38:32.352721 0'0 2015-07-01 18:17:01.904198
    0.2 18 0 0 0 0 75497472 3039 3039 active+clean 2015-07-02 10:12:39.569630 226'17447 226:3213 [3,1,5] 3 [3,1,5] 3 0'0 2015-07-01 18:38:34.308228 0'0 2015-07-01 18:17:01.904199

- :code:`ceph pg map 1.1` - show mapping between PG and OSDs

.. code-block:: bash

    root@c-01:~# ceph pg map 1.1
    osdmap e226 pg 1.1 (1.1) -> up [5,1,2] acting [5,1,2]


Sample pillars
==============

Common metadata for all nodes/roles

.. code-block:: yaml

    ceph:
      common:
        version: luminous
        config:
          global:
            param1: value1
            param2: value1
            param3: value1
          pool_section:
            param1: value2
            param2: value2
            param3: value2
        fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
        members:
        - name: cmn01
          host: 10.0.0.1
        - name: cmn02
          host: 10.0.0.2
        - name: cmn03
          host: 10.0.0.3
        keyring:
          admin:
            caps:
              mds: "allow *"
              mgr: "allow *"
              mon: "allow *"
              osd: "allow *"
          bootstrap-osd:
            caps:
              mon: "allow profile bootstrap-osd"

Optional definition for cluster and public networks. The cluster network is used
for replication, the public network for front-end communication.

.. code-block:: yaml

    ceph:
      common:
        version: luminous
        fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
        ....
        public_network: 10.0.0.0/24, 10.1.0.0/24
        cluster_network: 10.10.0.0/24, 10.11.0.0/24


Ceph mon (control) roles
------------------------

Monitors: A Ceph Monitor maintains maps of the cluster state, including the
monitor map, the OSD map, the Placement Group (PG) map, and the CRUSH map.
Ceph maintains a history (called an “epoch”) of each state change in the Ceph
Monitors, Ceph OSD Daemons, and PGs.

.. code-block:: yaml

    ceph:
      common:
        config:
          mon:
            key: value
      mon:
        enabled: true
        keyring:
          mon:
            caps:
              mon: "allow *"
          admin:
            caps:
              mds: "allow *"
              mgr: "allow *"
              mon: "allow *"
              osd: "allow *"

Ceph mgr roles
--------------

The Ceph Manager daemon (ceph-mgr) runs alongside monitor daemons to provide additional monitoring and interfaces to external monitoring and management systems. Since the 12.x (luminous) Ceph release, the ceph-mgr daemon is required for normal operations. The ceph-mgr daemon is an optional component in the 11.x (kraken) Ceph release.

By default, the manager daemon requires no additional configuration beyond ensuring it is running. If there is no mgr daemon running, you will see a health warning to that effect, and some of the other information in the output of :code:`ceph status` will be missing or stale until a mgr is started.

.. code-block:: yaml

    ceph:
      mgr:
        enabled: true
        dashboard:
          enabled: true
          host: 10.103.255.252
          port: 7000


Ceph OSD (storage) roles
------------------------

.. code-block:: yaml

    ceph:
      common:
        version: luminous
        fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
        public_network: 10.0.0.0/24, 10.1.0.0/24
        cluster_network: 10.10.0.0/24, 10.11.0.0/24
        keyring:
          bootstrap-osd:
            caps:
              mon: "allow profile bootstrap-osd"
        ....
      osd:
        enabled: true
        crush_parent: rack01
        journal_size: 20480 (20G)
        bluestore_block_db_size: 10073741824 (10G)
        bluestore_block_wal_size: 10073741824 (10G)
        bluestore_block_size: 807374182400 (800G)
        backend:
          filestore:
            disks:
            - dev: /dev/sdm
              enabled: false
              journal: /dev/ssd
              fs_type: xfs
              class: bestssd
              weight: 1.5
            - dev: /dev/sdl
              journal: /dev/ssd
              fs_type: xfs
              class: bestssd
              weight: 1.5
          bluestore:
            disks:
            - dev: /dev/sdb
            - dev: /dev/sdc
              block_db: /dev/ssd
              block_wal: /dev/ssd
              class: ssd
              weight: 1.666
            - dev: /dev/sdd
              enabled: false


Ceph client roles - Deprecated, use ceph:common instead
--------------------------------------------------------

Simple ceph client service

.. code-block:: yaml

    ceph:
      client:
        config:
          global:
            mon initial members: ceph1,ceph2,ceph3
            mon host: 10.103.255.252:6789,10.103.255.253:6789,10.103.255.254:6789
        keyring:
          monitoring:
            key: 00000000000000000000000000000000000000==

On OpenStack control nodes, these settings are usually used by the cinder-volume
or glance-registry services.

.. code-block:: yaml

    ceph:
      client:
        config:
          global:
            fsid: 00000000-0000-0000-0000-000000000000
            mon initial members: ceph1,ceph2,ceph3
            mon host: 10.103.255.252:6789,10.103.255.253:6789,10.103.255.254:6789
            osd_fs_mkfs_arguments_xfs:
            osd_fs_mount_options_xfs: rw,noatime
            network public: 10.0.0.0/24
            network cluster: 10.0.0.0/24
            osd_fs_type: xfs
          osd:
            osd journal size: 7500
            filestore xattr use omap: true
          mon:
            mon debug dump transactions: false
        keyring:
          cinder:
            key: 00000000000000000000000000000000000000==
          glance:
            key: 00000000000000000000000000000000000000==


Ceph gateway
------------

Rados gateway with keystone v2 auth backend

.. code-block:: yaml

    ceph:
      radosgw:
        enabled: true
        hostname: gw.ceph.lab
        bind:
          address: 10.10.10.1
          port: 8080
        identity:
          engine: keystone
          api_version: 2
          host: 10.10.10.100
          port: 5000
          user: admin
          password: password
          tenant: admin

Rados gateway with keystone v3 auth backend

.. code-block:: yaml

    ceph:
      radosgw:
        enabled: true
        hostname: gw.ceph.lab
        bind:
          address: 10.10.10.1
          port: 8080
        identity:
          engine: keystone
          api_version: 3
          host: 10.10.10.100
          port: 5000
          user: admin
          password: password
          project: admin
          domain: default


Ceph setup role
---------------

Replicated ceph storage pool

.. code-block:: yaml

    ceph:
      setup:
        pool:
          replicated_pool:
            pg_num: 256
            pgp_num: 256
            type: replicated
            crush_rule: sata
            application: rbd

.. note:: For Kraken and earlier releases please specify crush_rule as a ruleset number.
          For Kraken and earlier releases the application param is not needed.

Erasure ceph storage pool

.. code-block:: yaml

    ceph:
      setup:
        pool:
          erasure_pool:
            pg_num: 256
            pgp_num: 256
            type: erasure
            crush_rule: ssd
            application: rbd

Generate CRUSH map - Recommended way
------------------------------------

It is required to define the `type` for crush buckets, and these types must start with `root` (top) and end with `host`. OSD daemons will be assigned to hosts according to their hostname. The weight of each bucket is calculated from the weight of its children.

If the pools in use have a size of 3, it is best to have 3 children of a specific type in the root CRUSH tree to replicate objects across (specified in the rule steps by 'type region').

.. code-block:: yaml

    ceph:
      setup:
        crush:
          enabled: True
          tunables:
            choose_total_tries: 50
            choose_local_tries: 0
            choose_local_fallback_tries: 0
            chooseleaf_descend_once: 1
            chooseleaf_vary_r: 1
            chooseleaf_stable: 1
            straw_calc_version: 1
            allowed_bucket_algs: 54
          type:
            - root
            - region
            - rack
            - host
            - osd
          root:
            - name: root-ssd
            - name: root-sata
          region:
            - name: eu-1
              parent: root-sata
            - name: eu-2
              parent: root-sata
            - name: eu-3
              parent: root-ssd
            - name: us-1
              parent: root-sata
          rack:
            - name: rack01
              parent: eu-1
            - name: rack02
              parent: eu-2
            - name: rack03
              parent: us-1
          rule:
            sata:
              ruleset: 0
              type: replicated
              min_size: 1
              max_size: 10
              steps:
                - take root-sata
                - chooseleaf firstn 0 type region
                - emit
            ssd:
              ruleset: 1
              type: replicated
              min_size: 1
              max_size: 10
              steps:
                - take root-ssd
                - chooseleaf firstn 0 type region
                - emit


Generate CRUSH map - Alternative way
------------------------------------

It's necessary to create a per-OSD pillar.

.. code-block:: yaml

    ceph:
      osd:
        crush:
          - type: root
            name: root1
          - type: region
            name: eu-1
          - type: rack
            name: rack01
          - type: host
            name: osd001


Apply CRUSH map
---------------

Before you apply the CRUSH map, please make sure that the settings in the generated file /etc/ceph/crushmap are correct.

.. code-block:: yaml

    ceph:
      setup:
        crush:
          enforce: true
        pool:
          images:
            crush_rule: sata
            application: rbd
          volumes:
            crush_rule: sata
            application: rbd
          vms:
            crush_rule: ssd
            application: rbd

.. note:: For Kraken and earlier releases please specify crush_rule as a ruleset number.
          For Kraken and earlier releases the application param is not needed.


Persist CRUSH map
-----------------

After the CRUSH map is applied to Ceph, it's recommended to persist the same settings even after OSD reboots.

.. code-block:: yaml

    ceph:
      osd:
        crush_update: false


Ceph monitoring
---------------

Collect general cluster metrics

.. code-block:: yaml

    ceph:
      monitoring:
        cluster_stats:
          enabled: true
          ceph_user: monitoring

Collect metrics from monitor and OSD services

.. code-block:: yaml

    ceph:
      monitoring:
        node_stats:
          enabled: true


More information
================

* https://github.com/cloud-ee/ceph-salt-formula
* http://ceph.com/ceph-storage/
* http://ceph.com/docs/master/start/intro/


Documentation and bugs
======================

To learn how to install and update salt-formulas, consult the documentation
available online at:

    http://salt-formulas.readthedocs.io/

In the unfortunate event that bugs are discovered, they should be reported to
the appropriate issue tracker. Use the GitHub issue tracker for a specific salt
formula:

    https://github.com/salt-formulas/salt-formula-ceph/issues

For feature requests, bug reports or blueprints affecting the entire ecosystem,
use the Launchpad salt-formulas project:

    https://launchpad.net/salt-formulas

You can also join the salt-formulas-users team and subscribe to the mailing list:

    https://launchpad.net/~salt-formulas-users

Developers wishing to work on the salt-formulas projects should always base
their work on the master branch and submit pull requests against the specific formula.

    https://github.com/salt-formulas/salt-formula-ceph

Any questions or feedback are always welcome, so feel free to join our IRC
channel:

    #salt-formulas @ irc.freenode.net