============
Ceph formula
============

Ceph provides extraordinary data storage scalability. Thousands of client
hosts or KVMs can access petabytes to exabytes of data. Each one of your
applications can use the object, block or file system interfaces to the same
RADOS cluster simultaneously, which means your Ceph storage system serves as a
flexible foundation for all of your data storage needs.

Use salt-formula-linux for initial disk partitioning.


Daemons
-------

Ceph uses several daemons to handle data and cluster state. Each daemon type requires different computing capacity and hardware optimization.

These daemons are currently supported by the formula:

* MON (`ceph.mon`)
* OSD (`ceph.osd`)
* RGW (`ceph.radosgw`)


Architecture decisions
----------------------

Please refer to the upstream architecture documents before designing your cluster. A solid understanding of Ceph principles is essential for making the architecture decisions described below.
http://docs.ceph.com/docs/master/architecture/

* Ceph version

There are 3 or 4 stable releases every year and many nightly/dev releases. You should decide which version will be used, since only stable releases are recommended for production. Some releases are marked LTS (Long Term Stable) and receive bugfixes for a longer period, usually until the next LTS version is released.

* Number of MON daemons

Use 1 MON daemon for testing, 3 MONs for smaller production clusters and 5 MONs for very large production clusters. There is no need to have more than 5 MONs in a normal environment, because there isn't any significant benefit in running more. Ceph requires MONs to form a quorum, so more than 50% of the MONs must be up and running for the cluster to be fully operational. Every I/O operation will stop once less than 50% of the MONs are available, because they can't form a quorum. For example, with 5 MONs the cluster stays operational as long as at least 3 of them are up.

* Number of PGs

Placement groups provide the mapping between stored data and OSDs. It is necessary to calculate the number of PGs, because a decent number of PGs should be stored on each OSD. Please keep in mind that *decreasing the number of PGs* isn't possible and *increasing* it can affect cluster performance.

http://docs.ceph.com/docs/master/rados/operations/placement-groups/
http://ceph.com/pgcalc/
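
As a rough illustration of the rule of thumb used by the pgcalc tool linked above, the total PG count is commonly derived from the number of OSDs and the pool replica size, rounded up to a power of two. The numbers below are only an example; always check pgcalc for your actual pool layout.

.. code-block:: bash

    # total PGs ~= (number of OSDs * 100) / pool size, rounded up to a power of two
    OSDS=6
    SIZE=3
    PGS=$(( OSDS * 100 / SIZE ))
    TARGET=1
    while [ "$TARGET" -lt "$PGS" ]; do TARGET=$(( TARGET * 2 )); done
    echo "$TARGET"    # -> 256 for 6 OSDs and a replica size of 3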

* Daemon colocation

It is recommended to dedicate nodes for MONs and RGW, since colocation can have an influence on cluster operations. However, small clusters can run MONs on the OSD nodes, but it is critical to have enough resources for the MON daemons, because they are the most important part of the cluster.

Installing RGW on a node with other daemons isn't recommended, because the RGW daemon usually requires a lot of bandwidth and it can harm cluster health.

* Journal location

There are two ways to set up the journal:
 * **Colocated** journal is located (usually at the beginning) on the same disk as the partition for the data. This setup is easier to install and doesn't require any other disk to be used. However, a colocated setup is significantly slower than a dedicated one.
 * **Dedicated** journal is placed on a different disk than the data. This setup can deliver much higher performance than a colocated one, but it requires more disks in the servers. Journal drives should be carefully selected, because high I/O and durability are required.

* Store type (Bluestore/Filestore)

Recent versions of Ceph support Bluestore as a storage backend, and it should be used if available.

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

* Cluster and public network

The Ceph cluster is accessed over the network, so you need decent capacity to handle all the clients. There are two networks required for the cluster: the **public** network and the cluster network. The public network is used for client connections; MONs and OSDs are listening on this network. The second network is called the **cluster** network and is used for communication between OSDs.

Both networks should have dedicated interfaces; bonding interfaces and dedicating VLANs on bonded interfaces isn't allowed. A good practice is to dedicate more throughput to the cluster network, because cluster traffic is more important than client traffic.

* Pool parameters (size, min_size, type)

You should set up each pool according to its expected usage; at least `min_size`, `size` and the pool type should be considered.
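
For reference, these parameters can also be set at runtime with the standard ceph CLI; the pool name :code:`mypool` below is only an illustrative example:

.. code-block:: bash

    # create a replicated pool and adjust its redundancy parameters
    ceph osd pool create mypool 256 256 replicated
    ceph osd pool set mypool size 3        # number of replicas
    ceph osd pool set mypool min_size 2    # minimum replicas required to serve I/O
    ceph osd pool get mypool size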

* Cluster monitoring

See the *Ceph monitoring* section below for the pillar that enables metric collection.

* Hardware

Please refer to the upstream hardware recommendation guide for general information about hardware.

Ceph servers are required to fulfil special requirements, because the load generated by Ceph can be diametrically opposed to common loads.

http://docs.ceph.com/docs/master/start/hardware-recommendations/


Basic management commands
-------------------------

Cluster
*******

- :code:`ceph health` - check if cluster is healthy (:code:`ceph health detail` can provide more information)

.. code-block:: bash

    root@c-01:~# ceph health
    HEALTH_OK

- :code:`ceph status` - shows basic information about cluster

.. code-block:: bash

    root@c-01:~# ceph status
        cluster e2dc51ae-c5e4-48f0-afc1-9e9e97dfd650
         health HEALTH_OK
         monmap e1: 3 mons at {1=192.168.31.201:6789/0,2=192.168.31.202:6789/0,3=192.168.31.203:6789/0}
                election epoch 38, quorum 0,1,2 1,2,3
         osdmap e226: 6 osds: 6 up, 6 in
          pgmap v27916: 400 pgs, 2 pools, 21233 MB data, 5315 objects
                121 GB used, 10924 GB / 11058 GB avail
                     400 active+clean
      client io 481 kB/s rd, 132 kB/s wr, 185 op/

MON
***

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/

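A quick check of the monitor map and quorum state before diving into the troubleshooting guide (standard ceph commands, not part of the original examples):

.. code-block:: bash

    # summary of the monitor map and current quorum
    ceph mon stat
    # detailed quorum information
    ceph quorum_status --format json-pretty
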
OSD
***

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

- :code:`ceph osd tree` - show all OSDs and their state

.. code-block:: bash

    root@c-01:~# ceph osd tree
    ID WEIGHT   TYPE NAME     UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -4        0 host c-04
    -1 10.79993 root default
    -2  3.59998     host c-01
     0  1.79999         osd.0      up  1.00000          1.00000
     1  1.79999         osd.1      up  1.00000          1.00000
    -3  3.59998     host c-02
     2  1.79999         osd.2      up  1.00000          1.00000
     3  1.79999         osd.3      up  1.00000          1.00000
    -5  3.59998     host c-03
     4  1.79999         osd.4      up  1.00000          1.00000
     5  1.79999         osd.5      up  1.00000          1.00000

- :code:`ceph osd lspools` - list pools

.. code-block:: bash

    root@c-01:~# ceph osd lspools
    0 rbd,1 test

PG
**

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg

- :code:`ceph pg ls` - list placement groups

.. code-block:: bash

    root@c-01:~# ceph pg ls | head -n 4
    pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
    0.0 11 0 0 0 0 46137344 3044 3044 active+clean 2015-07-02 10:12:40.603692 226'10652 226:1798 [4,2,0] 4 [4,2,0] 4 0'0 2015-07-01 18:38:33.126953 0'0 2015-07-01 18:17:01.904194
    0.1 7 0 0 0 0 25165936 3026 3026 active+clean 2015-07-02 10:12:40.585833 226'5808 226:1070 [2,4,1] 2 [2,4,1] 2 0'0 2015-07-01 18:38:32.352721 0'0 2015-07-01 18:17:01.904198
    0.2 18 0 0 0 0 75497472 3039 3039 active+clean 2015-07-02 10:12:39.569630 226'17447 226:3213 [3,1,5] 3 [3,1,5] 3 0'0 2015-07-01 18:38:34.308228 0'0 2015-07-01 18:17:01.904199

- :code:`ceph pg map 1.1` - show mapping between PG and OSD

.. code-block:: bash

    root@c-01:~# ceph pg map 1.1
    osdmap e226 pg 1.1 (1.1) -> up [5,1,2] acting [5,1,2]


Sample pillars
==============

Common metadata for all nodes/roles

.. code-block:: yaml

    ceph:
      common:
        version: luminous
        config:
          global:
            param1: value1
            param2: value1
            param3: value1
          pool_section:
            param1: value2
            param2: value2
            param3: value2
        fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
        members:
        - name: cmn01
          host: 10.0.0.1
        - name: cmn02
          host: 10.0.0.2
        - name: cmn03
          host: 10.0.0.3
        keyring:
          admin:
            key: AQBHPYhZv5mYDBAAvisaSzCTQkC5gywGUp/voA==
            caps:
              mds: "allow *"
              mgr: "allow *"
              mon: "allow *"
              osd: "allow *"
          bootstrap-osd:
            key: BQBHPYhZv5mYDBAAvisaSzCTQkC5gywGUp/voA==
            caps:
              mon: "allow profile bootstrap-osd"


Optional definition for cluster and public networks. Cluster network is used
for replication. Public network for front-end communication.

.. code-block:: yaml

    ceph:
      common:
        version: luminous
        fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
        ....
        public_network: 10.0.0.0/24, 10.1.0.0/24
        cluster_network: 10.10.0.0/24, 10.11.0.0/24


Ceph mon (control) roles
------------------------

Monitors: A Ceph Monitor maintains maps of the cluster state, including the
monitor map, the OSD map, the Placement Group (PG) map, and the CRUSH map.
Ceph maintains a history (called an “epoch”) of each state change in the Ceph
Monitors, Ceph OSD Daemons, and PGs.

.. code-block:: yaml

    ceph:
      common:
        config:
          mon:
            key: value
      mon:
        enabled: true
        keyring:
          mon:
            key: AQAnQIhZ6in5KxAAdf467upoRMWFcVg5pbh1yg==
            caps:
              mon: "allow *"
          admin:
            key: AQBHPYhZv5mYDBAAvisaSzCTQkC5gywGUp/voA==
            caps:
              mds: "allow *"
              mgr: "allow *"
              mon: "allow *"
              osd: "allow *"

Ceph mgr roles
--------------

The Ceph Manager daemon (ceph-mgr) runs alongside monitor daemons, to provide additional monitoring and interfaces to external monitoring and management systems. Since the 12.x (luminous) Ceph release, the ceph-mgr daemon is required for normal operations. The ceph-mgr daemon is an optional component in the 11.x (kraken) Ceph release.

By default, the manager daemon requires no additional configuration, beyond ensuring it is running. If there is no mgr daemon running, you will see a health warning to that effect, and some of the other information in the output of :code:`ceph status` will be missing or stale until a mgr is started.


.. code-block:: yaml

    ceph:
      mgr:
        enabled: true
        dashboard:
          enabled: true
          host: 10.103.255.252
          port: 7000

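To verify that an active mgr has registered and that the expected modules (such as the dashboard configured above) are enabled, the standard luminous commands below can be used; this is just a quick check, not part of the formula itself:

.. code-block:: bash

    # show the active mgr and whether one is available
    ceph mgr dump | grep -E 'active_name|available'
    # list enabled mgr modules (the dashboard should appear here)
    ceph mgr module ls
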

Ceph OSD (storage) roles
------------------------

.. code-block:: yaml

    ceph:
      common:
        config:
          osd:
            key: value
      osd:
        enabled: true
        ceph_host_id: '39'
        journal_size: 20480
        bluestore_block_db_size: 1073741824    # 1G
        bluestore_block_wal_size: 1073741824   # 1G
        bluestore_block_size: 807374182400     # 800G
        backend:
          filestore:
            disks:
            - dev: /dev/sdm
              enabled: false
              rule: hdd
              journal: /dev/ssd
              fs_type: xfs
              class: bestssd
              weight: 1.5
            - dev: /dev/sdl
              rule: hdd
              journal: /dev/ssd
              fs_type: xfs
              class: bestssd
              weight: 1.5
          bluestore:
            disks:
            - dev: /dev/sdb
            - dev: /dev/sdc
              block_db: /dev/ssd
              block_wal: /dev/ssd
            - dev: /dev/sdd
              enabled: false


Ceph client roles
-----------------

Simple ceph client service

.. code-block:: yaml

    ceph:
      client:
        config:
          global:
            mon initial members: ceph1,ceph2,ceph3
            mon host: 10.103.255.252:6789,10.103.255.253:6789,10.103.255.254:6789
        keyring:
          monitoring:
            key: 00000000000000000000000000000000000000==

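Assuming the keyring above ends up at the conventional ceph path, the client configuration can be verified with the ceph CLI; the client id and keyring path below are illustrative and may differ in your deployment:

.. code-block:: bash

    ceph --id monitoring \
         --keyring /etc/ceph/ceph.client.monitoring.keyring \
         health
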
On OpenStack control nodes these settings are usually located in the cinder-volume
or glance-registry services.

.. code-block:: yaml

    ceph:
      client:
        config:
          global:
            fsid: 00000000-0000-0000-0000-000000000000
            mon initial members: ceph1,ceph2,ceph3
            mon host: 10.103.255.252:6789,10.103.255.253:6789,10.103.255.254:6789
            osd_fs_mkfs_arguments_xfs:
            osd_fs_mount_options_xfs: rw,noatime
            network public: 10.0.0.0/24
            network cluster: 10.0.0.0/24
            osd_fs_type: xfs
          osd:
            osd journal size: 7500
            filestore xattr use omap: true
          mon:
            mon debug dump transactions: false
        keyring:
          cinder:
            key: 00000000000000000000000000000000000000==
          glance:
            key: 00000000000000000000000000000000000000==


Ceph gateway
------------

Rados gateway with keystone v2 auth backend

.. code-block:: yaml

    ceph:
      radosgw:
        enabled: true
        hostname: gw.ceph.lab
        bind:
          address: 10.10.10.1
          port: 8080
        identity:
          engine: keystone
          api_version: 2
          host: 10.10.10.100
          port: 5000
          user: admin
          password: password
          tenant: admin

Rados gateway with keystone v3 auth backend

.. code-block:: yaml

    ceph:
      radosgw:
        enabled: true
        hostname: gw.ceph.lab
        bind:
          address: 10.10.10.1
          port: 8080
        identity:
          engine: keystone
          api_version: 3
          host: 10.10.10.100
          port: 5000
          user: admin
          password: password
          project: admin
          domain: default

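Once the gateway is running, a plain HTTP request against the bind address (taken from the example above) is a simple liveness check; an anonymous request should return an S3-style XML response rather than a connection error:

.. code-block:: bash

    curl -i http://10.10.10.1:8080/

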
Ceph setup role
---------------

Replicated ceph storage pool

.. code-block:: yaml

    ceph:
      setup:
        pool:
          replicated_pool:
            pg_num: 256
            pgp_num: 256
            type: replicated
            crush_ruleset_name: 0

Erasure ceph storage pool

.. code-block:: yaml

    ceph:
      setup:
        pool:
          erasure_pool:
            pg_num: 256
            pgp_num: 256
            type: erasure
            crush_ruleset_name: 0

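After the setup state has been applied, the created pools and their parameters can be reviewed with:

.. code-block:: bash

    ceph osd pool ls detail    # pg_num, size and crush rule per pool
    ceph df                    # usage per pool
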
Generate CRUSH map
------------------

It is required to define the `type` for crush buckets and these types must start with `root` (top) and end with `host`. OSD daemons will be assigned to hosts according to their hostnames. The weight of a bucket will be calculated according to the weight of its children.

.. code-block:: yaml

    ceph:
      setup:
        crush:
          enabled: True
          tunables:
            choose_total_tries: 50
          type:
            - root
            - region
            - rack
            - host
          root:
            - name: root1
            - name: root2
          region:
            - name: eu-1
              parent: root1
            - name: eu-2
              parent: root1
            - name: us-1
              parent: root2
          rack:
            - name: rack01
              parent: eu-1
            - name: rack02
              parent: eu-2
            - name: rack03
              parent: us-1
          rule:
            sata:
              ruleset: 0
              type: replicated
              min_size: 1
              max_size: 10
              steps:
                - take crushroot.performanceblock.satahss.1
                - choseleaf firstn 0 type failure_domain
                - emit

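After applying the CRUSH setup, the resulting hierarchy and rules can be inspected with the ceph CLI (:code:`sata` is the rule name from the example above; :code:`ceph osd crush tree` is available from luminous on):

.. code-block:: bash

    ceph osd crush tree
    ceph osd crush rule ls
    ceph osd crush rule dump sata
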
Ceph monitoring
---------------

Collect general cluster metrics

.. code-block:: yaml

    ceph:
      monitoring:
        cluster_stats:
          enabled: true
          ceph_user: monitoring

Collect metrics from monitor and OSD services

.. code-block:: yaml

    ceph:
      monitoring:
        node_stats:
          enabled: true


More information
================

* https://github.com/cloud-ee/ceph-salt-formula
* http://ceph.com/ceph-storage/
* http://ceph.com/docs/master/start/intro/


Documentation and bugs
======================

To learn how to install and update salt-formulas, consult the documentation
available online at:

    http://salt-formulas.readthedocs.io/

In the unfortunate event that bugs are discovered, they should be reported to
the appropriate issue tracker. Use the Github issue tracker for the specific
salt formula:

    https://github.com/salt-formulas/salt-formula-ceph/issues

For feature requests, bug reports or blueprints affecting the entire ecosystem,
use the Launchpad salt-formulas project:

    https://launchpad.net/salt-formulas

You can also join the salt-formulas-users team and subscribe to the mailing list:

    https://launchpad.net/~salt-formulas-users

Developers wishing to work on the salt-formulas projects should always base
their work on the master branch and submit pull requests against the specific
formula.

    https://github.com/salt-formulas/salt-formula-ceph

Any questions or feedback are always welcome, so feel free to join our IRC
channel:

    #salt-formulas @ irc.freenode.net