
============
Ceph formula
============

Ceph provides extraordinary data storage scalability: thousands of client
hosts or KVMs can access petabytes to exabytes of data. Each one of your
applications can use the object, block or file system interfaces to the same
RADOS cluster simultaneously, which means your Ceph storage system serves as a
flexible foundation for all of your data storage needs.

Use salt-formula-linux for initial disk partitioning.


Daemons
--------

Ceph uses several daemons to handle data and cluster state. Each daemon type requires different computing capacity and hardware optimization.

The formula currently supports these daemons:

* MON (`ceph.mon`)
* OSD (`ceph.osd`)
* RGW (`ceph.radosgw`)


Architecture decisions
-----------------------

Please refer to the upstream architecture documents before designing your cluster. A solid understanding of Ceph principles is essential for making the architecture decisions described below.
http://docs.ceph.com/docs/master/architecture/

* Ceph version

There are 3 or 4 stable releases every year and many nightly/dev releases. Decide which version to use, since only stable releases are recommended for production. Some releases are marked LTS (Long Term Stable) and receive bug fixes for a longer period, usually until the next LTS version is released.

* Number of MON daemons

Use 1 MON daemon for testing, 3 MONs for smaller production clusters and 5 MONs for very large production clusters. There is no significant benefit in running more than 5 MONs in a normal environment. Ceph requires the MONs to form a quorum, so more than 50% of the MONs must be up and running for the cluster to be fully operational. I/O operations stop once a majority of the MONs is no longer available, because the remaining MONs cannot form a quorum.

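The quorum rule can be checked with simple arithmetic. The following is a minimal sketch and not part of the formula; ``quorum_size`` and ``tolerated_failures`` are hypothetical helper names introduced here for illustration.

.. code-block:: bash

    # Smallest number of MONs that forms a strict majority (more than 50%).
    quorum_size() {
        local mons=$1
        echo $(( mons / 2 + 1 ))
    }

    # Number of MONs that may be down while the rest can still form a quorum.
    tolerated_failures() {
        local mons=$1
        echo $(( mons - (mons / 2 + 1) ))
    }

    for n in 1 3 4 5; do
        echo "mons=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
    done

Under this rule an even MON count buys nothing: 4 MONs tolerate a single failure, the same as 3, which is consistent with the 1/3/5 recommendation.
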
* Number of PGs

Placement groups (PGs) provide the mapping between stored data and OSDs. The number of PGs has to be calculated because each OSD should hold a reasonable number of PGs. Please keep in mind that *decreasing the number of PGs* isn't possible and *increasing* it can affect cluster performance.

http://docs.ceph.com/docs/master/rados/operations/placement-groups/
http://ceph.com/pgcalc/

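As a rough starting point, the commonly quoted rule of thumb is about 100 PGs per OSD divided by the replica count, rounded up to the next power of two. The sketch below assumes that rule; the function name ``suggested_pg_count`` and the default of 100 PGs per OSD are illustrative assumptions, and the upstream calculator linked above remains the reference for real clusters.

.. code-block:: bash

    # Rule-of-thumb PG count: (OSDs * target PGs per OSD) / replicas,
    # rounded up to the next power of two.
    # Arguments: number of OSDs, replica count, optional PGs per OSD (default 100).
    suggested_pg_count() {
        local osds=$1
        local replicas=$2
        local per_osd=${3:-100}
        local raw=$(( osds * per_osd / replicas ))
        local pgs=1
        while [ "$pgs" -lt "$raw" ]; do
            pgs=$(( pgs * 2 ))
        done
        echo "$pgs"
    }

    # Example: 6 OSDs with 3 replicas.
    suggested_pg_count 6 3

The result is a total to be divided between the pools of the cluster, not a value to apply to every pool.
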
* Daemon colocation

It is recommended to dedicate nodes to MONs and RGW, since colocation can have an influence on cluster operations. However, small clusters can run MONs on OSD nodes, but it is critical to have enough resources for the MON daemons because they are the most important part of the cluster.

Installing RGW on a node with other daemons isn't recommended because the RGW daemon usually requires a lot of bandwidth and this can harm cluster health.

* Store type (Bluestore/Filestore)

Recent versions of Ceph support Bluestore as a storage backend, and this backend should be used if available.

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

* Block.db location for Bluestore

There are two ways to set up block.db:

  * **Colocated**: the block.db partition is created on the same disk as the data partition. This setup is easier to install and doesn't require any additional disk. However, a colocated setup is significantly slower than a dedicated one.
  * **Dedicated**: block.db is placed on a different disk than the data (or in a partition of it). This setup can deliver much higher performance than a colocated one, but it requires more disks in the servers. Block.db drives should be carefully selected because high I/O and durability are required.

* Block.wal location for Bluestore

Block.wal stores just the internal journal (write-ahead log). There are two ways to set up block.wal:

  * **Colocated**: block.wal uses free space of the block.db device.
  * **Dedicated**: block.wal is placed on a different disk than the data, and possibly than the block.db device (preferably in a partition, as the size can be small). This setup can deliver much higher performance than a colocated one, but it requires more disks in the servers. Block.wal drives should be carefully selected because high I/O and durability are required.

* Journal location for Filestore

There are two ways to set up the journal:

  * **Colocated**: the journal is created on the same disk as the data partition. This setup is easier to install and doesn't require any additional disk. However, a colocated setup is significantly slower than a dedicated one.
  * **Dedicated**: the journal is placed on a different disk than the data (or in a partition of it). This setup can deliver much higher performance than a colocated one, but it requires more disks in the servers. Journal drives should be carefully selected because high I/O and durability are required.

* Cluster and public network

A Ceph cluster is accessed over the network, so you need decent capacity to handle all the clients. Two networks are required for the cluster: the **public** network and the **cluster** network. The public network is used for client connections, and MONs and OSDs listen on this network. The cluster network is used for communication between OSDs.

Both networks should have dedicated interfaces; bonding interfaces and dedicating VLANs on bonded interfaces isn't allowed. It is good practice to dedicate more throughput to the cluster network because cluster traffic is more important than client traffic.

* Pool parameters (size, min_size, type)

You should set up each pool according to its expected usage. At least `min_size`, `size` and the pool type should be considered.

* Cluster monitoring

* Hardware

Please refer to the upstream hardware recommendation guide for general information about hardware.

Ceph servers have to fulfil special requirements because the load generated by Ceph can be diametrically opposed to common load.

http://docs.ceph.com/docs/master/start/hardware-recommendations/


Basic management commands
------------------------------

Cluster
********

- :code:`ceph health` - check if cluster is healthy (:code:`ceph health detail` can provide more information)


.. code-block:: bash

    root@c-01:~# ceph health
    HEALTH_OK

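For scripted checks, the first word of the ``ceph health`` output can be mapped to an exit code. The sketch below assumes the Nagios-style convention (0 OK, 1 warning, 2 critical, 3 unknown); ``health_to_exit_code`` is a hypothetical helper, not something shipped with Ceph or this formula.

.. code-block:: bash

    # Map the output of `ceph health` to a Nagios-style exit code.
    health_to_exit_code() {
        case "$1" in
            HEALTH_OK*)   echo 0 ;;
            HEALTH_WARN*) echo 1 ;;
            HEALTH_ERR*)  echo 2 ;;
            *)            echo 3 ;;
        esac
    }

    # Demonstration with the sample output shown above.
    health_to_exit_code "HEALTH_OK"

On a cluster node it could be used as ``exit "$(health_to_exit_code "$(ceph health)")"``.
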
- :code:`ceph status` - shows basic information about cluster


.. code-block:: bash

    root@c-01:~# ceph status
        cluster e2dc51ae-c5e4-48f0-afc1-9e9e97dfd650
         health HEALTH_OK
         monmap e1: 3 mons at {1=192.168.31.201:6789/0,2=192.168.31.202:6789/0,3=192.168.31.203:6789/0}
                election epoch 38, quorum 0,1,2 1,2,3
         osdmap e226: 6 osds: 6 up, 6 in
          pgmap v27916: 400 pgs, 2 pools, 21233 MB data, 5315 objects
                121 GB used, 10924 GB / 11058 GB avail
                     400 active+clean
      client io 481 kB/s rd, 132 kB/s wr, 185 op/

MON
****

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/

OSD
****

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

- :code:`ceph osd tree` - show all OSDs and their state

.. code-block:: bash

    root@c-01:~# ceph osd tree
    ID WEIGHT   TYPE NAME     UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -4        0 host c-04
    -1 10.79993 root default
    -2  3.59998     host c-01
     0  1.79999         osd.0      up  1.00000          1.00000
     1  1.79999         osd.1      up  1.00000          1.00000
    -3  3.59998     host c-02
     2  1.79999         osd.2      up  1.00000          1.00000
     3  1.79999         osd.3      up  1.00000          1.00000
    -5  3.59998     host c-03
     4  1.79999         osd.4      up  1.00000          1.00000
     5  1.79999         osd.5      up  1.00000          1.00000

- :code:`ceph osd lspools` - list pools

.. code-block:: bash

    root@c-01:~# ceph osd lspools
    0 rbd,1 test

PG
***

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg

- :code:`ceph pg ls` - list placement groups

.. code-block:: bash

    root@c-01:~# ceph pg ls | head -n 4
    pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
    0.0 11 0 0 0 0 46137344 3044 3044 active+clean 2015-07-02 10:12:40.603692 226'10652 226:1798 [4,2,0] 4 [4,2,0] 4 0'0 2015-07-01 18:38:33.126953 0'0 2015-07-01 18:17:01.904194
    0.1 7 0 0 0 0 25165936 3026 3026 active+clean 2015-07-02 10:12:40.585833 226'5808 226:1070 [2,4,1] 2 [2,4,1] 2 0'0 2015-07-01 18:38:32.352721 0'0 2015-07-01 18:17:01.904198
    0.2 18 0 0 0 0 75497472 3039 3039 active+clean 2015-07-02 10:12:39.569630 226'17447 226:3213 [3,1,5] 3 [3,1,5] 3 0'0 2015-07-01 18:38:34.308228 0'0 2015-07-01 18:17:01.904199

- :code:`ceph pg map 1.1` - show mapping between PG and OSD

.. code-block:: bash

    root@c-01:~# ceph pg map 1.1
    osdmap e226 pg 1.1 (1.1) -> up [5,1,2] acting [5,1,2]



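When scripting around ``ceph pg map``, the acting set can be pulled out of the output line with ``sed``. This is a minimal sketch that assumes the single-line output format shown in the sample above; ``acting_osds`` is a hypothetical helper name.

.. code-block:: bash

    # Print the comma-separated acting set from one line of `ceph pg map` output.
    acting_osds() {
        printf '%s\n' "$1" | sed -n 's/.*acting \[\([0-9,]*\)\].*/\1/p'
    }

    # Sample line copied from the output above.
    sample='osdmap e226 pg 1.1 (1.1) -> up [5,1,2] acting [5,1,2]'
    acting_osds "$sample"

The first OSD in the list is the primary for that placement group.
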
Sample pillars
==============

Common metadata for all nodes/roles

.. code-block:: yaml

    ceph:
      common:
        version: luminous
        cluster_name: ceph
        config:
          global:
            param1: value1
            param2: value1
            param3: value1
          pool_section:
            param1: value2
            param2: value2
            param3: value2
        fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
        members:
        - name: cmn01
          host: 10.0.0.1
        - name: cmn02
          host: 10.0.0.2
        - name: cmn03
          host: 10.0.0.3
        keyring:
          admin:
            caps:
              mds: "allow *"
              mgr: "allow *"
              mon: "allow *"
              osd: "allow *"
          bootstrap-osd:
            caps:
              mon: "allow profile bootstrap-osd"


Optional definition for cluster and public networks. The cluster network is
used for replication; the public network is used for front-end communication.

.. code-block:: yaml

    ceph:
      common:
        version: luminous
        fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
        ....
        public_network: 10.0.0.0/24, 10.1.0.0/24
        cluster_network: 10.10.0.0/24, 10.11.0.0/24


Ceph mon (control) roles
------------------------

Monitors: A Ceph Monitor maintains maps of the cluster state, including the
monitor map, the OSD map, the Placement Group (PG) map, and the CRUSH map.
Ceph maintains a history (called an "epoch") of each state change in the Ceph
Monitors, Ceph OSD Daemons, and PGs.

.. code-block:: yaml

    ceph:
      common:
        config:
          mon:
            key: value
      mon:
        enabled: true
        keyring:
          mon:
            caps:
              mon: "allow *"
          admin:
            caps:
              mds: "allow *"
              mgr: "allow *"
              mon: "allow *"
              osd: "allow *"

Ceph mgr roles
------------------------

The Ceph Manager daemon (ceph-mgr) runs alongside monitor daemons to provide additional monitoring and interfaces to external monitoring and management systems. Since the 12.x (luminous) Ceph release, the ceph-mgr daemon is required for normal operations. The ceph-mgr daemon is an optional component in the 11.x (kraken) Ceph release.

By default, the manager daemon requires no additional configuration beyond ensuring it is running. If there is no mgr daemon running, you will see a health warning to that effect, and some of the other information in the output of ``ceph status`` will be missing or stale until a mgr is started.


.. code-block:: yaml

    ceph:
      mgr:
        enabled: true
        dashboard:
          enabled: true
          host: 10.103.255.252
          port: 7000


Ceph OSD (storage) roles
------------------------

.. code-block:: yaml

    ceph:
      common:
        version: luminous
        fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
        public_network: 10.0.0.0/24, 10.1.0.0/24
        cluster_network: 10.10.0.0/24, 10.11.0.0/24
        keyring:
          bootstrap-osd:
            caps:
              mon: "allow profile bootstrap-osd"
        ....
      osd:
        enabled: true
        crush_parent: rack01
        journal_size: 20480  # 20G
        bluestore_block_db_size: 10073741824  # 10G
        bluestore_block_wal_size: 10073741824  # 10G
        bluestore_block_size: 807374182400  # 800G
        backend:
          filestore:
            disks:
            - dev: /dev/sdm
              enabled: false
              journal: /dev/ssd
              journal_partition: 5
              data_partition: 6
              lockbox_partition: 7
              data_partition_size: 12000  # MB
              class: bestssd
              weight: 1.666
              dmcrypt: true
              journal_dmcrypt: false
            - dev: /dev/sdf
              journal: /dev/ssd
              journal_dmcrypt: true
              class: bestssd
              weight: 1.666
            - dev: /dev/sdl
              journal: /dev/ssd
              class: bestssd
              weight: 1.666
          bluestore:
            disks:
            - dev: /dev/sdb
            - dev: /dev/sdf
              block_db: /dev/ssd
              block_wal: /dev/ssd
              block_db_dmcrypt: true
              block_wal_dmcrypt: true
            - dev: /dev/sdc
              block_db: /dev/ssd
              block_wal: /dev/ssd
              data_partition: 1
              block_partition: 2
              lockbox_partition: 5
              block_db_partition: 3
              block_wal_partition: 4
              class: ssd
              weight: 1.666
              dmcrypt: true
              block_db_dmcrypt: false
              block_wal_dmcrypt: false
            - dev: /dev/sdd
              enabled: false


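The size values in the pillar above are plain integers. The sketch below assumes that the ``bluestore_block_*_size`` values are given in bytes and ``journal_size`` in megabytes, as the annotations suggest; ``gib_to_bytes`` and ``gib_to_mib`` are hypothetical helpers for computing exact binary sizes.

.. code-block:: bash

    # Convert a size in GiB to bytes (for the bluestore_block_*_size values).
    gib_to_bytes() {
        echo $(( $1 * 1024 * 1024 * 1024 ))
    }

    # Convert a size in GiB to MiB (for journal_size).
    gib_to_mib() {
        echo $(( $1 * 1024 ))
    }

    echo "journal_size: $(gib_to_mib 20)"
    echo "bluestore_block_db_size: $(gib_to_bytes 10)"
    echo "bluestore_block_size: $(gib_to_bytes 800)"

Note that the sample values ``10073741824`` and ``807374182400`` are close to, but not exactly, 10 GiB and 800 GiB.
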
If custom block devices are used (for example, loop devices for testing
purposes), the proper partition prefix has to be specified.

.. code-block:: yaml

    ceph:
      osd:
        backend:
          bluestore:
            disks:
            - dev: /dev/loop20
              block_db: /dev/loop21
              data_partition_prefix: 'p'


Ceph client roles (deprecated, use ceph:common instead)
-------------------------------------------------------

Simple ceph client service

.. code-block:: yaml

    ceph:
      client:
        config:
          global:
            mon initial members: ceph1,ceph2,ceph3
            mon host: 10.103.255.252:6789,10.103.255.253:6789,10.103.255.254:6789
        keyring:
          monitoring:
            key: 00000000000000000000000000000000000000==

On OpenStack control nodes, these settings are usually located at the
cinder-volume or glance-registry services.

.. code-block:: yaml

    ceph:
      client:
        config:
          global:
            fsid: 00000000-0000-0000-0000-000000000000
            mon initial members: ceph1,ceph2,ceph3
            mon host: 10.103.255.252:6789,10.103.255.253:6789,10.103.255.254:6789
            osd_fs_mkfs_arguments_xfs:
            osd_fs_mount_options_xfs: rw,noatime
            network public: 10.0.0.0/24
            network cluster: 10.0.0.0/24
            osd_fs_type: xfs
          osd:
            osd journal size: 7500
            filestore xattr use omap: true
          mon:
            mon debug dump transactions: false
        keyring:
          cinder:
            key: 00000000000000000000000000000000000000==
          glance:
            key: 00000000000000000000000000000000000000==


Ceph gateway
------------

Rados gateway with keystone v2 auth backend

.. code-block:: yaml

    ceph:
      radosgw:
        enabled: true
        hostname: gw.ceph.lab
        bind:
          address: 10.10.10.1
          port: 8080
        identity:
          engine: keystone
          api_version: 2
          host: 10.10.10.100
          port: 5000
          user: admin
          password: password
          tenant: admin

Rados gateway with keystone v3 auth backend

.. code-block:: yaml

    ceph:
      common:
        config:
          rgw:
            key: value
      radosgw:
        enabled: true
        hostname: gw.ceph.lab
        bind:
          address: 10.10.10.1
          port: 8080
        identity:
          engine: keystone
          api_version: 3
          host: 10.10.10.100
          port: 5000
          user: admin
          password: password
          project: admin
          domain: default
        swift:
          versioning:
            enabled: true
          enforce_content_length: true


Ceph setup role
---------------

Replicated ceph storage pool

.. code-block:: yaml

    ceph:
      setup:
        pool:
          replicated_pool:
            pg_num: 256
            pgp_num: 256
            type: replicated
            crush_rule: sata
            application: rbd

.. note:: For Kraken and earlier releases please specify crush_rule as a ruleset number.
          For Kraken and earlier releases application param is not needed.

Erasure ceph storage pool

.. code-block:: yaml

    ceph:
      setup:
        pool:
          erasure_pool:
            pg_num: 256
            pgp_num: 256
            type: erasure
            crush_rule: ssd
            application: rbd


Inline compression for Bluestore backend

.. code-block:: yaml

    ceph:
      setup:
        pool:
          volumes:
            pg_num: 256
            pgp_num: 256
            type: replicated
            crush_rule: hdd
            application: rbd
            compression_algorithm: snappy
            compression_mode: aggressive
            compression_required_ratio: .875
            ...


Ceph manage keyring keys
------------------------

Keyrings are dynamically generated unless specified by the following pillar.

.. code-block:: yaml

    ceph:
      common:
        manage_keyring: true
        keyring:
          glance:
            name: images
            key: AACf3ulZFFPNDxAAd2DWds3aEkHh4IklZVgIaQ==
            caps:
              mon: "allow r"
              osd: "allow class-read object_prefix rbd_children, allow rwx pool=images"


Generate CRUSH map - Recommended way
------------------------------------

It is required to define the `type` for crush buckets, and these types must start with `root` (top) and end with `host`. OSD daemons will be assigned to hosts according to their hostname. The weight of each bucket will be calculated from the weight of its children.

If the pools in use have a size of 3, it is best to have 3 children of a specific type in the root CRUSH tree to replicate objects across (specified in the rule steps by 'type region').

.. code-block:: yaml

    ceph:
      setup:
        crush:
          enabled: True
          tunables:
            choose_total_tries: 50
            choose_local_tries: 0
            choose_local_fallback_tries: 0
            chooseleaf_descend_once: 1
            chooseleaf_vary_r: 1
            chooseleaf_stable: 1
            straw_calc_version: 1
            allowed_bucket_algs: 54
          type:
            - root
            - region
            - rack
            - host
            - osd
          root:
            - name: root-ssd
            - name: root-sata
          region:
            - name: eu-1
              parent: root-sata
            - name: eu-2
              parent: root-sata
            - name: eu-3
              parent: root-ssd
            - name: us-1
              parent: root-sata
          rack:
            - name: rack01
              parent: eu-1
            - name: rack02
              parent: eu-2
            - name: rack03
              parent: us-1
          rule:
            sata:
              ruleset: 0
              type: replicated
              min_size: 1
              max_size: 10
              steps:
                - take root-ssd
                - chooseleaf firstn 0 type region
                - emit
            ssd:
              ruleset: 1
              type: replicated
              min_size: 1
              max_size: 10
              steps:
                - take root-sata
                - chooseleaf firstn 0 type region
                - emit


Generate CRUSH map - Alternative way
------------------------------------

It's necessary to create a pillar for each OSD.

.. code-block:: yaml

    ceph:
      osd:
        crush:
          - type: root
            name: root1
          - type: region
            name: eu-1
          - type: rack
            name: rack01
          - type: host
            name: osd001

Add OSDs with specific weight
-----------------------------

Add OSD device(s) with the initial weight set to a specific value.

.. code-block:: yaml

    ceph:
      osd:
        crush_initial_weight: 0


Apply CRUSH map
---------------

Before you apply the CRUSH map, please make sure that the settings in the generated file /etc/ceph/crushmap are correct.

.. code-block:: yaml

    ceph:
      setup:
        crush:
          enforce: true
        pool:
          images:
            crush_rule: sata
            application: rbd
          volumes:
            crush_rule: sata
            application: rbd
          vms:
            crush_rule: ssd
            application: rbd

.. note:: For Kraken and earlier releases please specify crush_rule as a ruleset number.
          For Kraken and earlier releases application param is not needed.


Persist CRUSH map
--------------------

After the CRUSH map is applied to Ceph, it's recommended to persist these settings so that they are kept even after OSD reboots.

.. code-block:: yaml

    ceph:
      osd:
        crush_update: false


Ceph monitoring
---------------

By default, monitoring is set up to collect information from MON and OSD nodes. To change the default values, add the following pillar to MON nodes.

.. code-block:: yaml

    ceph:
      monitoring:
        space_used_warning_threshold: 0.75
        space_used_critical_threshold: 0.85
        apply_latency_threshold: 0.007
        commit_latency_threshold: 0.7
        pool:
          vms:
            pool_space_used_utilization_warning_threshold: 0.75
            pool_space_used_critical_threshold: 0.85
            pool_write_ops_threshold: 200
            pool_write_bytes_threshold: 70000000
            pool_read_bytes_threshold: 70000000
            pool_read_ops_threshold: 1000
          images:
            pool_space_used_utilization_warning_threshold: 0.50
            pool_space_used_critical_threshold: 0.95
            pool_write_ops_threshold: 100
            pool_write_bytes_threshold: 50000000
            pool_read_bytes_threshold: 50000000
            pool_read_ops_threshold: 500

Mateusz Los | 4dd8c4f | 2017-12-01 09:53:02 +0100 | [diff] [blame] | 713 | Ceph monitor backups |
| 714 | -------------------- |
| 715 | |
| 716 | Backup client with ssh/rsync remote host |
| 717 | |
| 718 | .. code-block:: yaml |
| 719 | |
| 720 |     ceph:
| 721 |       backup:
| 722 |         client:
| 723 |           enabled: true
| 724 |           full_backups_to_keep: 3
| 725 |           hours_before_full: 24
| 726 |           target:
| 727 |             host: cfg01
Jiri Broulik | 44feb04 | 2018-03-05 12:10:19 +0100 | [diff] [blame] | 728 |             backup_dir: server-backup-dir
Mateusz Los | 4dd8c4f | 2017-12-01 09:53:02 +0100 | [diff] [blame] | 729 | |
| 730 | Backup client with local backup only |
| 731 | |
| 732 | .. code-block:: yaml |
| 733 | |
| 734 |     ceph:
| 735 |       backup:
| 736 |         client:
| 737 |           enabled: true
| 738 |           full_backups_to_keep: 3
| 739 |           hours_before_full: 24
| 740 | |
Martin Polreich | 8d37f28 | 2018-03-04 17:38:15 +0100 | [diff] [blame] | 741 | |
| 742 | Backup client at exact times: |
| 743 | |
| 744 | .. code-block:: yaml
| 745 | |
| 746 |     ceph:
| 747 |       backup:
| 748 |         client:
| 749 |           enabled: true
| 750 |           full_backups_to_keep: 3
| 751 |           incr_before_full: 3
| 752 |           backup_times:
Martin Polreich | fe1b390 | 2018-04-25 15:32:30 +0200 | [diff] [blame] | 753 |             day_of_week: 0
Martin Polreich | 8d37f28 | 2018-03-04 17:38:15 +0100 | [diff] [blame] | 754 |             hour: 4
| 755 |             minute: 52
| 756 |           compression: true
| 757 |           compression_threads: 2
| 758 |           database:
| 759 |             user: user
| 760 |             password: password
| 761 |           target:
| 762 |             host: host01
| 763 | |
| 764 | .. note:: Parameters in the ``backup_times`` section set the exact
| 765 |    time at which the cron job is executed. In this example, the backup job
| 766 |    is executed every Sunday at 4:52 AM. If any of the individual
| 767 |    ``backup_times`` parameters is not defined, the default ``*`` value is
| 768 |    used. For example, if the ``minute`` parameter is ``*``, the backup runs every minute,
| 769 |    which is usually not desired.
Martin Polreich | fe1b390 | 2018-04-25 15:32:30 +0200 | [diff] [blame] | 770 |    Available parameters are ``day_of_week``, ``day_of_month``, ``month``, ``hour`` and ``minute``.
Martin Polreich | 8d37f28 | 2018-03-04 17:38:15 +0100 | [diff] [blame] | 771 |    Please see the crontab reference for further information on how to set these parameters.
| 772 | |
| 773 | .. note:: Please be aware that only one of the ``backup_times`` section or
| 774 |    ``hours_before_full(incr)`` can be defined. If both are defined,
| 775 |    the ``backup_times`` section takes precedence.
| 776 | |
| 777 | .. note:: The new parameter ``incr_before_full`` must be defined. It
| 778 |    sets the number of incremental backups to run before a full backup
| 779 |    is performed.
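
As a further illustration of the ``backup_times`` defaults described in the notes above, the following sketch sets only ``hour`` and ``minute``. Assuming the undefined fields fall back to ``*`` as stated, the cron job would run every day at 2:30 AM. The values are illustrative only and are not taken from a real deployment.

.. code-block:: yaml

    ceph:
      backup:
        client:
          enabled: true
          full_backups_to_keep: 3
          incr_before_full: 3
          backup_times:
            # day_of_week, day_of_month and month are left undefined,
            # so they are assumed to default to "*" (illustrative values)
            hour: 2
            minute: 30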
| 780 | |
Mateusz Los | 4dd8c4f | 2017-12-01 09:53:02 +0100 | [diff] [blame] | 781 | Backup server rsync |
| 782 | |
| 783 | .. code-block:: yaml |
| 784 | |
| 785 |     ceph:
| 786 |       backup:
| 787 |         server:
| 788 |           enabled: true
| 789 |           hours_before_full: 24
| 790 |           full_backups_to_keep: 5
| 791 |           key:
| 792 |             ceph_pub_key:
| 793 |               enabled: true
| 794 |               key: ssh_rsa
| 795 | |
Jiri Broulik | 62892df | 2018-02-28 16:22:00 +0100 | [diff] [blame] | 796 | Backup server without strict client restriction |
| 797 | |
| 798 | .. code-block:: yaml |
| 799 | |
| 800 |     ceph:
| 801 |       backup:
| 802 |         restrict_clients: false
| 803 | |
Martin Polreich | 8d37f28 | 2018-03-04 17:38:15 +0100 | [diff] [blame] | 804 | Backup server at exact times: |
| 805 | |
| 806 | .. code-block:: yaml
| 807 | |
| 808 |     ceph:
| 809 |       backup:
| 810 |         server:
| 811 |           enabled: true
| 812 |           full_backups_to_keep: 3
| 813 |           incr_before_full: 3
| 814 |           backup_dir: /srv/backup
| 815 |           backup_times:
Martin Polreich | fe1b390 | 2018-04-25 15:32:30 +0200 | [diff] [blame] | 816 |             day_of_week: 0
Martin Polreich | 8d37f28 | 2018-03-04 17:38:15 +0100 | [diff] [blame] | 817 |             hour: 4
| 818 |             minute: 52
| 819 |           key:
| 820 |             ceph_pub_key:
| 821 |               enabled: true
| 822 |               key: key
| 823 | |
| 824 | .. note:: Parameters in the ``backup_times`` section set the exact
| 825 |    time at which the cron job is executed. In this example, the backup job
| 826 |    is executed every Sunday at 4:52 AM. If any of the individual
| 827 |    ``backup_times`` parameters is not defined, the default ``*`` value is
| 828 |    used. For example, if the ``minute`` parameter is ``*``, the backup runs every minute,
| 829 |    which is usually not desired.
Martin Polreich | fe1b390 | 2018-04-25 15:32:30 +0200 | [diff] [blame] | 830 |    Available parameters are ``day_of_week``, ``day_of_month``, ``month``, ``hour`` and ``minute``.
Martin Polreich | 8d37f28 | 2018-03-04 17:38:15 +0100 | [diff] [blame] | 831 |    Please see the crontab reference for further information on how to set these parameters.
| 832 | |
| 833 | .. note:: Please be aware that only one of the ``backup_times`` section or
| 834 |    ``hours_before_full(incr)`` can be defined. If both are defined, the
| 835 |    ``backup_times`` section takes precedence.
| 836 | |
| 837 | .. note:: The new parameter ``incr_before_full`` must be defined. It
| 838 |    sets the number of incremental backups to run before a full backup
| 839 |    is performed.
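
The ``day_of_month`` parameter listed above can be sketched in the same way. Assuming standard crontab semantics and the ``*`` default for undefined fields, the following server pillar would run the job at 3:00 AM on the first day of every month. The schedule values and the ``key`` placeholder are illustrative only.

.. code-block:: yaml

    ceph:
      backup:
        server:
          enabled: true
          full_backups_to_keep: 3
          incr_before_full: 3
          backup_dir: /srv/backup
          backup_times:
            # month and day_of_week are left undefined,
            # so they are assumed to default to "*" (illustrative values)
            day_of_month: 1
            hour: 3
            minute: 0
          key:
            ceph_pub_key:
              enabled: true
              key: key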
| 840 | |
Jiri Broulik | 4255205 | 2018-02-15 15:23:29 +0100 | [diff] [blame] | 841 | Migration from Decapod to salt-formula-ceph |
| 842 | -------------------------------------------- |
| 843 | |
| 844 | The following configuration runs a Python script that generates the Ceph configuration and OSD disk mappings to be put into the cluster model.
| 845 | |
| 846 | .. code-block:: yaml |
| 847 | |
| 848 |     ceph:
| 849 |       decapod:
| 850 |         ip: 192.168.1.10
| 851 |         user: user
| 852 |         password: psswd
| 853 |         deploy_config_name: ceph
Mateusz Los | 4dd8c4f | 2017-12-01 09:53:02 +0100 | [diff] [blame] | 854 | |
Simon Pasquier | f8e6f9e | 2017-07-03 10:15:20 +0200 | [diff] [blame] | 855 | |
Ondrej Smola | 81d1a19 | 2017-08-17 11:13:10 +0200 | [diff] [blame] | 856 | More information |
| 857 | ================ |
jpavlik | 8425d36 | 2015-06-09 15:23:27 +0200 | [diff] [blame] | 858 | |
| 859 | * https://github.com/cloud-ee/ceph-salt-formula |
| 860 | * http://ceph.com/ceph-storage/ |
jan kaufman | 4f7757b | 2015-06-12 10:49:00 +0200 | [diff] [blame] | 861 | * http://ceph.com/docs/master/start/intro/ |