blob: 579ac19a73cdf801b188749785e5f371f20e2b46 [file] [log] [blame]
============
Ceph formula
============
Ceph provides extraordinary data storage scalability. Thousands of client
hosts or KVMs accessing petabytes to exabytes of data. Each one of your
applications can use the object, block or file system interfaces to the same
RADOS cluster simultaneously, which means your Ceph storage system serves as a
flexible foundation for all of your data storage needs.
Use salt-formula-linux for initial disk partitioning.
Daemons
--------
Ceph uses several daemons to handle data and cluster state. Each daemon type requires different computing capacity and hardware optimization.
These daemons are currently supported by formula:
* MON (`ceph.mon`)
* OSD (`ceph.osd`)
* RGW (`ceph.radosgw`)
Architecture decisions
-----------------------
Please refer to upstream achritecture documents before designing your cluster. Solid understanding of Ceph principles is essential for making architecture decisions described bellow.
http://docs.ceph.com/docs/master/architecture/
* Ceph version
There is 3 or 4 stable releases every year and many of nighty/dev release. You should decide which version will be used since the only stable releases are recommended for production. Some of the releases are marked LTS (Long Term Stable) and these releases receive bugfixed for longer period - usually until next LTS version is released.
* Number of MON daemons
Use 1 MON daemon for testing, 3 MONs for smaller production clusters and 5 MONs for very large production cluster. There is no need to have more than 5 MONs in normal environment because there isn't any significant benefit in running more than 5 MONs. Ceph require MONS to form quorum so you need to heve more than 50% of the MONs up and running to have fully operational cluster. Every I/O operation will stop once less than 50% MONs is availabe because they can't form quorum.
* Number of PGs
Placement groups are providing mappping between stored data and OSDs. It is necessary to calculate number of PGs because there should be stored decent amount of PGs on each OSD. Please keep in mind *decreasing number of PGs* isn't possible and *increading* can affect cluster performance.
http://docs.ceph.com/docs/master/rados/operations/placement-groups/
http://ceph.com/pgcalc/
* Daemon colocation
It is recommended to dedicate nodes for MONs and RWG since colocation can have and influence on cluster operations. Howerver, small clusters can be running MONs on OSD node but it is critical to have enough of resources for MON daemons because they are the most important part of the cluster.
Installing RGW on node with other daemons isn't recommended because RGW daemon usually require a lot of bandwith and it harm cluster health.
* Journal location
There are two way to setup journal:
* **Colocated** journal is located (usually at the beginning) on the same disk as partition for the data. This setup is easier for installation and it doesn't require any other disk to be used. However, colocated setup is significantly slower than dedicated)
* **Dedicate** journal is placed on different disk than data. This setup can deliver much higher performance than colocated but it require to have more disks in servers. Journal drives should be carefully selected because high I/O and durability is required.
* Store type (Bluestore/Filestore)
Recent version of Ceph support Bluestore as storage backend and backend should be used if available.
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
* Cluster and public network
Ceph cluster is accessed using network and thus you need to have decend capacity to handle all the client. There are two networks required for cluster: **public** network and cluster network. Public network is used for client connections and MONs and OSDs are listening on this network. Second network ic called **cluster** networks and this network is used for communication between OSDs.
Both networks should have dedicated interfaces, bonding interfaces and dedicating vlans on bonded interfaces isn't allowed. Good practise is dedicate more throughput for the cluster network because cluster traffic is more important than client traffic.
* Pool parameters (size, min_size, type)
You should setup each pool according to it's expected usage, at least `min_size` and `size` and pool type should be considered.
* Cluster monitoring
* Hardware
Please refer to upstream hardware recommendation guide for general information about hardware.
Ceph servers are required to fulfil special requirements becauce load generated by Ceph can be diametrically opposed to common load.
http://docs.ceph.com/docs/master/start/hardware-recommendations/
Basic management commands
------------------------------
Cluster
********
- :code:`ceph health` - check if cluster is healthy (:code:`ceph health detail` can provide more information)
.. code-block:: bash
root@c-01:~# ceph health
HEALTH_OK
- :code:`ceph status` - shows basic information about cluster
.. code-block:: bash
root@c-01:~# ceph status
cluster e2dc51ae-c5e4-48f0-afc1-9e9e97dfd650
health HEALTH_OK
monmap e1: 3 mons at {1=192.168.31.201:6789/0,2=192.168.31.202:6789/0,3=192.168.31.203:6789/0}
election epoch 38, quorum 0,1,2 1,2,3
osdmap e226: 6 osds: 6 up, 6 in
pgmap v27916: 400 pgs, 2 pools, 21233 MB data, 5315 objects
121 GB used, 10924 GB / 11058 GB avail
400 active+clean
client io 481 kB/s rd, 132 kB/s wr, 185 op/
MON
****
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/
OSD
****
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
- :code:`ceph osd tree` - show all OSDs and it's state
.. code-block:: bash
root@c-01:~# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-4 0 host c-04
-1 10.79993 root default
-2 3.59998 host c-01
0 1.79999 osd.0 up 1.00000 1.00000
1 1.79999 osd.1 up 1.00000 1.00000
-3 3.59998 host c-02
2 1.79999 osd.2 up 1.00000 1.00000
3 1.79999 osd.3 up 1.00000 1.00000
-5 3.59998 host c-03
4 1.79999 osd.4 up 1.00000 1.00000
5 1.79999 osd.5 up 1.00000 1.00000
- :code:`ceph osd pools ls` - list of pool
.. code-block:: bash
root@c-01:~# ceph osd lspools
0 rbd,1 test
PG
***
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg
- :code:`ceph pg ls` - list placement groups
.. code-block:: bash
root@c-01:~# ceph pg ls | head -n 4
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
0.0 11 0 0 0 0 46137344 3044 3044 active+clean 2015-07-02 10:12:40.603692 226'10652 226:1798 [4,2,0] 4 [4,2,0] 4 0'0 2015-07-01 18:38:33.126953 0'0 2015-07-01 18:17:01.904194
0.1 7 0 0 0 0 25165936 3026 3026 active+clean 2015-07-02 10:12:40.585833 226'5808 226:1070 [2,4,1] 2 [2,4,1] 2 0'0 2015-07-01 18:38:32.352721 0'0 2015-07-01 18:17:01.904198
0.2 18 0 0 0 0 75497472 3039 3039 active+clean 2015-07-02 10:12:39.569630 226'17447 226:3213 [3,1,5] 3 [3,1,5] 3 0'0 2015-07-01 18:38:34.308228 0'0 2015-07-01 18:17:01.904199
- :code:`ceph pg map 1.1` - show mapping between PG and OSD
.. code-block:: bash
root@c-01:~# ceph pg map 1.1
osdmap e226 pg 1.1 (1.1) -> up [5,1,2] acting [5,1,2]
Sample pillars
==============
Common metadata for all nodes/roles
.. code-block:: yaml
ceph:
common:
version: luminous
config:
global:
param1: value1
param2: value1
param3: value1
pool_section:
param1: value2
param2: value2
param3: value2
fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
members:
- name: cmn01
host: 10.0.0.1
- name: cmn02
host: 10.0.0.2
- name: cmn03
host: 10.0.0.3
keyring:
admin:
key: AQBHPYhZv5mYDBAAvisaSzCTQkC5gywGUp/voA==
caps:
mds: "allow *"
mgr: "allow *"
mon: "allow *"
osd: "allow *"
bootstrap-osd:
key: BQBHPYhZv5mYDBAAvisaSzCTQkC5gywGUp/voA==
caps:
mon: "allow profile bootstrap-osd"
Optional definition for cluster and public networks. Cluster network is used
for replication. Public network for front-end communication.
.. code-block:: yaml
ceph:
common:
version: luminous
fsid: a619c5fc-c4ed-4f22-9ed2-66cf2feca23d
....
public_network: 10.0.0.0/24, 10.1.0.0/24
cluster_network: 10.10.0.0/24, 10.11.0.0/24
Ceph mon (control) roles
------------------------
Monitors: A Ceph Monitor maintains maps of the cluster state, including the
monitor map, the OSD map, the Placement Group (PG) map, and the CRUSH map.
Ceph maintains a history (called an “epoch”) of each state change in the Ceph
Monitors, Ceph OSD Daemons, and PGs.
.. code-block:: yaml
ceph:
common:
config:
mon:
key: value
mon:
enabled: true
keyring:
mon:
key: AQAnQIhZ6in5KxAAdf467upoRMWFcVg5pbh1yg==
caps:
mon: "allow *"
admin:
key: AQBHPYhZv5mYDBAAvisaSzCTQkC5gywGUp/voA==
caps:
mds: "allow *"
mgr: "allow *"
mon: "allow *"
osd: "allow *"
Ceph mgr roles
------------------------
The Ceph Manager daemon (ceph-mgr) runs alongside monitor daemons, to provide additional monitoring and interfaces to external monitoring and management systems. Since the 12.x (luminous) Ceph release, the ceph-mgr daemon is required for normal operations. The ceph-mgr daemon is an optional component in the 11.x (kraken) Ceph release.
By default, the manager daemon requires no additional configuration, beyond ensuring it is running. If there is no mgr daemon running, you will see a health warning to that effect, and some of the other information in the output of ceph status will be missing or stale until a mgr is started.
.. code-block:: yaml
ceph:
mgr:
enabled: true
dashboard:
enabled: true
host: 10.103.255.252
port: 7000
Ceph OSD (storage) roles
------------------------
.. code-block:: yaml
ceph:
common:
config:
osd:
key: value
osd:
enabled: true
ceph_host_id: '39'
journal_size: 20480
bluestore_block_db_size: 1073741824 (1G)
bluestore_block_wal_size: 1073741824 (1G)
bluestore_block_size: 807374182400 (800G)
backend:
filestore:
disks:
- dev: /dev/sdm
enabled: false
rule: hdd
journal: /dev/ssd
fs_type: xfs
class: bestssd
weight: 1.5
- dev: /dev/sdl
rule: hdd
journal: /dev/ssd
fs_type: xfs
class: bestssd
weight: 1.5
bluestore:
disks:
- dev: /dev/sdb
- dev: /dev/sdc
block_db: /dev/ssd
block_wal: /dev/ssd
class: ssd
weight: 1.666
- dev: /dev/sdd
enabled: false
Ceph client roles - ...Deprecated - use ceph:common instead
--------------------------------------------------------
Simple ceph client service
.. code-block:: yaml
ceph:
client:
config:
global:
mon initial members: ceph1,ceph2,ceph3
mon host: 10.103.255.252:6789,10.103.255.253:6789,10.103.255.254:6789
keyring:
monitoring:
key: 00000000000000000000000000000000000000==
At OpenStack control settings are usually located at cinder-volume or glance-
registry services.
.. code-block:: yaml
ceph:
client:
config:
global:
fsid: 00000000-0000-0000-0000-000000000000
mon initial members: ceph1,ceph2,ceph3
mon host: 10.103.255.252:6789,10.103.255.253:6789,10.103.255.254:6789
osd_fs_mkfs_arguments_xfs:
osd_fs_mount_options_xfs: rw,noatime
network public: 10.0.0.0/24
network cluster: 10.0.0.0/24
osd_fs_type: xfs
osd:
osd journal size: 7500
filestore xattr use omap: true
mon:
mon debug dump transactions: false
keyring:
cinder:
key: 00000000000000000000000000000000000000==
glance:
key: 00000000000000000000000000000000000000==
Ceph gateway
------------
Rados gateway with keystone v2 auth backend
.. code-block:: yaml
ceph:
radosgw:
enabled: true
hostname: gw.ceph.lab
bind:
address: 10.10.10.1
port: 8080
identity:
engine: keystone
api_version: 2
host: 10.10.10.100
port: 5000
user: admin
password: password
tenant: admin
Rados gateway with keystone v3 auth backend
.. code-block:: yaml
ceph:
radosgw:
enabled: true
hostname: gw.ceph.lab
bind:
address: 10.10.10.1
port: 8080
identity:
engine: keystone
api_version: 3
host: 10.10.10.100
port: 5000
user: admin
password: password
project: admin
domain: default
Ceph setup role
---------------
Replicated ceph storage pool
.. code-block:: yaml
ceph:
setup:
pool:
replicated_pool:
pg_num: 256
pgp_num: 256
type: replicated
crush_ruleset_name: 0
Erasure ceph storage pool
.. code-block:: yaml
ceph:
setup:
pool:
erasure_pool:
pg_num: 256
pgp_num: 256
type: erasure
crush_ruleset_name: 0
Generate CRUSH map
--------------------
It is required to define the `type` for crush buckets and these types must start with `root` (top) and end with `host`. OSD daemons will be assigned to hosts according to it's hostname. Weight of the buckets will be calculated according to weight of it's childen.
.. code-block:: yaml
ceph:
setup:
crush:
enabled: True
tunables:
choose_total_tries: 50
type:
- root
- region
- rack
- host
root:
- name: root1
- name: root2
region:
- name: eu-1
parent: root1
- name: eu-2
parent: root1
- name: us-1
parent: root2
rack:
- name: rack01
parent: eu-1
- name: rack02
parent: eu-2
- name: rack03
parent: us-1
rule:
sata:
ruleset: 0
type: replicated
min_size: 1
max_size: 10
steps:
- take crushroot.performanceblock.satahss.1
- choseleaf firstn 0 type failure_domain
- emit
Ceph monitoring
---------------
Collect general cluster metrics
.. code-block:: yaml
ceph:
monitoring:
cluster_stats:
enabled: true
ceph_user: monitoring
Collect metrics from monitor and OSD services
.. code-block:: yaml
ceph:
monitoring:
node_stats:
enabled: true
More information
================
* https://github.com/cloud-ee/ceph-salt-formula
* http://ceph.com/ceph-storage/
* http://ceph.com/docs/master/start/intro/
Documentation and bugs
======================
To learn how to install and update salt-formulas, consult the documentation
available online at:
http://salt-formulas.readthedocs.io/
In the unfortunate event that bugs are discovered, they should be reported to
the appropriate issue tracker. Use Github issue tracker for specific salt
formula:
https://github.com/salt-formulas/salt-formula-ceph/issues
For feature requests, bug reports or blueprints affecting entire ecosystem,
use Launchpad salt-formulas project:
https://launchpad.net/salt-formulas
You can also join salt-formulas-users team and subscribe to mailing list:
https://launchpad.net/~salt-formulas-users
Developers wishing to work on the salt-formulas projects should always base
their work on master branch and submit pull request against specific formula.
https://github.com/salt-formulas/salt-formula-ceph
Any questions or feedback is always welcome so feel free to join our IRC
channel:
#salt-formulas @ irc.freenode.net