http://bbs.ceph.org.cn/article/83
1. Composition and Preflight
————————————————————————————————————————————————
Composition
Node name  │ User │ OS version   │ Machine type
Admin node │ bees │ Ubuntu 14.04 │ Physical
monitor1   │ bees │ Ubuntu 14.04 │ KVM
osd1       │ bees │ Ubuntu 14.04 │ KVM
osd2       │ bees │ Ubuntu 14.04 │ KVM
Preflight
1. Install the ceph deployment tool (admin node)
$ wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
$ echo deb http://download.ceph.com/debian-{ceph-stable-release}/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
$ sudo apt-get update
$ sudo apt-get install ceph-deploy
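The {ceph-stable-release} placeholder in the echo command above has to be replaced with a release name before the line is written to ceph.list. A minimal sketch of building that line: the function name `ceph_repo_line` is hypothetical, and `jewel` is only an assumption inferred from the 10.2.1 package versions that show up in the logs later in this article.

```shell
# ceph_repo_line RELEASE CODENAME
# Print the apt source line for a given Ceph release and Ubuntu codename.
# (Hypothetical helper; it only formats text and changes nothing on disk.)
ceph_repo_line() {
    printf 'deb http://download.ceph.com/debian-%s/ %s main\n' "$1" "$2"
}

# Example (not run here): write the line for jewel on the running Ubuntu release.
# ceph_repo_line jewel "$(lsb_release -sc)" | sudo tee /etc/apt/sources.list.d/ceph.list
```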
Problem 1:
bees@monitor1:~$ wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
gpg: no valid OpenPGP data found.
Cause:
No proxy was configured for wget.
Solution:
Configure a proxy for wget.
Problem 2:
wget works for the root user but fails for a non-root user (bees in this example).
bees@monitor1:/root$ sudo wget -O release.asc https://download.ceph.com/keys ... 05-09 16:38:03-- https://download.ceph.com/keys/release.asc
Resolving download.ceph.com (download.ceph.com)... failed: No address associated with hostname.
wget: unable to resolve host address download.ceph.com
Cause:
The wget proxy was configured only for the root user.
Solution:
Configure the wget proxy for the non-root user as well (bees in this example).
2. Install and configure the NTP service (all nodes)
Configure NTP on every ceph node and synchronize the clocks. The configuration below is only an example.
$ sudo apt-get install ntp
/etc/ntp.conf
--------------------------------------
#server 0.ubuntu.pool.ntp.org
#server 1.ubuntu.pool.ntp.org
#server 2.ubuntu.pool.ntp.org
#server 3.ubuntu.pool.ntp.org
server 127.127.1.0
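The edit shown above (commenting out the pool servers and falling back to the local clock) could be scripted roughly as follows. This is a sketch only: `use_local_clock` is a hypothetical helper, GNU sed is assumed, and the file path is an argument so that the function can be tried on a copy before touching the real NTP configuration.

```shell
# use_local_clock FILE
# Comment out the ubuntu pool servers and make sure the local clock
# (127.127.1.0) is listed as a server. Hypothetical helper, GNU sed assumed.
use_local_clock() {
    conf=$1
    # Prefix every active "server N.ubuntu.pool.ntp.org" line with '#'.
    sed -i 's/^server [0-9]\.ubuntu\.pool\.ntp\.org/#&/' "$conf"
    # Append the local clock only if it is not already present.
    grep -q '^server 127\.127\.1\.0' "$conf" || echo 'server 127.127.1.0' >> "$conf"
}

# Example (not run here): try it on a copy first.
# cp /etc/ntp.conf /tmp/ntp.conf.test && use_local_clock /tmp/ntp.conf.test
```

Running the function twice leaves the file unchanged the second time, so it is safe to repeat on every node.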
3. Install the SSH server (all nodes)
$ sudo apt-get install openssh-server
4. Passwordless access (admin node)
• Generate a key pair
$ ssh-keygen
Generating public/private key pair.
Enter file in which to save the key (/ceph-admin/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /ceph-admin/.ssh/id_rsa.
Your public key has been saved in /ceph-admin/.ssh/id_rsa.pub.
• Copy the public key to each ceph node
$ ssh-copy-id bees@monitor1
$ ssh-copy-id bees@osd1
$ ssh-copy-id bees@osd2
• Edit the ~/.ssh/config file on the admin node and add the following
Host monitor1
Hostname monitor1
User bees
Host osd1
Hostname osd1
User bees
Host osd2
Hostname osd2
User bees
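Since the three stanzas above differ only in the host name, they could also be generated with a small loop. A sketch, where `ssh_config_stanzas` is a hypothetical helper that just prints text:

```shell
# ssh_config_stanzas USER HOST...
# Print one ~/.ssh/config stanza per host, all using the same login user.
ssh_config_stanzas() {
    user=$1
    shift
    for host in "$@"; do
        printf 'Host %s\n    Hostname %s\n    User %s\n' "$host" "$host" "$user"
    done
}

# Example (not run here): append the stanzas for this article's nodes.
# ssh_config_stanzas bees monitor1 osd1 osd2 >> ~/.ssh/config
```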
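5. Modify the firewall rules (all nodes)
• Disable ufw and remove iptables. Ubuntu does not install firewalld by default.
$ sudo ufw disable
$ sudo apt-get remove iptables
If security is a requirement, defining firewall rules is recommended instead.
6. Configure the apt-get sources (all nodes)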
/etc/apt/sources.list
----------------------------------
deb http://archive.ubuntu.com/ubuntu/ trusty main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ trusty-security main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ trusty-updates main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ trusty-backports main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu/ trusty main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu/ trusty-security main restricted universe multiverse
7. Configure host names (all nodes)
/etc/hosts
-----------------------------------
193.168.123.90 bees1
193.168.123.67 bees2
193.168.123.89 osd1
193.168.123.58 monitor1
193.168.123.145 osd2
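Before running ceph-deploy it may help to confirm that every node name has an entry in the hosts file. A rough sketch: `missing_hosts` is a hypothetical helper that only reads the file given to it.

```shell
# missing_hosts FILE NAME...
# Print each NAME that has no entry in the hosts-format FILE.
missing_hosts() {
    file=$1
    shift
    for name in "$@"; do
        # Skip comment lines, then look for NAME in any column after the address.
        awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit found ? 0 : 1 }
        ' "$file" || echo "$name"
    done
}

# Example (not run here):
# missing_hosts /etc/hosts monitor1 osd1 osd2
```

An empty result means all names were found.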
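2. Quick Install (admin node)
————————————————————————————————————————————————
1. Create a cluster directory to hold the configuration files and keys that ceph-deploy generates
Creating it as a non-root user (bees in this example) is recommended.
$ mkdir my-cluster
$ cd my-cluster
2. Create the cluster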
$ ceph-deploy new monitor1
3. Allow the cluster to reach the active+clean state with only two OSDs. Add the following to the [global] section of the ceph.conf file in the current directory
osd pool default size = 2
4. If the nodes have more than one network interface, also add the public network to the [global] section of ceph.conf
public_network = 193.168.123.0/24
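Both settings can also be appended from the shell. The sketch below assumes that the ceph.conf freshly written by `ceph-deploy new` contains only a [global] section, so that appending at the end of the file lands in that section; `add_conf_line` is a hypothetical helper.

```shell
# add_conf_line FILE LINE
# Append LINE to FILE unless an identical line is already there.
add_conf_line() {
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Example (not run here), from inside the my-cluster directory:
# add_conf_line ceph.conf 'osd pool default size = 2'
# add_conf_line ceph.conf 'public_network = 193.168.123.0/24'
```

If the file already has further sections, the lines need to be placed under [global] by hand instead.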
5. Install ceph on every node
$ ceph-deploy install monitor1 osd1 osd2
Problem: unpacking the ceph-base package fails, so the package has to be installed manually with a forced overwrite.
Preparing to unpack .../ceph-base_10.2.1-1trusty_amd64.deb ...
Unpacking ceph-base (10.2.1-1trusty) ...
dpkg: error processing archive /var/cache/apt/archives/ceph-base_10.2.1-1trusty_amd64.deb (--unpack):
trying to overwrite '/usr/share/man/man8/ceph-deploy.8.gz', which is also in package ceph-deploy 1.4.0-0ubuntu1
Selecting previously unselected package ceph-fs-common.
Preparing to unpack .../ceph-fs-common_10.2.1-1trusty_amd64.deb ...
Unpacking ceph-fs-common (10.2.1-1trusty) ...
Selecting previously unselected package ceph-fuse.
Preparing to unpack .../ceph-fuse_10.2.1-1trusty_amd64.deb ...
Unpacking ceph-fuse (10.2.1-1trusty) ...
Selecting previously unselected package ceph-mds.
Preparing to unpack .../ceph-mds_10.2.1-1trusty_amd64.deb ...
Unpacking ceph-mds (10.2.1-1trusty) ...
Processing triggers for ureadahead (0.100.0-16) ...
Processing triggers for man-db (2.6.7.1-1) ...
Errors were encountered while processing:
/var/cache/apt/archives/ceph-base_10.2.1-1trusty_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
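Cause:
An earlier installation of the ceph-base package had failed. The dpkg output above shows the concrete conflict: ceph-base tries to overwrite /usr/share/man/man8/ceph-deploy.8.gz, which also belongs to the installed ceph-deploy 1.4.0-0ubuntu1 package.
Solution:
$ sudo dpkg -i --force-overwrite /var/cache/apt/archives/ceph-base_10.2.1-1trusty_amd64.deb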
6. Initialize the monitor node
$ ceph-deploy mon create-initial
3. Configure the OSD Nodes (admin node)
————————————————————————————————————————————————
1. Prepare the disks for the OSD daemons. sdb serves as the OSD data disk and sda as the journal disk.
$ ssh osd1
$ sudo mkfs.xfs /dev/sda -f
$ sudo mkfs.xfs /dev/sdb -f
$ exit
$ ssh osd2
$ sudo mkfs.xfs /dev/sda -f
$ sudo mkfs.xfs /dev/sdb -f
$ exit
2. Wipe the disks (partition tables and so on).
$ ceph-deploy disk zap osd1:sda
$ ceph-deploy disk zap osd1:sdb
$ ceph-deploy disk zap osd2:sda
$ ceph-deploy disk zap osd2:sdb
3. Prepare the OSD nodes
$ ceph-deploy osd prepare osd1:sdb:/dev/sda
$ ceph-deploy osd prepare osd2:sdb:/dev/sda
4. Activate the OSD nodes
$ ceph-deploy osd activate osd1:/dev/sdb1:/dev/sda1
$ ceph-deploy osd activate osd2:/dev/sdb1:/dev/sda1
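Steps 2 to 4 repeat the same commands for every OSD host. A sketch of a dry-run helper that prints that command sequence, grouped per host, under the layout used above (sdb for data, sda for the journal). `osd_commands` is a hypothetical name, and the output is meant to be reviewed before it is handed to a shell.

```shell
# osd_commands HOST...
# Print the zap / prepare / activate commands for each OSD host.
# Nothing is executed; the device names match the layout in this article.
osd_commands() {
    data=sdb
    journal=sda
    for host in "$@"; do
        echo "ceph-deploy disk zap $host:$journal"
        echo "ceph-deploy disk zap $host:$data"
        echo "ceph-deploy osd prepare $host:$data:/dev/$journal"
        echo "ceph-deploy osd activate $host:/dev/${data}1:/dev/${journal}1"
    done
}

# Example (not run here):
# osd_commands osd1 osd2
```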
5. Copy the configuration file and the admin keyring to all ceph nodes
$ ceph-deploy admin bees2 monitor1 osd1 osd2
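Problem: after ceph had been uninstalled, the old ceph configuration was left behind, so the newly generated file differs from the existing one and has to be overwritten.
[ceph_deploy.admin][ERROR ] RuntimeError: config file /etc/ceph/ceph.conf exists with different content; use --overwrite-conf to overwrite
Cause:
The ceph configuration file on the admin node was not deleted when ceph was uninstalled, so the newly generated configuration file differs from the old one.
Solution: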
$ ceph-deploy --overwrite-conf admin bees2 monitor1 osd1 osd2
6. Make sure ceph.client.admin.keyring is readable
$ sudo chmod +r /etc/ceph/ceph.client.admin.keyring
7. Check the health of the cluster; it should reach the active+clean state
$ ceph health
HEALTH_OK
$ ceph -s
cluster 54356b3d-be17-4d5c-a8b0-804420caa59d
health HEALTH_OK
monmap e1: 1 mons at {monitor1=193.168.123.58:6789/0}
election epoch 3, quorum 0 monitor1
osdmap e10: 2 osds: 2 up, 2 in
flags sortbitwise
pgmap v23: 64 pgs, 1 pools, 0 bytes data, 0 objects
68380 kB used, 20391 MB / 20457 MB avail
64 active+clean
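When scripting the check in step 7, the first word of the `ceph health` output is enough to tell the states apart. A sketch: `health_status` is a hypothetical helper that takes the text as an argument, so it can be tried without a cluster.

```shell
# health_status TEXT
# Reduce `ceph health` output to ok / warn / err / unknown.
health_status() {
    case $1 in
        HEALTH_OK*)   echo ok ;;
        HEALTH_WARN*) echo warn ;;
        HEALTH_ERR*)  echo err ;;
        *)            echo unknown ;;
    esac
}

# Example (not run here): poll until the cluster reports HEALTH_OK.
# until [ "$(health_status "$(ceph health)")" = ok ]; do sleep 5; done
```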
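4. List of Problems
————————————————————————————————————————————————
The following problems occurred under these conditions:
1) the ceph cluster was configured as the root user;
2) the OSD daemons used ext4-formatted disks.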
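Problem 1
After the virtual machines were installed and set to bridged networking, the VMs on host A could not ping host B and the VMs on host B could not ping host A, although host A and host B could ping each other.
Host A —————————————— Host B (works)
VM on host A ————————— Host B (fails)
Host A —————————————— VM on host B (fails)
Cause:
Restrictions on the company network.
Solution:
Use MAC addresses that are on the company whitelist.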
Problem 2
The following error appeared when refreshing the package sources with apt-get.
root@monitor1:/etc/apt# apt-get update
E: Method http has died unexpectedly!
E: Sub-process http received signal 6.
root@monitor1:/etc/apt#
Cause:
Restrictions on the company network.
Solution:
Use a MAC address that is allowed to reach the external network.
Problem 3
A directory was used as the OSD data store. Activating the OSD failed with the following error.
[osd1][WARNIN] 2016-05-22 16:02:20.403039 7f859771e800 -1 asok(0x7f85a1ffc280) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-osd.0.asok': (13) Permission denied
[osd1][WARNIN] 2016-05-22 16:02:20.403601 7f859771e800 -1 filestore(/var/local/osd1) mkfs: write_version_stamp() failed: (13) Permission denied
[osd1][WARNIN] 2016-05-22 16:02:20.403630 7f859771e800 -1 OSD::mkfs: ObjectStore::mkfs failed with error -13
[osd1][WARNIN] 2016-05-22 16:02:20.403682 7f859771e800 -1 ** ERROR: error creating empty object store in /var/local/osd1: (13) Permission denied
[osd1][WARNIN] Traceback (most recent call last):
[osd1][WARNIN] File "/usr/sbin/ceph-disk", line 9, in <module>
[osd1][WARNIN] load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
[osd1][WARNIN] File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4964, in run
[osd1][WARNIN] main(sys.argv[1:])
[osd1][WARNIN] File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4915, in main
[osd1][WARNIN] args.func(args)
[osd1][WARNIN] File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3277, in main_activate
[osd1][WARNIN] init=args.mark_init,
[osd1][WARNIN] File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3097, in activate_dir
[osd1][WARNIN] (osd_id, cluster) = activate(path, activate_key_template, init)
[osd1][WARNIN] File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3202, in activate
[osd1][WARNIN] keyring=keyring,
[osd1][WARNIN] File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2695, in mkfs
[osd1][WARNIN] '--setgroup', get_ceph_group(),
[osd1][WARNIN] File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 439, in command_check_call
[osd1][WARNIN] return subprocess.check_call(arguments)
[osd1][WARNIN] File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
[osd1][WARNIN] raise CalledProcessError(retcode, cmd)
[osd1][WARNIN] subprocess.CalledProcessError: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', '0', '--monmap', '/var/local/osd1/activate.monmap', '--osd-data', '/var/local/osd1', '--osd-journal', '/var/local/osd1/journal', '--osd-uuid', 'cb9d8962-75f7-4cb1-8a99-ca8044ee283f', '--keyring', '/var/local/osd1/keyring', '--setuser', 'ceph', '--setgroup', 'ceph']' returned non-zero exit status 1
[osd1][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: /usr/sbin/ceph-disk -v activate --mark-init upstart --mount /var/local/osd1
Cause:
Insufficient permissions on /var/local/osd1.
Solution:
Grant full permissions on /var/local/osd1.
root@osd1:/home/bees# chmod 777 /var/local/osd1
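The failing command in the log runs ceph-osd with `--setuser ceph --setgroup ceph`, so what the directory really needs is to be writable by the ceph user. A narrower alternative to mode 777 might therefore be to hand the directory to that user. This is a sketch rather than something tested in the original setup; `owned_by` is a hypothetical helper and GNU stat is assumed.

```shell
# owned_by DIR USER
# Return 0 if DIR is owned by USER, non-zero otherwise (GNU stat assumed).
owned_by() {
    [ "$(stat -c '%U' "$1")" = "$2" ]
}

# Example (not run here): change the owner only when it is not ceph yet.
# owned_by /var/local/osd1 ceph || sudo chown -R ceph:ceph /var/local/osd1
```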
Problem 4
ceph_disk.main.Error: Error: another ceph osd.0 already mounted in position (old/different cluster instance?); unmounting ours.
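Cause:
On the ceph node, an OSD under /var/lib/ceph/osd/ is already using this disk.
Solution:
1. Use a different disk or directory. If the problem persists, use method 2.
2. Delete the OSD under /var/lib/ceph/osd/ that uses this disk.
If the host runs several OSD daemons, take care not to delete the wrong one.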
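For disk-backed OSDs, the mount table shows which device sits behind which OSD directory, which may help to pick the right one before deleting anything. A sketch that reads a file in /proc/mounts format; `osd_mounts` is a hypothetical helper.

```shell
# osd_mounts [FILE]
# Print "device mountpoint" for every mount below /var/lib/ceph/osd/.
# FILE defaults to /proc/mounts and must use the same column layout.
osd_mounts() {
    awk 'index($2, "/var/lib/ceph/osd/") == 1 { print $1, $2 }' "${1:-/proc/mounts}"
}

# Example (not run here):
# osd_mounts
```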
Problem 5
The following appeared when checking the status of the ceph cluster.
root@bees2:/home/my-cluster# ceph health
HEALTH_ERR 64 pgs are stuck inactive for more than 300 seconds; 64 pgs stuck inactive
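Cause:
The disks used by the OSD daemons in this deployment were formatted as ext4.
Solution:
1. Add a new disk, preferably formatted as xfs.
2. Add filestore xattr use omap = true to the [osd] section. Method 2 has not been tried yet.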
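Whether a given OSD path sits on ext4 can be checked before deployment. A sketch assuming a df that supports the -T option (as on Ubuntu); `fs_type` is a hypothetical helper.

```shell
# fs_type PATH
# Print the type of the filesystem that contains PATH (e.g. xfs, ext4).
fs_type() {
    # -P keeps each filesystem on one line; -T adds the type as column 2.
    df -PT "$1" | awk 'NR == 2 { print $2 }'
}

# Example (not run here): warn when an OSD directory is on ext4.
# [ "$(fs_type /var/local/osd1)" = ext4 ] && echo "ext4: consider xfs instead"
```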
Problem 6
root@bees2:/home/my-cluster# ceph -s
cluster 15e780dc-f32c-47f8-8105-54a45aaa167d
health HEALTH_ERR
2 pgs are stuck inactive for more than 300 seconds
62 pgs degraded
64 pgs stale
2 pgs stuck stale
62 pgs stuck unclean
62 pgs undersized
monmap e1: 1 mons at {monitor1=193.168.123.58:6789/0}
election epoch 9, quorum 0 monitor1
osdmap e491: 2 osds: 2 up, 2 in; 62 remapped pgs
flags sortbitwise
pgmap v2421: 64 pgs, 1 pools, 0 bytes data, 0 objects
79208 kB used, 30620 MB / 30697 MB avail
62 stale+active+undersized+degraded
2 stale+active+clean
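Cause:
Not yet known.
Solution:
Uninstall ceph, purge its configuration, and install ceph again. Two recommendations:
1. Run ceph-deploy as a regular (non-root) user.
2. Avoid ext4 disks; xfs is recommended.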
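For the uninstall step, ceph-deploy provides the `purge`, `purgedata` and `forgetkeys` subcommands. A dry-run sketch that prints the sequence for a list of nodes: `purge_commands` is a hypothetical helper, and since these commands remove packages and data, the output should be checked before it is executed.

```shell
# purge_commands NODE...
# Print the ceph-deploy commands that remove ceph and its data from the NODEs.
# Nothing is executed here.
purge_commands() {
    echo "ceph-deploy purge $*"
    echo "ceph-deploy purgedata $*"
    echo "ceph-deploy forgetkeys"
}

# Example (not run here), from inside the my-cluster directory:
# purge_commands monitor1 osd1 osd2
```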