Building a Self-Hosted, Highly Available NFS and S3 Cluster on Ceph


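At the chief engineer's request, I set up a Ceph cluster that provides highly available NFS and S3 services. It is mainly for testing; once we have gained some experience, the same approach can be used to build backup storage for our own use.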

1. Prepare the environment

Create three virtual machines, each with three disks (two of them used as OSDs) and two NICs on separate subnets. The planned IP addresses are:

  • node1 public:192.168.40.61; cluster:172.18.0.61
  • node2 public:192.168.40.62; cluster:172.18.0.62
  • node3 public:192.168.40.63; cluster:172.18.0.63

Edit /etc/hosts on every virtual machine:

[root@node1 ~]# vim /etc/hosts
192.168.40.61   node1
192.168.40.62   node2
192.168.40.63   node3
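
The host entries can also be generated from the address plan instead of being typed by hand on each node. This is a minimal Python sketch, assuming the three-node plan above; the `NODES` table and the `hosts_lines` helper are illustrative names, not part of any Ceph tooling.

```python
# Address plan from this section: name -> (public IP, cluster IP).
NODES = {
    "node1": ("192.168.40.61", "172.18.0.61"),
    "node2": ("192.168.40.62", "172.18.0.62"),
    "node3": ("192.168.40.63", "172.18.0.63"),
}


def hosts_lines(nodes):
    """Return /etc/hosts lines that map each public IP to its node name."""
    lines = []
    for name in sorted(nodes):
        public_ip, _cluster_ip = nodes[name]
        # Pad the address to 16 columns so the names line up.
        lines.append("{:<16}{}".format(public_ip, name))
    return lines


if __name__ == "__main__":
    # Review the output, then append it to /etc/hosts on every node.
    print("\n".join(hosts_lines(NODES)))
```

Only the public addresses are listed, which matches the hosts file shown above.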

Set up passwordless SSH login

[root@node1 ~]# ssh-copy-id root@node1  # if it reports that no key exists, generate one first with ssh-keygen
[root@node1 ~]# ssh-copy-id root@node2
[root@node1 ~]# ssh-copy-id root@node3

Upgrade the kernel

The official recommendation is a 4.x or newer kernel.

Upgrade the kernel as follows.

[root@node1 ~]# rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org  # import the ELRepo public key
[root@node1 ~]# yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm  # install the ELRepo yum repository
[root@node1 ~]# yum --disablerepo="*" --enablerepo="elrepo-kernel" list available  # list the available kernel packages; versions 5.4 and 5.16 were shown
[root@node1 ~]# yum --enablerepo=elrepo-kernel install kernel-ml  # --enablerepo enables the named repository; only elrepo is enabled by default, so elrepo-kernel is enabled explicitly here
# After installation, the new kernel must be set as the default boot entry; it takes effect after a reboot
[root@node1 ~]# awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg  # list all kernels available on the system
[root@node1 ~]# grub2-set-default 0  # 0 is the index of the new kernel in the list above
[root@node1 ~]# grub2-mkconfig -o /boot/grub2/grub.cfg  # regenerate the grub configuration
[root@node1 ~]# reboot  # reboot
[root@node1 ~]# uname -r  # verify
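
The `uname -r` check can be scripted so that every node is verified the same way. This is a minimal sketch; `kernel_at_least` is a hypothetical helper, and the 4.0 threshold simply restates the recommendation quoted above.

```python
import platform
import re


def kernel_at_least(release, major, minor=0):
    """Return True if a `uname -r` style string is at least major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


if __name__ == "__main__":
    # On Linux, platform.release() returns the same string as `uname -r`.
    release = platform.release()
    status = "OK" if kernel_at_least(release, 4) else "too old for this setup"
    print("{} {}".format(release, status))
```

Comparing the version as a tuple of integers avoids the string-comparison trap where "3.10" would sort after "3.9".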

Disable the firewall and SELinux

[root@node1 ~]# systemctl stop firewalld
[root@node1 ~]# systemctl disable firewalld
[root@node1 ~]# setenforce 0
[root@node1 ~]# vi /etc/selinux/config
Set SELINUX to disabled:
SELINUX=disabled

Alternatively, run the following command directly:
[root@node1 ~]# sed -i 's/=enforcing/=disabled/' /etc/selinux/config

Time synchronization

Install chrony on all nodes:

yum -y install  chrony

Configure the chrony service on node1:

[root@node1 ~]# vim /etc/chrony.conf 
server ntp.aliyun.com iburst  # comment out the other server lines
......
#allow 192.168.0.0/16
allow 192.168.40.0/24  # allow clients from this subnet
[root@node1 ~]# systemctl enable chronyd
[root@node1 ~]# systemctl start chronyd

On node2 and node3, remove the other server lines so that node1 is the only server:

[root@node2 ~]# vim /etc/chrony.conf
......
server 192.168.40.61 iburst
[root@node2 ~]# systemctl enable chronyd
[root@node2 ~]# systemctl start chronyd
[root@node2 ~]# chronyc sources -v
210 Number of sources = 1
...
MS Name/IP address         Stratum Poll Reach LastRx Last sample               
===============================================================================
^? node1                         0   8     0     -     +0ns[   +0ns] +/-    0ns
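
In the sample output above, the source is marked `^?` with a Reach value of 0, which means node2 had not yet received a usable reply from node1 when the command was run; once synchronization succeeds the marker becomes `^*`. The sketch below parses `chronyc sources` output and reports the state of each source. The `parse_sources` helper and the `SAMPLE` text are illustrative, and the state meanings follow my reading of the chronyc documentation.

```python
# Meaning of the second character of each source line in `chronyc sources`.
STATES = {
    "*": "selected, the clock is synchronized to this source",
    "+": "acceptable, combined with the selected source",
    "-": "acceptable, but excluded by the combining algorithm",
    "?": "unreachable, or not yet passing all tests",
    "x": "falseticker",
    "~": "time appears too variable",
}

SAMPLE = """210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? node1                         0   8     0     -     +0ns[   +0ns] +/-    0ns
"""


def parse_sources(output):
    """Return a list of (name, state, reach) tuples, one per source line."""
    sources = []
    for line in output.splitlines():
        # Source lines start with a mode character (^ server, = peer,
        # # reference clock) followed by one of the state characters.
        if len(line) < 2 or line[0] not in "^=#" or line[1] not in STATES:
            continue
        fields = line[2:].split()
        # fields: name, stratum, poll, reach (printed in octal), ...
        sources.append((fields[0], line[1], int(fields[3], 8)))
    return sources


if __name__ == "__main__":
    for name, state, reach in parse_sources(SAMPLE):
        print("{}: {} (reach={})".format(name, STATES[state], reach))
```

If the state is still `?` a minute or two after chronyd starts, it is worth checking that node1 allows the 192.168.40.0/24 subnet and that chronyd is running there.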

Configure the yum repositories

Do this on all nodes.

[root@node1 ~]# yum install -y epel-release
[root@node1 ~]# vim /etc/yum.repos.d/ceph.repo 
[Ceph]
name=Ceph packages for $basearch
baseurl=http://mirrors.aliyun.com/ceph/rpm-nautilus/el7/x86_64/
enabled=1
gpgcheck=0
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
[Ceph-noarch]
name=Ceph noarch packages
baseurl=http://mirrors.aliyun.com/ceph/rpm-nautilus/el7/noarch
enabled=1
gpgcheck=0
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
[ceph-source]
name=Ceph source packages
baseurl=http://mirrors.aliyun.com/ceph/rpm-nautilus/el7/SRPMS
enabled=1
gpgcheck=0
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
[root@node1 ~]# yum clean all && yum makecache
[root@node1 ~]# yum update
[root@node1 ~]# yum install ceph-deploy -y  # only on the main node (node1)
[root@node1 ~]# yum install -y ceph ceph-mon ceph-mgr ceph-mgr-dashboard ceph-radosgw ceph-mds ceph-osd  # on all nodes

2. Install and configure Ceph

[root@node1 ~]# mkdir -p /root/my-cluster  # holds the initial Ceph configuration files
[root@node1 ~]# cd ~/my-cluster
[root@node1 my-cluster]# ceph-deploy new --public-network 192.168.40.0/24 --cluster-network 172.18.0.0/24 node1  # create a new Ceph cluster with node1 as the monitor
[root@node1 my-cluster]# ceph-deploy mon create-initial  # initialize the monitor
[root@node1 my-cluster]# ceph-deploy admin node1 node2 node3  # distribute the keyring and configuration file to the nodes
[root@node1 my-cluster]# ceph-deploy mgr create node1  # deploy the Manager daemon
[root@node1 my-cluster]# ceph-deploy mgr create node2 node3
[root@node1 my-cluster]# ceph-deploy mon create node2  # add more monitor nodes
[root@node1 my-cluster]# ceph-deploy mon create node3
[root@node1 my-cluster]# ceph -s
[root@node1 my-cluster]# ceph-deploy osd create node1 --data /dev/sdb  # add the OSDs; confirm the device names with lsblk first
[root@node1 my-cluster]# ceph-deploy osd create node2 --data /dev/sdb
[root@node1 my-cluster]# ceph-deploy osd create node3 --data /dev/sdb
[root@node1 my-cluster]# ceph-deploy osd create node1 --data /dev/sdc
[root@node1 my-cluster]# ceph-deploy osd create node2 --data /dev/sdc
[root@node1 my-cluster]# ceph-deploy osd create node3 --data /dev/sdc
[root@node1 my-cluster]# ceph osd tree  # check the OSD status
[root@node1 my-cluster]# ceph -s  # check the cluster status
[root@node1 my-cluster]# ceph mgr module enable dashboard  # enable the dashboard
[root@node1 my-cluster]# ceph dashboard create-self-signed-cert  # create a self-signed certificate
[root@node1 my-cluster]# ceph dashboard set-login-credentials admin 123456  # create the web login user and password
[root@node1 my-cluster]# ceph mgr services  # show how the services can be reached

Once the dashboard is enabled, the cluster status can be viewed at https://192.168.40.61:8443/.

3. Install and configure nfs-ganesha

nfs-ganesha must be installed on all nodes.

[root@node1 my-cluster]# vim /etc/yum.repos.d/nfs-ganesha.repo
[nfsganesha]
name=nfsganesha
baseurl=https://mirrors.cloud.tencent.com/ceph/nfs-ganesha/rpm-V2.8-stable/nautilus/x86_64/
gpgcheck=0
enabled=1
[root@node1 my-cluster]# yum makecache
[root@node1 my-cluster]# yum install -y nfs-ganesha nfs-ganesha-ceph  nfs-ganesha-rados-grace nfs-ganesha-rgw nfs-utils rpcbind haproxy keepalived
[root@node1 my-cluster]# ceph-deploy mds create node1 node2 node3
[root@node1 my-cluster]# ceph osd pool create fs-meta 32
[root@node1 my-cluster]# ceph osd pool create fs-data 128
[root@node1 my-cluster]# ceph fs new cephfs fs-meta fs-data  # create the CephFS file system
[root@node1 my-cluster]# ceph fs ls  # list the file systems
[root@node1 my-cluster]# ceph-deploy --overwrite-conf admin node1 node2 node3   # push the updated configuration file
[root@node1 my-cluster]# mkdir /mnt/cephfs
[root@node1 my-cluster]# mount -t ceph 192.168.40.61:/ /mnt/cephfs/ -o name=admin,secret=AQDXrtNhaD/VOBAAuVtilymuHIkb9elyH6bCVQ==  # test mount; the secret value comes from /etc/ceph/ceph.client.admin.keyring
[root@node1 my-cluster]# mkdir -p /mnt/cephfs/nfs1  # create two directories
[root@node1 my-cluster]# mkdir -p /mnt/cephfs/nfs2
[root@node1 my-cluster]# vim /etc/ganesha/ganesha.conf  # add the following content
NFS_CORE_PARAM {
    Enable_NLM = false;
    NFS_Port = 52049;
    Enable_RQUOTA = false;
}
EXPORT_DEFAULTS {
    Access_Type = RW;
    Anonymous_uid = 65534;
    Anonymous_gid = 65534;
}
LOG {
    Default_Log_Level = INFO;
    Facility {
        name = FILE;
        destination = "/var/log/ganesha/ganesha.log";
        enable = active;
    }
}

NFSv4 {
    # Delegations = false;
    # RecoveryBackend = 'rados_cluster';
    # Minor_Versions = 1,2;
}

EXPORT
{
        Export_Id = 1;
        Path = /nfs1;
        Pseudo = /nfs1;
        Squash = no_root_squash;
        Access_Type = RW;
        FSAL {
            secret_access_key = "AQDXrtNhaD/VOBAAuVtilymuHIkb9elyH6bCVQ==";
            user_id = "admin";
            name = "CEPH";
            filesystem = "cephfs";
        }
}
EXPORT
{
        Export_Id = 2;
        Path = /nfs2;
        Pseudo = /nfs2;
        Squash = no_root_squash;
        Access_Type = RW;
        FSAL {
            secret_access_key = "AQDXrtNhaD/VOBAAuVtilymuHIkb9elyH6bCVQ==";
            user_id = "admin";
            name = "CEPH";
            filesystem = "cephfs";
        }
}

[root@node1 my-cluster]# systemctl start nfs-ganesha
[root@node1 my-cluster]# systemctl enable nfs-ganesha
[root@node1 my-cluster]# scp /etc/ganesha/ganesha.conf root@node2:/etc/ganesha/ganesha.conf  # copy the configuration file to the other two nodes
[root@node1 my-cluster]# scp /etc/ganesha/ganesha.conf root@node3:/etc/ganesha/ganesha.conf
[root@node1 my-cluster]# ssh root@node2 systemctl start nfs-ganesha  # start nfs-ganesha on node2 and node3
[root@node1 my-cluster]# ssh root@node3 systemctl start nfs-ganesha
[root@node1 my-cluster]# ssh root@node2 systemctl enable nfs-ganesha
[root@node1 my-cluster]# ssh root@node3 systemctl enable nfs-ganesha
[root@node1 my-cluster]# systemctl status nfs-ganesha
[root@node1 my-cluster]# showmount -e node1
Export list for node1:
/nfs1 (everyone)
/nfs2 (everyone)
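
The secret used in the mount command and in ganesha.conf in this section is the key stored in /etc/ceph/ceph.client.admin.keyring. The sketch below pulls that key out of a keyring file so it does not have to be copied by hand. It assumes the usual INI-like keyring layout; `read_key` is a hypothetical helper and the key in `SAMPLE` is a placeholder, not a real credential.

```python
import re

# Assumed layout of a Ceph keyring file; the key below is a placeholder.
SAMPLE = """[client.admin]
\tkey = AQBplaceholderplaceholderplaceholderxx==
\tcaps mds = "allow *"
\tcaps mon = "allow *"
"""


def read_key(keyring_text, entity="client.admin"):
    """Return the key for the given entity, or None if it is not present."""
    section = None
    for raw in keyring_text.splitlines():
        line = raw.strip()
        header = re.match(r"\[(.+)\]$", line)
        if header:
            # Remember which entity the following lines belong to.
            section = header.group(1)
            continue
        entry = re.match(r"key\s*=\s*(\S+)$", line)
        if entry and section == entity:
            return entry.group(1)
    return None


if __name__ == "__main__":
    # Parses the placeholder sample; on a node, read the real keyring instead.
    print(read_key(SAMPLE))
```

Using the admin key for NFS exports is convenient for a test cluster; a dedicated, more restricted client would be the safer choice outside of testing.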

4. Configure the S3 service

[root@node1 my-cluster]# ceph-deploy rgw create node1 node2
[root@node1 my-cluster]# systemctl status ceph-radosgw@rgw.node1.service
[root@node1 my-cluster]# ceph -s

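While configuring S3, the rgw service failed to start.
After some searching, I found the likely cause: pg_num < pgp_num or mon_max_pg_per_osd exceeded. The following command runs rgw in the foreground for debugging: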

[root@node1 my-cluster]# /usr/bin/radosgw -d --cluster ceph --name client.rgw.node1 --setuser ceph --setgroup ceph --debug-rgw=2
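It looks like there were too many PGs: I had manually created several pools earlier, whereas the rgw service creates the pools it needs automatically (except the data pool), so there was no need to create them by hand.
Delete the unneeded pool:
[root@node1 my-cluster]# ceph osd pool delete .rgw.control .rgw.control --yes-i-really-really-mean-it 
[root@node1 my-cluster]# systemctl start ceph-radosgw@rgw.node1.service
[root@node1 my-cluster]# systemctl status ceph-radosgw@rgw.node1.service  # the rgw service is now running normally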
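
The mon_max_pg_per_osd limit mentioned above is easier to reason about with a quick estimate of how many PG replicas each OSD carries. This is a rough sketch rather than Ceph's exact accounting: it assumes a replica size of 3 for every pool and a limit of 250 (both should be checked against the actual cluster), and it only includes pools whose sizes appear in this article, since the sizes of the extra hand-made rgw pools are not given.

```python
def pgs_per_osd(pools, num_osds):
    """Average number of PG replicas per OSD.

    pools is a list of (pg_num, replica_size) pairs.
    """
    total_replicas = sum(pg_num * size for pg_num, size in pools)
    return total_replicas / float(num_osds)


NUM_OSDS = 6               # two OSD disks on each of the three nodes
MON_MAX_PG_PER_OSD = 250   # assumed limit; verify on your own cluster

# Pools with known sizes, assuming replica size 3 for each.
CEPHFS_POOLS = [(32, 3), (128, 3)]   # fs-meta, fs-data
RGW_DATA_POOL = [(64, 3)]            # default.rgw.buckets.data

if __name__ == "__main__":
    before = pgs_per_osd(CEPHFS_POOLS, NUM_OSDS)
    after = pgs_per_osd(CEPHFS_POOLS + RGW_DATA_POOL, NUM_OSDS)
    print("CephFS pools only: {:.0f} PGs per OSD".format(before))
    print("With rgw data pool: {:.0f} PGs per OSD".format(after))
    print("Headroom: {:.0f}".format(MON_MAX_PG_PER_OSD - after))
```

With these numbers the known pools stay well under the assumed limit, so any overshoot would have to come from the additional pools created by hand, which is consistent with the fix described above.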

Create the storage pools

[root@node1 my-cluster]# ceph osd crush class ls
[
    "hdd"
]
[root@node1 my-cluster]# ceph osd crush rule create-replicated rule-hdd default host hdd
[root@node1 my-cluster]# ceph osd crush rule ls
replicated_rule
rule-hdd
[root@node1 my-cluster]# ceph osd pool create default.rgw.buckets.data 64  # create the data pool; default.rgw.buckets.index already exists and need not be created again
[root@node1 my-cluster]# ceph osd pool application enable default.rgw.buckets.data rgw
[root@node1 my-cluster]# ceph osd pool application enable default.rgw.buckets.index rgw
Change the CRUSH rule of every pool; run on node1:
[root@node1 my-cluster]# for i in `ceph osd lspools | grep -v data | awk '{print $2}'`; do ceph osd pool set $i crush_rule rule-hdd; done
[root@node1 my-cluster]# ceph osd pool set default.rgw.buckets.data crush_rule rule-hdd
[root@node1 my-cluster]# radosgw-admin user create --uid="testuser" --display-name="First User"  # create an S3 user
{
    "user_id": "testuser",
    "display_name": "First User",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "testuser",
            "access_key": "LOLDHV9L1CS12586AQ4Y",
            "secret_key": "9aHAOD8vOTwTI5OpBAbVPL35QqJA8yfZLQOI7jHJ"
        }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}
[root@node1 my-cluster]# yum install python-boto  # install the python-boto module
[root@node1 my-cluster]# vi s3test.py  # a simple script that creates a bucket
import boto
import boto.s3.connection

# Keys returned by `radosgw-admin user create` above
access_key = 'LOLDHV9L1CS12586AQ4Y'
secret_key = '9aHAOD8vOTwTI5OpBAbVPL35QqJA8yfZLQOI7jHJ'
conn = boto.connect_s3(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        host='node1', port=7480,  # rgw endpoint on node1
        is_secure=False, calling_format=boto.s3.connection.OrdinaryCallingFormat(),
       )

# Create a bucket, then list every bucket with its creation date
bucket = conn.create_bucket('my-new-bucket')
for bucket in conn.get_all_buckets():
    print("{name} {created}".format(
        name=bucket.name,
        created=bucket.creation_date,
    ))
[root@node1 my-cluster]# python s3test.py
my-new-bucket 2022-01-21T01:39:11.186Z  # the bucket was created successfully

You can also use S3 Browser to test creating and deleting buckets and uploading and downloading objects.

5. Configure haproxy

[root@node1 my-cluster]# vim /etc/haproxy/haproxy.cfg
global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     8000
    user        haproxy
    group       haproxy
    daemon
    stats socket /var/lib/haproxy/stats
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
#    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 8000

listen stats
   bind *:9090
   mode http
   stats enable
   stats uri /
   stats refresh 5s
   stats realm Haproxy\ Stats
   stats auth admin:admin

frontend nfs-in
    bind 192.168.40.64:2049
    mode tcp
    option tcplog
    default_backend         nfs-back

frontend s3-in
    bind 192.168.40.64:58080
    mode tcp
    option tcplog
    default_backend         s3-back

frontend dashboard-in
    bind 192.168.40.64:8888
    mode tcp
    option tcplog
    default_backend         dashboard-back

backend nfs-back
    balance     source
    mode        tcp
    log         /dev/log local0 debug
    server      node1   192.168.40.61:52049 check
    server      node2   192.168.40.62:52049 check
    server      node3   192.168.40.63:52049 check

backend s3-back
    balance     source
    mode        tcp
    log         /dev/log local0 debug
    server      node1   192.168.40.61:7480 check
    server      node2   192.168.40.62:7480 check

backend dashboard-back
    balance     source
    mode        tcp
    log         /dev/log local0 debug
    server      node1   192.168.40.61:8443 check  
    server      node2   192.168.40.62:8443 check
    server      node3   192.168.40.63:8443 check

[root@node1 my-cluster]# systemctl start haproxy
[root@node1 my-cluster]# systemctl enable haproxy
[root@node1 my-cluster]# scp /etc/haproxy/haproxy.cfg root@node2:/etc/haproxy/haproxy.cfg
[root@node1 my-cluster]# ssh root@node2 systemctl start haproxy
[root@node1 my-cluster]# ssh root@node2 systemctl enable haproxy
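
Before relying on haproxy, it helps to confirm that every backend port in haproxy.cfg actually accepts TCP connections, which is roughly what the `check` keyword does for each server line. The sketch below does the same from the command line; `port_open` and the `BACKENDS` table are illustrative names, with the addresses and ports copied from the configuration above.

```python
import socket

# Backend servers as listed in haproxy.cfg above.
BACKENDS = {
    "nfs-back": [("192.168.40.61", 52049), ("192.168.40.62", 52049), ("192.168.40.63", 52049)],
    "s3-back": [("192.168.40.61", 7480), ("192.168.40.62", 7480)],
    "dashboard-back": [("192.168.40.61", 8443), ("192.168.40.62", 8443), ("192.168.40.63", 8443)],
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True


if __name__ == "__main__":
    for backend, servers in sorted(BACKENDS.items()):
        for host, port in servers:
            state = "up" if port_open(host, port) else "DOWN"
            print("{:<15} {}:{} {}".format(backend, host, port, state))
```

A plain TCP connect only shows that something is listening; it does not prove that the NFS or S3 service behind the port is healthy.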

6. Configure keepalived

Configure keepalived in preemptive mode: node1 is MASTER with priority 200, and node2 and node3 are BACKUP with priorities 150 and 100 respectively.

[root@node1 my-cluster]# vim /etc/keepalived/keepalived.conf
global_defs {
   router_id CEPH_NFS  # identifier; any string will do
}

vrrp_script check_haproxy {  # health-check script to run
    script "killall -0 haproxy"
    weight -20
    interval 2
    rise 2
    fall 2
}

vrrp_instance VI_0 {
    state MASTER  # node1 is MASTER; change to BACKUP on the other nodes
    priority 200  # the node with the highest priority takes the VIP; set 150 and 100 on the other two nodes
    interface ens192  # network interface
    virtual_router_id 51  # must be 51 on all three nodes
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    virtual_ipaddress {
        192.168.40.64/24 dev ens192  # virtual IP
    }
    track_script {
        check_haproxy
    }
}

[root@node1 my-cluster]# systemctl start keepalived
[root@node1 my-cluster]# systemctl enable keepalived
Remember to adjust the keepalived configuration on node2 and node3 (state and priority):
[root@node1 my-cluster]# ssh root@node2 systemctl start keepalived
[root@node1 my-cluster]# ssh root@node2 systemctl enable keepalived
[root@node1 my-cluster]# ssh root@node3 systemctl start keepalived
[root@node1 my-cluster]# ssh root@node3 systemctl enable keepalived
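
How the `weight` of the tracked script interacts with the priorities is worth working through. As I understand keepalived's semantics, a negative weight is added to the priority while the check fails, and the node advertising the highest resulting priority holds the VIP. The sketch below applies that rule to the values in this section; `effective_priority` and `vip_holder` are illustrative helpers, not keepalived APIs.

```python
def effective_priority(base, weight, check_ok):
    """Priority a node advertises, given one tracked script.

    A negative weight lowers the priority while the check fails;
    a positive weight raises it while the check succeeds.
    """
    if weight < 0 and not check_ok:
        return base + weight
    if weight > 0 and check_ok:
        return base + weight
    return base


def vip_holder(nodes, weight):
    """nodes maps name -> (base priority, is the haproxy check passing?)."""
    return max(
        nodes,
        key=lambda name: effective_priority(nodes[name][0], weight, nodes[name][1]),
    )


if __name__ == "__main__":
    # haproxy has died on node1, but the node itself is still running.
    nodes = {"node1": (200, False), "node2": (150, True), "node3": (100, True)}
    print(vip_holder(nodes, -20))  # node1: 200 - 20 = 180 still beats 150
    print(vip_holder(nodes, -60))  # node2: 200 - 60 = 140 is below 150
```

Under these assumptions, a weight of -20 is not enough to move the VIP when only haproxy fails on node1, because the gap to node2 is 50. A full node reboot still triggers failover, since node1 stops sending advertisements altogether. If failover on an haproxy crash is wanted, the weight would need a magnitude greater than the priority gap.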

7. Test and verify high availability

  • NFS

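On a test machine, mount nfs1 and write some files. At this point the virtual IP is on node1.
Reboot node1: the process that is writing hangs. This is expected, because NFS v4.1 is not in use and the session cannot be resumed.
Press Ctrl+C to kill the hung process and write again. The write succeeds, which shows that the VIP migrated correctly and NFS has reached basic high availability.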

  • S3

Upload a file with S3 Browser and reboot node1: the upload stalls for a few seconds and then continues.
The S3 service has reached basic high availability.