Ansible Best Practices

Author: Haohao Zhang

Summary

To manage thousands of servers, we need a deployment tool that can handle all kinds of tasks.

The most widely used tools are Puppet, SaltStack and Ansible.

Puppet and SaltStack both rely on an agent running on every managed server. Ansible is agentless, which is an advantage: you do not need yet another tool to manage the agents.

Ansible is also written in Python and ships with a large number of modules.

You can develop your own modules and contribute them back to the community.

Ansible uses the SSH protocol to transfer data.

Here are some best practices that we want to share with you.

Practice1

Problem:

When you run the ansible or ansible-playbook command, the result is printed to the terminal. Sometimes you want the command to run in the background and write its output to a log file. You might reach for “nohup”, but the result is a disaster.

nohup   ansible-playbook -i inventory main.yml  -k -K -U root -u test -s -f 10  > ansible_log


Output:

  File "/usr/lib/python2.6/site-packages/ansible/runner/connection_plugins/ssh.py", line 162, in _communicate
    rfd, wfd, efd = select.select(rpipes, [], rpipes, 1)
ValueError: filedescriptor out of range in select()


Reason:

The Python client uses select() to wait for socket activity. select() is used because it is available on most platforms. However, select() has a hard limit on the value of a file descriptor. If a socket is created whose file descriptor is greater than or equal to FD_SETSIZE (1024 on Linux), the exception shown above is raised.

Note well: this is caused by the value of the fd, not the number of open fds.
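
The point that the value of the descriptor matters, not the count, can be checked with a small Python sketch. It assumes CPython on a platform where FD_SETSIZE is far below 1000000 (it is 1024 on Linux). The helper name fits_in_select is hypothetical and is not part of Ansible.

```python
import os
import select


def fits_in_select(fd):
    """Return True if select() accepts this descriptor value."""
    try:
        select.select([fd], [], [], 0)
        return True
    except ValueError:
        # CPython rejects values >= FD_SETSIZE before calling select(2).
        return False


r, w = os.pipe()
try:
    # A freshly created pipe gets a small descriptor value: accepted.
    low_ok = fits_in_select(r)
    # A large value is rejected even though only a handful of fds are open.
    high_ok = fits_in_select(1000000)
finally:
    os.close(r)
    os.close(w)

print(low_ok, high_ok)
```

Interfaces such as select.poll() are not bound by FD_SETSIZE, which is why long-running programs that may hold many descriptors often prefer them.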

Related issues in community:

https://github.com/ansible/ansible/issues/10157

https://issues.apache.org/jira/browse/QPID-5588

https://github.com/ansible/ansible/issues/14143

Reproduce:

cat test.py

#!/usr/bin/env python
# Python 2 script (the print statements use Python 2 syntax).
import subprocess
import os
import select
import getpass

host = 'example.com'
timeout = 5
password = getpass.getpass(prompt="Enter your password:")

for i in range(10):
    # Pipe used to pass the password to sshpass through its -d<fd> option.
    (r, w) = os.pipe()
    ssh_cmd = ['sshpass', '-d%d' % r]
    ssh_cmd += ['ssh', host, 'uptime']
    print ssh_cmd
    # Start the child first so that it inherits the read end of the pipe.
    p = subprocess.Popen(ssh_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    os.close(r)
    os.write(w, password + '\n')
    os.close(w)
    rpipes = [p.stdout, p.stderr]
    # Show which descriptor values the child's pipes received.
    print "file descriptor: %r" % [x.fileno() for x in rpipes]
    rfd, wfd, efd = select.select(rpipes, [], [p.stderr], timeout)
    if rfd:
        print p.stdout.read()
        print p.stderr.read()
    p.stdout.close()
    p.stderr.close()


Solution:

Running ansible-playbook under nohup causes a file descriptor leak.

The right way to do it is to redirect stdin, stdout and stderr yourself:

ansible-playbook -i inventory main.yml  -k -K -U root -u test -s -f 10 >ansible_log 2>&1 </dev/null

Alternatively, you could use “screen” to keep the session alive.
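
The three redirections in the ansible-playbook command above can also be set up from a small Python 3 wrapper. This is only a sketch: run_with_redirects is a hypothetical helper, and the demo launches a harmless stand-in command rather than ansible-playbook.

```python
import os
import subprocess
import sys
import tempfile


def run_with_redirects(cmd, log_path):
    """Start cmd with stdin from /dev/null and stdout+stderr written to log_path."""
    with open(log_path, "wb") as log:
        # The child gets its own copies of the descriptors, so the parent
        # can close the log file as soon as Popen returns.
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # same as </dev/null
            stdout=log,                 # same as >ansible_log
            stderr=subprocess.STDOUT,   # same as 2>&1
        )


# Demo with a stand-in command; the log path is a temporary file.
demo_log = os.path.join(tempfile.mkdtemp(), "ansible_log")
proc = run_with_redirects([sys.executable, "-c", "print('done')"], demo_log)
exit_code = proc.wait()
with open(demo_log) as fh:
    logged = fh.read()
```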

Practice2

Problem:

Imagine that we change the Hadoop configuration files and then need to restart the Hadoop service.

We do not want to restart the whole cluster at once, so we need a rolling restart.

Let's say 10 servers per batch.

How do we do this?

Solution:

Add "serial: $NUM" to the main playbook. Sample:

cat main.yml

---
- hosts: upgrade_rack
  gather_facts: yes
  vars_files:
    - vars.yml
  pre_tasks:
    - include: turn_off_monitor.yml
  tasks:
    - include: pre_upgrade.yml
    - include: upgrade.yml
    - include: post_upgrade.yml
  post_tasks:
    - include: turn_on_monitor.yml
  serial: 10
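
With serial set, Ansible runs the whole play for one batch of hosts before it moves on to the next batch. The batching itself can be pictured with the following Python sketch. It is an illustration only, not Ansible's code, and the host names are made up.

```python
def serial_batches(hosts, size):
    """Split hosts into consecutive batches of at most `size` hosts."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


# 25 hypothetical host names, processed 10 at a time.
hosts = ["node%02d" % n for n in range(1, 26)]
batches = serial_batches(hosts, 10)
batch_sizes = [len(batch) for batch in batches]

for number, batch in enumerate(batches, start=1):
    print("batch %d: %s ... %s" % (number, batch[0], batch[-1]))
```

The last batch simply holds whatever hosts remain, so the cluster is never restarted all at once.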

Practice3

Problem:

We know that Ansible is used to deploy things to remote hosts, but sometimes we want to log in to other servers and do something there while the playbook tasks are running.

How do we do this?


Solution:

Ansible provides the "delegate_to" feature for this. Ansible hands the task over to the delegated host to execute the command, and then returns to the original host.

Sample:

- name: "Refresh nodes on resourcemanager"
  shell: "yarn rmadmin -refreshNodes"
  delegate_to: "example.com"
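
One detail that is easy to miss: a delegated task still runs once for every host in the play, and variables such as inventory_hostname keep referring to the original host. The toy Python model below illustrates this. It is not how Ansible is implemented, and the host names and command template are hypothetical.

```python
def run_delegated(play_hosts, delegate, command_template):
    """Model a delegated task: one run per play host, each executed on `delegate`."""
    runs = []
    for inventory_hostname in play_hosts:
        runs.append({
            "executed_on": delegate,                    # where the command runs
            "inventory_hostname": inventory_hostname,   # host the run belongs to
            "command": command_template.format(host=inventory_hostname),
        })
    return runs


runs = run_delegated(["dn1", "dn2", "dn3"], "example.com", "echo {host}")
for run in runs:
    print(run)
```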

Practice4

Problem:

Sometimes we need to change files when using Ansible. Ansible provides several modules for this, such as “lineinfile”, “replace” and “blockinfile”.

Let's consider a more complex case: assume we use the “replace” module to modify a configuration file on the same server with forks set to 10.

What will happen?

The configuration file will likely be messed up, because multiple processes write to it at the same time.

- name: "Add hosts into mapred-exclude"
  replace: dest=mapred-exclude regexp='\Z' replace='{{inventory_hostname}}\n' owner=hadoop group=hadoop mode=644 backup=yes
  delegate_to: "example.com"


Solution:

We could add a file lock to the source code of the “replace” module, and release the lock after writing to the file.

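import fcntl

f = open(dest, 'rb+')
# Take an exclusive lock; other processes block here until it is released.
fcntl.flock(f, fcntl.LOCK_EX)
contents = f.read()
# Placeholder for the module's own substitution logic; result[0] is the new text.
result = do_something_to_contents
f.seek(0)
f.write(result[0])
f.truncate()
# Release the lock only after the new contents have been written.
fcntl.flock(f, fcntl.LOCK_UN)
f.close()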
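
The same read-modify-write pattern can be tried outside the module with a self-contained Python 3 sketch. It assumes a Unix system (fcntl is not available on Windows). The helper rewrite_locked and the sample host names are hypothetical and are not part of the “replace” module.

```python
import fcntl
import os
import tempfile


def rewrite_locked(path, transform):
    """Read path, apply transform(text) and write the result back under an exclusive lock."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # blocks while another process holds the lock
        try:
            contents = f.read()
            f.seek(0)
            f.write(transform(contents))
            f.truncate()
            f.flush()                      # push the data out before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


# Demo: append two hypothetical host names, one call at a time.
demo_path = os.path.join(tempfile.mkdtemp(), "mapred-exclude")
with open(demo_path, "w") as fh:
    fh.write("")
for name in ("dn1.example.com", "dn2.example.com"):
    rewrite_locked(demo_path, lambda text, name=name: text + name + "\n")
with open(demo_path) as fh:
    final_text = fh.read()
```

Note that flock() is advisory: it only protects the file if every writer takes the lock before touching it.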

Practice5

Problem:

When running ansible-playbook against a batch of hosts, we often hit the following error in the “gather facts” step:

failed: [example.com] => {"cmd": "/bin/lsblk -ln --output UUID /dev/sdn1", "failed": true, "rc": 257}
msg: Traceback (most recent call last):
 …………………
TimeoutError: Timer expired


Solution:

The reason is that the timeout for the get_mount_facts function in /usr/lib/python2.6/site-packages/ansible/module_utils/facts.py is hardcoded to 10 seconds.

Hadoop nodes often have high I/O load, so disks may be slow to respond and 10 seconds is not enough.

This problem was fixed in Ansible 2.2 by introducing the gather_timeout parameter.
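
To see why a configurable limit helps, here is a sketch of the general pattern of a time limit around a slow call, with the limit passed in rather than hardcoded. It is not the code in facts.py. The names call_with_timeout and TimerExpired are hypothetical, and the sketch assumes a Unix system and the main thread, because it relies on SIGALRM.

```python
import signal
import time


class TimerExpired(Exception):
    """Raised when the wrapped call runs longer than allowed."""


def call_with_timeout(func, seconds):
    """Run func() and raise TimerExpired if it takes longer than `seconds`."""
    def _on_alarm(signum, frame):
        raise TimerExpired("Timer expired")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func()
    finally:
        # Always cancel the timer and restore the old handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


# A slow call (standing in for a busy disk) fails with a short limit
# and succeeds once the limit is raised.
try:
    call_with_timeout(lambda: time.sleep(0.5), 0.1)
    short_limit_ok = True
except TimerExpired:
    short_limit_ok = False
long_limit_ok = call_with_timeout(lambda: time.sleep(0.1) or True, 2.0)
```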




