When studying a middleware component, the first thing to learn is how to use it (locks, leader election), so let's start by reading the source code on the client/application side:
Curator: wraps the native ZooKeeper API and provides a higher-level abstraction than zkClient.
1. Distributed lock with Curator:
Usage:
CuratorFramework curatorFramework = CuratorFrameworkFactory.builder()
.connectString(CONNECTION_STR)
.retryPolicy(new RetryForever(3))
.sessionTimeoutMs(3000)
.build();
curatorFramework.start();
//create the lock
InterProcessMutex lock = new InterProcessMutex(curatorFramework, "/lock");
try
{
    lock.acquire(); // acquire the lock (blocks until it is obtained)
    System.out.println(Thread.currentThread().getName() + " acquired the lock");
    try
    {
        Thread.sleep(300000); // simulate work while holding the lock
    }
    finally
    {
        lock.release(); // release the lock even if the work throws
        System.out.println(Thread.currentThread().getName() + " released the lock");
    }
}
catch (Exception e)
{
    e.printStackTrace();
}
Now let's dig into the Curator source; it is actually fairly straightforward.
The InterProcessMutex constructor essentially just builds an internal helper object (LockInternals).
Acquiring the lock, lock.acquire(): acquire -> internalLock -> attemptLock -> createsTheLock / internalLockLoop -> getsTheLock
//acquire the lock; the API mirrors the JDK Lock interface
public void acquire() throws Exception
{
if ( !internalLock(-1, null) )
{
throw new IOException("Lost connection while trying to acquire lock: " + basePath);
}
}
==>internalLock
private boolean internalLock(long time, TimeUnit unit) throws Exception
{
/*
Note on concurrency: a given lockData instance
can be only acted on by a single thread so locking isn't necessary
*/
Thread currentThread = Thread.currentThread();
LockData lockData = threadData.get(currentThread);
// a ConcurrentHashMap (threadData) keyed by thread is what makes the lock reentrant
if ( lockData != null )
{
// re-entering
lockData.lockCount.incrementAndGet();
return true;
}
// core method: attemptLock
String lockPath = internals.attemptLock(time, unit, getLockNodeBytes());
if ( lockPath != null )
{
LockData newLockData = new LockData(currentThread, lockPath);
threadData.put(currentThread, newLockData);
return true;
}
return false;
}
===> attemptLock
String attemptLock(long time, TimeUnit unit, byte[] lockNodeBytes) throws Exception
{
final long startMillis = System.currentTimeMillis();
final Long millisToWait = (unit != null) ? unit.toMillis(time) : null;
final byte[] localLockNodeBytes = (revocable.get() != null) ? new byte[0] : lockNodeBytes;
int retryCount = 0;
String ourPath = null;
boolean hasTheLock = false;
boolean isDone = false;
while ( !isDone )
{
isDone = true;
try
{
// create an ephemeral sequential node that represents this lock request
ourPath = driver.createsTheLock(client, path, localLockNodeBytes);
// based on this node's index among the sorted children and maxLeases, the smallest index gets the lock
hasTheLock = internalLockLoop(startMillis, millisToWait, ourPath);
}
catch ( KeeperException.NoNodeException e )
{
// gets thrown by StandardLockInternalsDriver when it can't find the lock node
// this can happen when the session expires, etc. So, if the retry allows, just try it all again
if (
client.getZookeeperClient().getRetryPolicy().allowRetry(retryCount++,
System.currentTimeMillis() - startMillis,
RetryLoop.getDefaultRetrySleeper()) )
{
isDone = false;
}
else
{
throw e;
}
}
}
if ( hasTheLock )
{
return ourPath;
}
return null;
}
===> createsTheLock: create an EPHEMERAL_SEQUENTIAL node
public String createsTheLock(CuratorFramework client, String path, byte[] lockNodeBytes) throws Exception
{
String ourPath;
if ( lockNodeBytes != null )
{
ourPath = client.create().creatingParentContainersIfNeeded().withProtection().withMode(CreateMode.EPHEMERAL_SEQUENTIAL).forPath(path, lockNodeBytes);
}
else
{
ourPath = client.create().creatingParentContainersIfNeeded().withProtection().withMode(CreateMode.EPHEMERAL_SEQUENTIAL).forPath(path);
}
return ourPath;
}
===> internalLockLoop: the internal acquire loop
private boolean internalLockLoop(long startMillis, Long millisToWait, String ourPath) throws Exception
{
boolean haveTheLock = false;
boolean doDelete = false;
try
{
if ( revocable.get() != null )
{
client.getData().usingWatcher(revocableWatcher).forPath(ourPath);
}
while ( (client.getState() == CuratorFrameworkState.STARTED) && !haveTheLock )
{
// children sorted by sequence number
List<String> children = getSortedChildren();
// strip the base path to get this node's sequence name
String sequenceNodeName =
ourPath.substring(basePath.length() + 1); // +1 to include the slash
// decide based on sequenceNodeName's position in the sorted list
PredicateResults predicateResults = driver.getsTheLock(client, children, sequenceNodeName, maxLeases);
if ( predicateResults.getsTheLock() )
{
haveTheLock = true; // this node has the smallest index, so it owns the lock
}
else
{
//didn't get the lock; watch the previous node and wait
String previousSequencePath = basePath + "/" + predicateResults.getPathToWatch();
synchronized(this)
{
try
{
// use getData() instead of exists() to avoid leaving unneeded watchers which is a type of resource leak
client.getData().usingWatcher(watcher).forPath(previousSequencePath);
if ( millisToWait != null )
{
//wait at most the remaining time
millisToWait -= (System.currentTimeMillis() - startMillis);
startMillis = System.currentTimeMillis();
if ( millisToWait <= 0 )
{
doDelete = true; // timed out - delete our node
break;
}
wait(millisToWait);
}
else
{
//wait indefinitely; the registered watcher calls notifyAll when the watched lock node changes
wait();
}
}
catch ( KeeperException.NoNodeException e )
{
// it has been deleted (i.e. lock released). Try to acquire again
}
}
}
}
}
catch ( Exception e )
{
ThreadUtils.checkInterrupted(e);
doDelete = true;
throw e;
}
finally
{
if ( doDelete )
{
deleteOurPath(ourPath);
}
}
return haveTheLock;
}
===> getsTheLock
@Override
public PredicateResults getsTheLock(CuratorFramework client, List<String> children, String sequenceNodeName, int maxLeases) throws Exception
{
int ourIndex = children.indexOf(sequenceNodeName);
validateOurIndex(sequenceNodeName, ourIndex);
boolean getsTheLock = ourIndex < maxLeases;
String pathToWatch = getsTheLock ? null : children.get(ourIndex - maxLeases);
return new PredicateResults(pathToWatch, getsTheLock);
}
Releasing the lock: release
public void release() throws Exception
{
/*
Note on concurrency: a given lockData instance
can be only acted on by a single thread so locking isn't necessary
*/
Thread currentThread = Thread.currentThread();
LockData lockData = threadData.get(currentThread);
if ( lockData == null )
{
throw new IllegalMonitorStateException("You do not own the lock: " + basePath);
}
//the lock is reentrant, so the node is only deleted after every nested acquire has been released
int newLockCount = lockData.lockCount.decrementAndGet();
if ( newLockCount > 0 )
{
return;
}
if ( newLockCount < 0 )
{
throw new IllegalMonitorStateException("Lock count has gone negative for lock: " + basePath);
}
try
{
// actually release: delete the znode
internals.releaseLock(lockData.lockPath);
}
finally
{
threadData.remove(currentThread);
}
}
====> releaseLock
void releaseLock(String lockPath) throws Exception
{
revocable.set(null);
deleteOurPath(lockPath);
}
//the essence: delete this client's lock node
private void deleteOurPath(String ourPath) throws Exception
{
try
{
client.delete().guaranteed().forPath(ourPath); //delete the lock node
}
catch ( KeeperException.NoNodeException e )
{
// ignore - already deleted (possibly expired session, etc.)
}
}
To summarize: acquiring the lock means creating an ephemeral sequential node under the lock path; per maxLeases (default 1), only the node with the smallest index owns the lock. Every other node calls wait() and relies on the watch set on its predecessor; when that watch fires, notifyAll wakes it up to re-check its position.
Releasing the lock: once every reentrant acquire has been decremented, the ephemeral sequential node is deleted, which fires the watch on the next waiter and wakes it up.
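Seen from the caller's side, the same behaviour looks like this. A minimal sketch, assuming the same curatorFramework instance and "/lock" path as the example above (the 3-second timeout is arbitrary); it shows the timed acquire(time, unit) overload and reentrancy within one thread:
// requires java.util.concurrent.TimeUnit
InterProcessMutex mutex = new InterProcessMutex(curatorFramework, "/lock");
if (mutex.acquire(3, TimeUnit.SECONDS)) // timed acquire: returns false instead of blocking forever
{
    try
    {
        mutex.acquire(); // re-entering from the same thread only bumps lockCount; no new znode is created
        try
        {
            System.out.println("doing work while holding /lock");
        }
        finally
        {
            mutex.release(); // lockCount 2 -> 1, znode kept
        }
    }
    finally
    {
        mutex.release(); // lockCount 1 -> 0, ephemeral sequential znode deleted
    }
}
else
{
    System.out.println("could not obtain /lock within 3 seconds");
}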
That covers roughly how Curator acquires a lock. Besides distributed locks, ZooKeeper can back other distributed coordination services; leader election is another one (Kafka used ZooKeeper internally for its controller election).
Usage:
public class LeaderSelectorClient extends LeaderSelectorListenerAdapter implements Closeable {
    private final String name;
    private LeaderSelector leaderSelector;

    public LeaderSelectorClient(String name) {
        this.name = name;
    }

    public void setLeaderSelector(LeaderSelector leaderSelector) {
        this.leaderSelector = leaderSelector;
    }

    public void start() {
        leaderSelector.start(); // begin participating in the election
    }

    @Override
    public void takeLeadership(CuratorFramework client) throws Exception {
        // Entering this method means this process has acquired the lock; the method is called back after the lock is obtained.
        // When this method returns, leadership is relinquished.
        System.out.println(name + " -> is now the leader");
        // countDownLatch.await(); // block here to keep holding leadership
    }

    @Override
    public void close() throws IOException {
        leaderSelector.close();
    }

    public static void main(String[] args) throws IOException {
        CuratorFramework curatorFramework = CuratorFrameworkFactory.builder().
                connectString(CONNECTION_STR).sessionTimeoutMs(50000000).
                retryPolicy(new ExponentialBackoffRetry(1000, 3)).build();
        curatorFramework.start();
        LeaderSelectorClient leaderSelectorClient = new LeaderSelectorClient("ClientA");
        LeaderSelector leaderSelector = new LeaderSelector(curatorFramework, "/leader", leaderSelectorClient);
        leaderSelectorClient.setLeaderSelector(leaderSelector);
        leaderSelectorClient.start(); // start the election
        System.in.read();
    }
}
Core method: leaderSelector.start()
Main call path:
requeue() -> internalRequeue() -> doWork():
void doWork() throws Exception
{
hasLeadership = false;
try
{
//leader election is just competing for the lock; whoever gets it becomes the leader
mutex.acquire();
hasLeadership = true;
try
{
if ( debugLeadershipLatch != null )
{
debugLeadershipLatch.countDown();
}
if ( debugLeadershipWaitLatch != null )
{
debugLeadershipWaitLatch.await();
}
//the thread that got the lock calls our overridden method; when it returns, the lock is released, so block here to hold leadership
listener.takeLeadership(client);
}
catch ( InterruptedException e )
{
Thread.currentThread().interrupt();
throw e;
}
catch ( Throwable e )
{
ThreadUtils.checkInterrupted(e);
}
finally
{
clearIsQueued();
}
}
catch ( InterruptedException e )
{
Thread.currentThread().interrupt();
throw e;
}
finally
{
if ( hasLeadership )
{
hasLeadership = false;
try
{
mutex.release(); //release the lock
}
catch ( Exception e )
{
ThreadUtils.checkInterrupted(e);
log.error("The leader threw an exception", e);
// ignore errors - this is just a safety
}
}
}
}
Summary: Curator's leader election on ZooKeeper reuses the distributed-lock mechanism; the first node registered becomes the leader. When takeLeadership returns, the leader's lock node is released.
The ZAB protocol:
ZooKeeper borrows ideas from the Paxos protocol (as used by Google's Chubby) to provide a distributed consistency service.
ZAB has two modes:
crash recovery and message broadcast
Message broadcast:
It is essentially a two-phase commit (2PC); many systems use the same idea (e.g. MySQL's redo log).
Steps: 1. The leader receives a transaction request, writes it to its local transaction log, and generates a monotonically increasing zxid (the high 32 bits are the epoch, the low 32 bits a counter; see the sketch after these steps)
2. It sends a proposal carrying the zxid to all followers
3. Each follower logs the proposal locally, prepares to commit, and sends an ACK back to the leader
4. Once the leader has ACKs from more than half the ensemble, it commits locally and sends COMMIT to all followers
5. On receiving COMMIT, each follower commits the transaction, completing the sync
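As a small illustration of the zxid layout in step 1 (this is only a sketch of the bit arithmetic, not ZooKeeper code; the epoch and counter values are made up):
long epoch = 5;       // hypothetical current leader epoch
long counter = 42;    // hypothetical transaction counter within this epoch
long zxid = (epoch << 32) | (counter & 0xFFFFFFFFL); // high 32 bits: epoch, low 32 bits: counter

long epochOfZxid = zxid >>> 32;           // 5 again
long counterOfZxid = zxid & 0xFFFFFFFFL;  // 42 again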
Crash recovery:
It does two things: 1. elect a leader; 2. synchronize data.
Leader election source code:
Find the program entry point:
zkServer.sh starts QuorumPeerMain, so open that class.
1. main.initializeAndRun(args);
2. Load the config file into config:
QuorumPeerConfig config = new QuorumPeerConfig();
if (args.length == 1) {
config.parse(args[0]);
}
If the config file defines servers, run in quorum (cluster) mode:
if (args.length == 1 && config.servers.size() > 0) {
runFromConfig(config); // with a single config-file argument and configured servers, this branch runs
}
3. public void runFromConfig(QuorumPeerConfig config) throws IOException {
try {
ManagedUtil.registerLog4jMBeans();
} catch (JMException e) {
LOG.warn("Unable to register log4j JMX control", e);
}
LOG.info("Starting quorum peer");
try {
//build the ServerCnxnFactory used to accept client connections
ServerCnxnFactory cnxnFactory = ServerCnxnFactory.createFactory();
cnxnFactory.configure(config.getClientPortAddress(),
config.getMaxClientCnxns());
quorumPeer = getQuorumPeer();
//getView()
quorumPeer.setQuorumPeers(config.getServers()); //the server.N entries parsed from zoo.cfg
quorumPeer.setTxnFactory(new FileTxnSnapLog(
new File(config.getDataLogDir()),
new File(config.getDataDir())));
quorumPeer.setElectionType(config.getElectionAlg());
quorumPeer.setMyid(config.getServerId());
quorumPeer.setTickTime(config.getTickTime());
.......
quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
quorumPeer.initialize();
quorumPeer.start(); //everything above just configures the QuorumPeer object; this is the core
quorumPeer.join(); // the main thread waits for the QuorumPeer thread to exit
} catch (InterruptedException e) {
// warn, but generally this is ok
LOG.warn("Quorum Peer interrupted", e);
}
}
4. Load and initialize state:
public synchronized void start() {
loadDataBase(); //load data from the snapshot/transaction log files
cnxnFactory.start(); //NIO or Netty, handles client communication -> listens on the client port (2181)
startLeaderElection(); //start leader election -> start the vote listener and initialize the election algorithm
super.start(); //QuorumPeer extends Thread, so Thread.start() ends up in QuorumPeer.run()
}
5. start() is a Thread method, so run() eventually executes:
public void run() {
....
try {
/*
* Main loop
* (an endless loop while the server is running)
*/
while (running) {
switch (getPeerState()) {//on first startup the state is LOOKING
case LOOKING:
LOG.info("LOOKING");
if (Boolean.getBoolean("readonlymode.enabled")) {
.....readonly.....
} else {
try {
setBCVote(null);
//setCurrentVote -> records which server has been determined to be the leader
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
setPeerState(ServerState.LOOKING);
}
}
break;
case OBSERVING:
......
break;
case FOLLOWING:
......
break;
case LEADING:
......
break;
}
}
}
}
6. lookForLeader() ultimately determines the leader; this is the key method:
public Vote lookForLeader() throws InterruptedException {
try {
//votes received from other servers in the current election round
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
//votes from servers that are already FOLLOWING or LEADING, i.e. outside the current election
HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
int notTimeout = finalizeWait;
synchronized(this){
//logical clock -> the election epoch
logicalclock.incrementAndGet();
//initial proposal: vote for ourselves
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
sendNotifications();//broadcast my own vote: put it on the send queue to be sent out
/*
* Loop in which we exchange notifications until we find a leader
*/
//keep receiving votes while still in the LOOKING state
while ((self.getPeerState() == ServerState.LOOKING) &&
(!stop)){
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
//recvqueue holds the Notifications received from other servers over the network
Notification n = recvqueue.poll(notTimeout,
TimeUnit.MILLISECONDS);
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
* If nothing was received, resend once to guard against transient network problems.
*/
if(n == null){
if(manager.haveDelivered()){
sendNotifications();
} else {
manager.connectAll();//reconnect to all servers in the cluster
}
/*
* Exponential backoff
*/
int tmpTimeOut = notTimeout*2;
notTimeout = (tmpTimeOut < maxNotificationInterval?
tmpTimeOut : maxNotificationInterval);
LOG.info("Notification time out: " + notTimeout);
}
else if(validVoter(n.sid) && validVoter(n.leader)) {//only process votes where both the sender and the proposed leader are valid voters
/*
* Only proceed if the vote comes from a replica in the
* voting view for a replica in the voting view.
*/
switch (n.state) {
case LOOKING: //on first startup every server hits this case
// If notification > current, replace and send messages out
// the received electionEpoch is newer than ours: switch to that round and re-evaluate
if (n.electionEpoch > logicalclock.get()) {
logicalclock.set(n.electionEpoch);
recvset.clear();//discard votes from the stale round
//after receiving a vote, decide whose vote this server should follow:
//it might side with server1, server2, or server3
//the ZAB leader-election comparison
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
//adopt the other server's vote as our own, so the next notification we send carries the new vote
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
//the received vote loses to this node's vote; keep sending our own vote next time
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
//broadcast notifications again
sendNotifications();
} else if (n.electionEpoch < logicalclock.get()) { //the received vote belongs to an older round and is stale
if(LOG.isDebugEnabled()){
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
}
break;
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
sendNotifications();
}
if(LOG.isDebugEnabled()){
LOG.debug("Adding vote: from=" + n.sid +
", proposed leader=" + n.leader +
", proposed zxid=0x" + Long.toHexString(n.zxid) +
", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
}
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
//decision point: check whether our current proposal has a quorum among the votes collected in recvset
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock.get(), proposedEpoch))) {
// Verify if there is any change in the proposed leader
while((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null){
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)){
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
if (n == null) {
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid,
logicalclock.get(),
proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
break;
case OBSERVING:
LOG.debug("Notification from observer: " + n.sid);
break;
case FOLLOWING:
case LEADING:
/*
* Consider all notifications from the same epoch
* together.
*/
if(n.electionEpoch == logicalclock.get()){
recvset.put(n.sid, new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch));
if(ooePredicate(recvset, outofelection, n)) {
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
}
/*
* Before joining an established ensemble, verify
* a majority is following the same leader.
*/
outofelection.put(n.sid, new Vote(n.version,
n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch,
n.state));
if(ooePredicate(outofelection, outofelection, n)) {
synchronized(this){
logicalclock.set(n.electionEpoch);
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
}
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
break;
default:
LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
n.state, n.sid);
break;
}
} else {
if (!validVoter(n.leader)) {
LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
}
if (!validVoter(n.sid)) {
LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
}
}
}
return null;
} finally {
try {
if(self.jmxLeaderElectionBean != null){
MBeanRegistry.getInstance().unregister(
self.jmxLeaderElectionBean);
}
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
self.jmxLeaderElectionBean = null;
LOG.debug("Number of connection processing threads: {}",
manager.getConnectionThreadCount());
}
}
7. totalOrderPredicate: how two votes are compared
The core logic: compare epoch first; if equal, compare zxid; if still equal, compare myid (server.id, which is configured manually, so it can never tie):
((newEpoch > curEpoch) ||
((newEpoch == curEpoch) &&
((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
8. When does voting terminate? termPredicate(): a candidate backed by more than half the votes becomes leader.
(set.size() > half); once some candidate's votes exceed half of the ensemble, it wins outright.
That is roughly how ZooKeeper elects its own leader.
The idea: in the first round every server votes for itself, then votes are compared against each other; the winning vote collects the other servers' support, and once it has more than half it is the leader. A minimal simulation of this comparison and quorum check is sketched below.
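A minimal, self-contained simulation of that comparison and quorum rule (illustration only: the Vote record, the vote values, and the wins helper are invented here and simply mirror the logic quoted above; this is not ZooKeeper code):
import java.util.HashMap;
import java.util.Map;

public class ElectionSketch {
    record Vote(long leaderId, long zxid, long epoch) {}

    // mirrors the totalOrderPredicate expression quoted in step 7
    static boolean wins(Vote candidate, Vote current) {
        return (candidate.epoch() > current.epoch())
                || (candidate.epoch() == current.epoch()
                    && (candidate.zxid() > current.zxid()
                        || (candidate.zxid() == current.zxid()
                            && candidate.leaderId() > current.leaderId())));
    }

    public static void main(String[] args) {
        int ensembleSize = 3;
        Vote myVote = new Vote(1, 90, 1);    // first round: vote for myself
        Vote received = new Vote(3, 100, 1); // a vote received from another server

        if (wins(received, myVote)) {
            myVote = received;               // adopt the stronger vote, like updateProposal(...)
        }

        // received votes, keyed by sender sid (like recvset); here everyone ends up agreeing
        Map<Long, Vote> recvset = new HashMap<>();
        recvset.put(1L, myVote);
        recvset.put(2L, new Vote(3, 100, 1));
        recvset.put(3L, new Vote(3, 100, 1));

        // mirrors termPredicate: does our proposal have more than half the ensemble behind it?
        Vote proposed = myVote;
        long supporters = recvset.values().stream().filter(proposed::equals).count();
        System.out.println("server 3 elected: " + (supporters > ensembleSize / 2));
    }
}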
Network interaction diagram:
Send flow: 1. enqueue our own vote: sendqueue.offer(notmsg)
2. WorkerSender keeps polling the queue: sendqueue.poll()
3. Based on the sid, decide whether the vote is addressed to ourselves or to another server; our own vote goes straight onto the receive queue, otherwise it goes out over a socket via queueSendMap
4. SendWorker.start();
5. write + flush over the socket
Receive flow:
1. RecvWorker's run() loop listens on the socket
2. recvQueue.add(msg); the message is placed on recvQueue
3. During leader election, recvqueue.poll() pulls the notification out for comparison (a sketch of this queue handoff follows)
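The exchange between the election logic and the network workers is a plain producer/consumer handoff over blocking queues. A minimal sketch of that pattern (the class and variable names are invented; this is not QuorumCnxManager code):
import java.util.concurrent.LinkedBlockingQueue;

public class QueueHandoffSketch {
    // stand-ins for sendqueue / recvqueue in FastLeaderElection
    static final LinkedBlockingQueue<String> sendQueue = new LinkedBlockingQueue<>();
    static final LinkedBlockingQueue<String> recvQueue = new LinkedBlockingQueue<>();
    static final long MY_SID = 1;

    public static void main(String[] args) throws InterruptedException {
        // "WorkerSender": drains sendQueue; a vote addressed to ourselves is looped straight back
        Thread workerSender = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    String vote = sendQueue.take();
                    long targetSid = MY_SID; // simplification: everything targets ourselves here
                    if (targetSid == MY_SID) {
                        recvQueue.add(vote); // loop back, no socket involved
                    } else {
                        // otherwise: hand the message to the per-peer socket queue (like queueSendMap)
                    }
                }
            } catch (InterruptedException ignored) { }
        });
        workerSender.setDaemon(true);
        workerSender.start();

        sendQueue.offer("vote: leader=1, zxid=0x100");        // election thread broadcasts its vote
        System.out.println("received: " + recvQueue.take());  // election thread polls received votes
    }
}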
Above we looked at what happens in the LOOKING state. Once the leader has been elected there is still more to do; a quick look:
makeFollower: builds a FollowerZooKeeperServer
follower.followLeader():
Key calls:
connectToLeader(leaderServer.addr, leaderServer.hostname); — connect to the leader and set up the socket
syncWithLeader(newEpochZxid); — synchronize data from the leader
For a follower, syncing with the leader means handling each sync command the leader sends differently (a sketch of that dispatch follows).
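A rough sketch of that dispatch (DIFF / TRUNC / SNAP match the sync strategies used between leader and follower, but the enum and handler methods here are invented for illustration; this is not ZooKeeper code):
class FollowerSyncSketch {
    enum SyncCommand { DIFF, TRUNC, SNAP }

    void syncWithLeader(SyncCommand cmd) {
        switch (cmd) {
            case DIFF:  // leader is ahead: replay the missing proposals incrementally
                applyDiffFromLeader();
                break;
            case TRUNC: // follower has proposals the leader never committed: truncate its log first
                truncateLocalLog();
                break;
            case SNAP:  // too far behind: discard local state and load a full snapshot
                loadSnapshotFromLeader();
                break;
        }
    }

    private void applyDiffFromLeader()    { /* apply proposals one by one */ }
    private void truncateLocalLog()       { /* roll the log back to the leader's zxid */ }
    private void loadSnapshotFromLeader() { /* deserialize a full snapshot */ }
}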
How does the server handle requests coming from clients?
Request flow: once the ServerCnxn has established the socket connection, it listens for read and write events and reacts accordingly.
At this point the request buffer has been read; processing begins:
// after a request is received
Request si = new Request(cnxn, cnxn.getSessionId(), h.getXid(),
h.getType(), incomingBuffer, cnxn.getAuthInfo());
si.setOwner(ServerCnxn.me);
submitRequest(si); // submit the request into the processor chain
public void submitRequest(Request si) {
......
firstProcessor.processRequest(si);
.....
}
How firstProcessor is built:
protected void setupRequestProcessors() {
// build the processor chain:
RequestProcessor finalProcessor = new FinalRequestProcessor(this);
RequestProcessor syncProcessor = new SyncRequestProcessor(this,
finalProcessor);
((SyncRequestProcessor)syncProcessor).start();
firstProcessor = new PrepRequestProcessor(this, syncProcessor);
((PrepRequestProcessor)firstProcessor).start();
}
From this we can see that firstProcessor handles requests through a chain of responsibility, built as shown above:
PrepRequestProcessor(SyncRequestProcessor(FinalRequestProcessor))
PrepRequestProcessor: mainly deserializes the request header and prepares the stat information such as zxid, cversion, and ephemeralOwner
SyncRequestProcessor: writes the request to the transaction log / snapshot on disk for durability (ZooKeeper keeps all data in memory)
FinalRequestProcessor: applies the transaction according to the request type (create, delete, setData, ...) and triggers watches; a minimal chain-of-responsibility sketch follows
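A minimal chain-of-responsibility sketch mirroring that structure (the Processor interface and Request class here are simplified stand-ins, not ZooKeeper's actual RequestProcessor API):
public class ProcessorChainSketch {
    static class Request { String type; Request(String type) { this.type = type; } }

    interface Processor { void process(Request r); }

    // each processor does its own step, then hands the request to the next one
    static class PrepProcessor implements Processor {
        private final Processor next;
        PrepProcessor(Processor next) { this.next = next; }
        public void process(Request r) { System.out.println("prepare " + r.type); next.process(r); }
    }
    static class SyncProcessor implements Processor {
        private final Processor next;
        SyncProcessor(Processor next) { this.next = next; }
        public void process(Request r) { System.out.println("log " + r.type + " to disk"); next.process(r); }
    }
    static class FinalProcessor implements Processor {
        public void process(Request r) { System.out.println("apply " + r.type + ", fire watches"); }
    }

    public static void main(String[] args) {
        // same nesting as PrepRequestProcessor(SyncRequestProcessor(FinalRequestProcessor))
        Processor first = new PrepProcessor(new SyncProcessor(new FinalProcessor()));
        first.process(new Request("create"));
    }
}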
Finally, a hand-drawn diagram of the ZooKeeper source flow; it may have small inaccuracies, but the overall direction should be right, so feel free to use it as a reference.