Cassandra LCS Compaction Internals

Cassandra's compaction is driven by a recurring background task that the CassandraDaemon class schedules during its setup() method.

The scheduled task registered in CassandraDaemon.setup():

ScheduledExecutors.optionalTasks.scheduleWithFixedDelay(ColumnFamilyStore.getBackgroundCompactionTaskSubmitter(), 5, 1, TimeUnit.MINUTES);

As the code shows, the first compaction check runs 5 minutes after startup, and further checks follow at 1-minute intervals:

public static Runnable getBackgroundCompactionTaskSubmitter()
{
    return new Runnable()
    {
        public void run()
        {
            for (Keyspace keyspace : Keyspace.all())
                for (ColumnFamilyStore cfs : keyspace.getColumnFamilyStores())
                    CompactionManager.instance.submitBackground(cfs);
        }
    };
}
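Note that scheduleWithFixedDelay measures the 1-minute delay from the end of one run to the start of the next, so a long-running check pushes subsequent checks back. Below is a standalone demo of these semantics, using seconds instead of minutes so the effect is quickly visible; the class name and output format are mine, not Cassandra's.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedDelayDemo
{
    public static void main(String[] args) throws InterruptedException
    {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        long start = System.nanoTime();
        // initial delay 5s, then 1s measured from the END of the previous run
        timer.scheduleWithFixedDelay(
            () -> System.out.printf("check at t=%ds%n", (System.nanoTime() - start) / 1_000_000_000L),
            5, 1, TimeUnit.SECONDS);
        Thread.sleep(10_000);
        timer.shutdownNow();
    }
}

Each tick walks every keyspace and table and hands each one to CompactionManager.submitBackground: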


/**
 * Call this whenever a compaction might be needed on the given columnfamily.
 * It's okay to over-call (within reason) if a call is unnecessary, it will
 * turn into a no-op in the bucketing/candidate-scan phase.
 */
public List<Future<?>> submitBackground(final ColumnFamilyStore cfs)
{
    if (cfs.isAutoCompactionDisabled()) // skip if auto-compaction has been disabled for this table
    {
        logger.trace("Autocompaction is disabled");
        return Collections.emptyList();
    }
    
    /**
     * If this CF is already being compacted and there are no idle threads,
     * wait for the next timer tick to resubmit it, when threads may be free;
     * otherwise submit at least one task, so that busier CFs cannot starve
     * this one of compaction indefinitely (CF starvation).
     **/
    int count = compactingCF.count(cfs);
    if (count > 0 && executor.getActiveCount() >= executor.getMaximumPoolSize())
    { // already compacting and no idle threads left in the pool, so skip this round
        logger.trace("Background compaction is still running for {}.{} ({} remaining). Skipping",
                     cfs.keyspace.getName(), cfs.name, count);
        return Collections.emptyList();
    }

    logger.trace("Scheduling a background task check for {}.{} with {}",
                 cfs.keyspace.getName(),
                 cfs.name,
                 cfs.getCompactionStrategyManager().getName());

    List<Future<?>> futures = new ArrayList<>(1);
    Future<?> fut = executor.submitIfRunning(new BackgroundCompactionCandidate(cfs), "background task");
    // submit one compaction check regardless, so that this CF cannot starve
    if (!fut.isCancelled())
        futures.add(fut);
    else
        compactingCF.remove(cfs);
    return futures;
}

public void run()
{
    try
    {
        logger.trace("Checking {}.{}", cfs.keyspace.getName(), cfs.name);
        if (!cfs.isValid()) // a dropped CF must no longer be compacted
        {
            logger.trace("Aborting compaction for dropped CF");
            return;
        }
        
        // first, look up this table's compaction strategy
        CompactionStrategyManager strategy = cfs.getCompactionStrategyManager();
        // then ask the strategy for the next task; gcBefore is the cutoff before which tombstones may be purged
        AbstractCompactionTask task = strategy.getNextBackgroundTask(getDefaultGcBefore(cfs, FBUtilities.nowInSeconds()));
        if (task == null)
        {
            logger.trace("No tasks available");
            return;
        }
        task.execute(metrics);
    }
    finally
    {
        compactingCF.remove(cfs);
    }
    submitBackground(cfs); // re-submit: more compactions may still be pending for this CF
}

/**
 * Return the next background task
 *
 * Returns a task for the compaction strategy that needs it the most (most estimated remaining tasks)
 *
 */
public synchronized AbstractCompactionTask getNextBackgroundTask(int gcBefore)
{
    if (!isEnabled())
        return null;

    maybeReload(cfs.metadata);
    
    // sstables are split into a repaired set and an unrepaired set;
    // whichever set has the larger estimated backlog of remaining tasks is tried first
    if (repaired.getEstimatedRemainingTasks() > unrepaired.getEstimatedRemainingTasks())
    {
        AbstractCompactionTask repairedTask = repaired.getNextBackgroundTask(gcBefore);
        if (repairedTask != null)
            return repairedTask;
        return unrepaired.getNextBackgroundTask(gcBefore);
    }
    else
    {
        AbstractCompactionTask unrepairedTask = unrepaired.getNextBackgroundTask(gcBefore);
        if (unrepairedTask != null)
            return unrepairedTask;
        return repaired.getNextBackgroundTask(gcBefore);
    }
}

/**
 * the only difference between background and maximal in LCS is that maximal is still allowed
 * (by explicit user request) even when compaction is disabled.
 */
@SuppressWarnings("resource")
public synchronized AbstractCompactionTask getNextBackgroundTask(int gcBefore)
{
    while (true)
    {
        OperationType op;
        // pick the compaction candidates
        LeveledManifest.CompactionCandidate candidate = manifest.getCompactionCandidates();
        if (candidate == null)
        {   // no regular compaction candidates were found,
            // so check whether any sstable holds enough expired tombstones to be worth rewriting
            SSTableReader sstable = findDroppableSSTable(gcBefore);
            if (sstable == null)
            {
                logger.trace("No compaction necessary for {}", this);
                return null;
            }
            candidate = new LeveledManifest.CompactionCandidate(Collections.singleton(sstable),
                                                                sstable.getSSTableLevel(),
                                                                getMaxSSTableBytes());
            op = OperationType.TOMBSTONE_COMPACTION;
        }
        else
        {
            op = OperationType.COMPACTION;
        }

        LifecycleTransaction txn = cfs.getTracker().tryModify(candidate.sstables, OperationType.COMPACTION);
        if (txn != null)
        {
            // wrap the candidates in a leveled compaction task
            LeveledCompactionTask newTask = new LeveledCompactionTask(cfs, txn, candidate.level, gcBefore, candidate.maxSSTableBytes, false);
            newTask.setCompactionType(op);
            return newTask;
        }
    }
}
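The tombstone fallback above relies on each sstable's estimated ratio of droppable (expired) tombstones, compared against the strategy's tombstone_threshold option (0.2 by default). Here is a minimal sketch of that check with hypothetical stand-in types; the real findDroppableSSTable also walks the levels and honors tombstone_compaction_interval, which this sketch omits.

import java.util.Arrays;
import java.util.List;

public class TombstoneFallbackSketch
{
    static final double TOMBSTONE_THRESHOLD = 0.2; // default tombstone_threshold

    // hypothetical stand-in for SSTableReader's tombstone estimate
    static class SSTable
    {
        final String name;
        final double droppableTombstoneRatio; // estimated fraction of purgeable tombstones
        SSTable(String name, double ratio) { this.name = name; this.droppableTombstoneRatio = ratio; }
    }

    // return any sstable worth rewriting purely to purge tombstones, or null
    static SSTable findDroppable(List<SSTable> level)
    {
        for (SSTable s : level)
            if (s.droppableTombstoneRatio > TOMBSTONE_THRESHOLD)
                return s;
        return null;
    }

    public static void main(String[] args)
    {
        List<SSTable> level2 = Arrays.asList(new SSTable("a-1", 0.05), new SSTable("a-2", 0.35));
        System.out.println(findDroppable(level2).name); // prints a-2
    }
}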

/**
 * @return highest-priority sstables to compact, and level to compact them to
 * If no compactions are necessary, will return null
 */
public synchronized CompactionCandidate getCompactionCandidates()
{
    // during bootstrap we only do size tiering in L0 to make sure
    // the streamed files can be placed in their original levels
    if (StorageService.instance.isBootstrapMode())
    {
        List<SSTableReader> mostInteresting = getSSTablesForSTCS(getLevel(0));
        if (!mostInteresting.isEmpty())
        {
            logger.info("Bootstrapping - doing STCS in L0");
            return new CompactionCandidate(mostInteresting, 0, Long.MAX_VALUE);
        }
        return null;
    }
    // LevelDB gives each level a score (how much data it holds versus its ideal amount)
    // and compacts the level with the highest score. But that falls apart once a level
    // gets behind. For example, suppose L0 has 988 sstables (ideal: 4), L1 has 117
    // (ideal: 10), and L2 has 12 (ideal: 100).
    // The problem: L0's score (~250) is far higher than L1's (~12), so we would compact a
    // MAX_COMPACTING_L0-sized batch of L0 together with all 117 L1 sstables and write the
    // result back to L1. The next L0 batch then has to compact against the ~120 sstables
    // now in L1, and so on: L1 is rewritten over and over, and all the i/o is spent on L1.
    // LevelDB's way out is to block writes once L0 compaction falls behind; we cannot
    // afford that, so we take a different approach:
    // 1. compact higher levels first, which minimizes the total i/o needed, and
    // 2. if L0 falls badly behind, size-tier (STCS) it to keep read overhead down until
    //    the higher-level scores catch up.
    // This is no cure-all: sustained write overload will still overwhelm LCS, but it
    // handles intermittent bursts of writes well.
    for (int i = generations.length - 1; i > 0; i--)
    {
        List<SSTableReader> sstables = getLevel(i);
        if (sstables.isEmpty())
            continue; // mostly this just avoids polluting the debug log with zero scores
        // we want to calculate score excluding compacting ones
        Set<SSTableReader> sstablesInLevel = Sets.newHashSet(sstables);
        Set<SSTableReader> remaining = Sets.difference(sstablesInLevel, cfs.getTracker().getCompacting());
        // score = total bytes in this level / maximum bytes allowed for this level
        double score = (double) SSTableReader.getTotalBytes(remaining) / (double)maxBytesForLevel(i, maxSSTableSizeInBytes);
        logger.trace("Compaction score for level {} is {}", i, score);

        if (score > 1.001) // the level holds more data than its maximum allowed size
        {
            // before proceeding with a higher level, let's see if L0 is far enough behind to warrant STCS
            CompactionCandidate l0Compaction = getSTCSInL0CompactionCandidate();
            if (l0Compaction != null) // L0 has fallen too far behind, so do STCS there first
                return l0Compaction;

            // L0 is fine, proceed with this level
            Collection<SSTableReader> candidates = getCandidatesFor(i);
            if (!candidates.isEmpty())
            {
                int nextLevel = getNextLevel(candidates);
                // getOverlappingStarvedSSTables resets the target level's compaction counter and
                // checks for starved levels: sstables from a level that has gone too long without
                // compaction are pulled in too (if they overlap the candidates and are not already
                // compacting), since levels holding very little data might otherwise never be compacted
                candidates = getOverlappingStarvedSSTables(nextLevel, candidates);
                if (logger.isTraceEnabled())
                    logger.trace("Compaction candidates for L{} are {}", i, toString(candidates));
                return new CompactionCandidate(candidates, nextLevel, cfs.getCompactionStrategyManager().getMaxSSTableBytes());
            }
            else
            {
                logger.trace("No compaction candidates for L{}", i);
            }
        }
    }

    // Higher levels are happy, time for a standard, non-STCS L0 compaction
    if (getLevel(0).isEmpty())
        return null;
    Collection<SSTableReader> candidates = getCandidatesFor(0);
    if (candidates.isEmpty())  // no regular L0 candidates either, so consider an STCS compaction in L0
    {
        // Since we don't have any other compactions to do, see if there is a STCS compaction to perform in L0; if
        // there is a long running compaction, we want to make sure that we continue to keep the number of SSTables
        // small in L0.
        return getSTCSInL0CompactionCandidate();
    }
    return new CompactionCandidate(candidates, getNextLevel(candidates), cfs.getCompactionStrategyManager().getMaxSSTableBytes());
}
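To make the score concrete: each level's capacity is derived from a fixed sstable size target (160 MB by default), growing by a factor of 10 per level. The sketch below approximates LeveledManifest's maxBytesForLevel formula (the L0 special case here is my assumption, not a quote of the implementation) and computes one score:

public class LevelScoreSketch
{
    // approximate per-level capacity: L1 = 10 sstables' worth, L2 = 100, L3 = 1000, ...
    static long maxBytesForLevel(int level, long maxSSTableSizeInBytes)
    {
        if (level == 0)
            return 4L * maxSSTableSizeInBytes; // L0 special case (assumption)
        return (long) (Math.pow(10, level) * maxSSTableSizeInBytes);
    }

    public static void main(String[] args)
    {
        long sstableSize = 160L * 1024 * 1024;  // default sstable_size_in_mb = 160
        long bytesInL1 = 17L * sstableSize;     // suppose L1 currently holds 17 sstables' worth
        double score = (double) bytesInL1 / maxBytesForLevel(1, sstableSize);
        System.out.printf("L1 score = %.2f%n", score); // 1.70 > 1.001, so L1 is eligible
    }
}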

/**
 * @return highest-priority sstables to compact for the given level.
 * If no compactions are possible (because of concurrent compactions or because some sstables are blacklisted
 * for prior failure), will return an empty list.  Never returns null.
 */
private Collection<SSTableReader> getCandidatesFor(int level)
{
    assert !getLevel(level).isEmpty();
    logger.trace("Choosing candidates for L{}", level);

    final Set<SSTableReader> compacting = cfs.getTracker().getCompacting();

    if (level == 0) // L0 gets its own candidate-selection logic
    {
        // collect the L0 sstables that are already being compacted
        Set<SSTableReader> compactingL0 = getCompacting(0);

        // then find the smallest and largest partition keys covered by those compacting sstables
        PartitionPosition lastCompactingKey = null;
        PartitionPosition firstCompactingKey = null;
        for (SSTableReader candidate : compactingL0)
        {
            if (firstCompactingKey == null || candidate.first.compareTo(firstCompactingKey) < 0)
                firstCompactingKey = candidate.first;
            if (lastCompactingKey == null || candidate.last.compareTo(lastCompactingKey) > 0)
                lastCompactingKey = candidate.last;
        }

        // L0 is the dumping ground for new sstables which thus may overlap each other.
        //
        // We treat L0 compactions specially:
        // 1a. add sstables to the candidate set until we have at least maxSSTableSizeInMB
        // 1b. prefer choosing older sstables as candidates, to newer ones
        // 1c. any L0 sstables that overlap a candidate, will also become candidates
        // 2. At most MAX_COMPACTING_L0 sstables from L0 will be compacted at once
        // 3. If total candidate size is less than maxSSTableSizeInMB, we won't bother compacting with L1,
        //    and the result of the compaction will stay in L0 instead of being promoted (see promote())
        //
        // Note that we ignore suspect-ness of L1 sstables here, since if an L1 sstable is suspect we're
        // basically screwed, since we expect all or most L0 sstables to overlap with each L1 sstable.
        // So if an L1 sstable is suspect we can't do much besides try anyway and hope for the best.
        Set<SSTableReader> candidates = new HashSet<>();
        Set<SSTableReader> remaining = new HashSet<>();
        // start from every L0 sstable that is not marked suspect
        Iterables.addAll(remaining, Iterables.filter(getLevel(0), Predicates.not(suspectP)));
        // walk the remaining sstables ordered by age, oldest first
        for (SSTableReader sstable : ageSortedSSTables(remaining))
        {
            // skip sstables that are already candidates
            if (candidates.contains(sstable))
                continue;

            // any remaining sstable whose token range overlaps this one is added to the set too;
            // "overlap" here means the [first, last] token ranges of the two sstables intersect,
            // the assumption being that overlapping token ranges may imply overlapping content
            Sets.SetView<SSTableReader> overlappedL0 = Sets.union(Collections.singleton(sstable), overlapping(sstable, remaining));
            if (!Sets.intersection(overlappedL0, compactingL0).isEmpty())
                continue;  // skip if this sstable, or anything overlapping it, is already being compacted
            // otherwise each sstable in overlappedL0 becomes a candidate, provided it does not
            // overlap the key range [firstCompactingKey, lastCompactingKey] of the compacting L0 sstables
            for (SSTableReader newCandidate : overlappedL0)
            {
                if (firstCompactingKey == null || lastCompactingKey == null || overlapping(firstCompactingKey.getToken(), lastCompactingKey.getToken(), Arrays.asList(newCandidate)).size() == 0)
                    candidates.add(newCandidate);
                remaining.remove(newCandidate); // either way this sstable is done: it overlapped a
                // compacting sstable or it joined the candidates, so drop it from the remaining set
            }

            // once there are more than MAX_COMPACTING_L0 candidates, keep only the MAX_COMPACTING_L0 oldest
            if (candidates.size() > MAX_COMPACTING_L0)
            {
                // limit to only the MAX_COMPACTING_L0 oldest candidates
                candidates = new HashSet<>(ageSortedSSTables(candidates).subList(0, MAX_COMPACTING_L0));
                break;
            }
        }
        
        // if the candidates add up to more than one full sstable's worth of data, the overlapping L1 sstables are compacted in as well;
        // leave everything in L0 if we didn't end up with a full sstable's worth of data
        if (SSTableReader.getTotalBytes(candidates) > maxSSTableSizeInBytes)
        {
            // add sstables from L1 that overlap candidates
            // if the overlapping ones are already busy in a compaction, leave it out.
            // TODO try to find a set of L0 sstables that only overlaps with non-busy L1 sstables
            // the L1 sstables that overlap the candidates' overall [min, max] token range
            Set<SSTableReader> l1overlapping = overlapping(candidates, getLevel(1));
            // if any of those L1 sstables are already compacting, abandon this L0 compaction
            if (Sets.intersection(l1overlapping, compacting).size() > 0)
                return Collections.emptyList();
            // likewise abandon it if the candidates overlap L0 sstables that are already compacting
            if (!overlapping(candidates, compactingL0).isEmpty())
                return Collections.emptyList();
            candidates = Sets.union(candidates, l1overlapping);
        }
        if (candidates.size() < 2)
            return Collections.emptyList();
        else
            return candidates;
    }

    // for non-L0 compactions, pick up where we left off last time
    Collections.sort(getLevel(level), SSTableReader.sstableComparator);
    int start = 0; // handles case where the prior compaction touched the very last range
    for (int i = 0; i < getLevel(level).size(); i++)
    {
        SSTableReader sstable = getLevel(level).get(i);
        if (sstable.first.compareTo(lastCompactedKeys[level]) > 0)
        {
            start = i;
            break;
        }
    }

    // look for a non-suspect keyspace to compact with, starting with where we left off last time,
    // and wrapping back to the beginning of the generation if necessary
    for (int i = 0; i < getLevel(level).size(); i++)
    {
        SSTableReader sstable = getLevel(level).get((start + i) % getLevel(level).size());
        Set<SSTableReader> candidates = Sets.union(Collections.singleton(sstable), overlapping(sstable, getLevel(level + 1)));
        if (Iterables.any(candidates, suspectP))
            continue;
        if (Sets.intersection(candidates, compacting).isEmpty())
            return candidates;
    }

    // all the sstables were suspect or overlapped with something suspect
    return Collections.emptyList();
}
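All of the overlapping(...) calls above reduce to an interval-intersection test on each sstable's [first, last] token bounds. A minimal sketch with hypothetical types (the real helper lives in LeveledManifest and operates on SSTableReader and Token):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class OverlapSketch
{
    // hypothetical stand-in for an sstable's first/last token bounds
    static class Range
    {
        final long first, last;
        Range(long first, long last) { this.first = first; this.last = last; }
    }

    // two token ranges overlap when the closed intervals intersect
    static boolean overlaps(Range a, Range b)
    {
        return a.first <= b.last && b.first <= a.last;
    }

    // collect every member of others that overlaps the given range
    static List<Range> overlapping(Range sstable, Collection<Range> others)
    {
        List<Range> out = new ArrayList<>();
        for (Range r : others)
            if (overlaps(sstable, r))
                out.add(r);
        return out;
    }

    public static void main(String[] args)
    {
        Range candidate = new Range(10, 50);
        List<Range> l1 = Arrays.asList(new Range(0, 9), new Range(40, 80), new Range(90, 120));
        System.out.println(overlapping(candidate, l1).size()); // 1: only [40, 80] intersects
    }
}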

private CompactionCandidate getSTCSInL0CompactionCandidate()
{
    // if STCS-in-L0 has not been disabled and L0 holds more than MAX_COMPACTING_L0 sstables,
    // fall back to size-tiered compaction inside L0 to catch up
    if (!DatabaseDescriptor.getDisableSTCSInL0() && getLevel(0).size() > MAX_COMPACTING_L0)
    {
        List<SSTableReader> mostInteresting = getSSTablesForSTCS(getLevel(0));
        if (!mostInteresting.isEmpty())
        {
            logger.debug("L0 is too far behind, performing size-tiering there first");
            return new CompactionCandidate(mostInteresting, 0, Long.MAX_VALUE);
        }
    }

    return null;
}

STCS groups sstables into buckets by size. Walking the size-sorted list, an sstable joins the current bucket when its size falls within a band around the bucket's running average (between 0.5x and 1.5x of the average by default); the average is then recomputed and the next sstable is compared against the updated band. The result is a set of buckets whose members are all roughly the same size.

Among those buckets, the hottest one (the bucket whose sstables serve the most reads) is compacted first: compacting the most-read sstables yields the biggest read-performance benefit. getSSTablesForSTCS delegates exactly this to the STCS helpers:

private List<SSTableReader> getSSTablesForSTCS(Collection<SSTableReader> sstables)
{
    Iterable<SSTableReader> candidates = cfs.getTracker().getUncompacting(sstables);
    List<Pair<SSTableReader,Long>> pairs = SizeTieredCompactionStrategy.createSSTableAndLengthPairs(AbstractCompactionStrategy.filterSuspectSSTables(candidates));
    List<List<SSTableReader>> buckets = SizeTieredCompactionStrategy.getBuckets(pairs,
                                                                                options.bucketHigh,
                                                                                options.bucketLow,
                                                                                options.minSSTableSize);
    // return the hottest bucket that has between 4 and 32 sstables (the STCS min/max thresholds)
    return SizeTieredCompactionStrategy.mostInterestingBucket(buckets, 4, 32);
}
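The bucketing itself is easy to sketch. The version below uses the default thresholds (bucket_low = 0.5, bucket_high = 1.5, min_sstable_size = 50 MB) and approximates SizeTieredCompactionStrategy.getBuckets on raw sizes; the real method works on (sstable, length) pairs, and hotness is applied afterwards in mostInterestingBucket.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BucketSketch
{
    static List<List<Long>> getBuckets(List<Long> sizes, double bucketLow, double bucketHigh, long minSize)
    {
        Collections.sort(sizes); // smallest first, so buckets grow monotonically
        List<List<Long>> buckets = new ArrayList<>();
        List<Long> current = null;
        double avg = 0;
        for (long size : sizes)
        {
            // join the current bucket if size is within [bucketLow*avg, bucketHigh*avg],
            // or if both this sstable and the bucket average are tiny (below minSize)
            if (current != null && ((size > avg * bucketLow && size < avg * bucketHigh)
                                    || (size < minSize && avg < minSize)))
            {
                current.add(size);
                avg = (avg * (current.size() - 1) + size) / current.size(); // recompute running average
            }
            else
            {
                current = new ArrayList<>(Collections.singletonList(size));
                avg = size;
                buckets.add(current);
            }
        }
        return buckets;
    }

    public static void main(String[] args)
    {
        List<Long> mb = new ArrayList<>(Arrays.asList(100L, 110L, 105L, 400L, 420L, 2000L));
        // prints [[100, 105, 110], [400, 420], [2000]]
        System.out.println(getBuckets(mb, 0.5, 1.5, 50));
    }
}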
