How to calculate bits per character (bpc) of a string?

 
 

A paper I was reading, http://www.cs.toronto.edu/~ilya/pubs/2011/LANG-RNN.pdf, uses bits per character as a test metric for estimating the quality of generative computer models of text but doesn't reference how it was calculated. Googling around, I can't really find anything about it.

Does anyone know how to calculate it? Python preferably, but pseudo-code or anything works. Thanks!

Are you talking about CHAR_BIT as defined in C? tigcc.ticalc.org/doc/limits.html#CHAR_BIT –  woozyking  Jul 23 '13 at 0:31
  
Nope, this is related to information theory and entropy, not actual bit size. –  Newmu  Jul 23 '13 at 0:46

2 Answers

Accepted answer (5 votes)

Bits per character (bpc) is a measure of the performance of compression methods. It is computed by compressing a string, measuring how many bits the compressed representation takes in total, and dividing by the number of symbols (i.e. characters) in the original string. The fewer bits per character the compressed version needs, the more effective the compression method is.
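For instance, here is a minimal sketch of that computation, using zlib as a stand-in compressor (an assumption for illustration only; the paper's numbers come from model-based coding, as explained below):

```python
import zlib

def bits_per_character(text: str) -> float:
    """Compress text and return the compressed size in bits per original character."""
    compressed = zlib.compress(text.encode("utf-8"), 9)  # level 9 = best compression
    total_bits = len(compressed) * 8                     # bytes -> bits
    return total_bits / len(text)

# Highly repetitive text compresses well, so its bpc is low:
print(bits_per_character("hello world, " * 100))
```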

In other words, the authors use their generative language model, among other things, for compression, and make the assumption that the high effectiveness of the resulting compression method indicates high accuracy of the underlying generative model.

In section 1 they state:

The goal of the paper is to demonstrate the power of large RNNs trained with the new Hessian-Free optimizer by applying them to the task of predicting the next character in a stream of text. This is an important problem because a better character-level language model could improve compression of text files (Rissanen & Langdon, 1979) [...]

The Rissanen & Langdon (1979) article is the original description of arithmetic coding, a well-known method for text compression.

Arithmetic coding operates on the basis of a generative language model, such as the one the authors have built. Given a (possibly empty) sequence of characters, the model predicts what character may come next. Humans can do that, too: given the input sequence hello w, we can guess probabilities for the next character. o has high probability (because hello world is a plausible continuation), but characters like h (as in hello where can I find..) or i (as in hello winston) also have non-zero probability. So we can establish a probability distribution of characters for this particular input, and that's exactly what the authors' generative model does as well.
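To make that concrete, here is a toy sketch of such a model: a character bigram model with add-one smoothing. It stands in for the paper's RNN purely for illustration; any model that maps a context to a next-character distribution will do.

```python
from collections import Counter, defaultdict

class BigramCharModel:
    """Toy generative model: P(next char | previous char), add-one smoothed."""

    def __init__(self, corpus: str):
        self.alphabet = sorted(set(corpus))
        self.counts = defaultdict(Counter)
        for prev, nxt in zip(corpus, corpus[1:]):
            self.counts[prev][nxt] += 1

    def distribution(self, context: str) -> dict:
        """Return {char: probability} for the next character given the context."""
        counts = self.counts[context[-1]] if context else Counter()
        total = sum(counts.values()) + len(self.alphabet)  # add-one smoothing
        return {c: (counts[c] + 1) / total for c in self.alphabet}

model = BigramCharModel("hello world, hello world, hello winston")
dist = model.distribution("hello w")
print(max(dist, key=dist.get))  # 'o' is the most likely continuation of "hello w"
```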

This fits naturally with arithmetic coding: given the input sequence encoded so far, the bit sequence for the next character is determined by the probability distribution over possible characters. Characters with high probability get a short bit sequence; characters with low probability get a longer one. Then the next character is read from the input and encoded using the bit sequence determined from that distribution. If the language model is good, the character will have been predicted with high probability, so its bit sequence will be short. The compression then continues with the next character: again the input so far is used to establish a probability distribution, bit sequences are determined, and the actual next character is read and encoded accordingly.

Note that the generative model is used in every step to establish a new probability distribution. So this is an instance of adaptive arithmetic coding.
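Here is a stripped-down sketch of that encoding loop. It uses exact fractions so the interval arithmetic stays honest, and it only reports how many bits the final interval would need; real coders work with finite-precision integers and emit bits incrementally. It reuses the toy BigramCharModel from the sketch above.

```python
from fractions import Fraction
from math import ceil, log2

def arithmetic_code_length(text: str, model) -> int:
    """Narrow an exact interval [low, low + width) one character at a time,
    driven by the model's distribution, and return roughly how many bits
    are needed to single out the final interval."""
    low, width = Fraction(0), Fraction(1)
    for i, ch in enumerate(text):
        dist = model.distribution(text[:i])  # model re-queried each step (adaptive)
        cum = Fraction(0)
        for c in sorted(dist):               # cumulative probability of chars before ch
            if c == ch:
                break
            cum += Fraction(dist[c])
        low += width * cum                   # shift into the sub-interval for ch
        width *= Fraction(dist[ch])          # shrink by P(ch | context)
    # Any number in the final interval identifies the text; ~ -log2(width) bits suffice.
    return ceil(-log2(width)) + 1

text = "hello world, hello world, hello winston"
model = BigramCharModel(text)
bits = arithmetic_code_length(text, model)
print(bits, "bits ->", bits / len(text), "bits per character")
```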

After all input has been read and encoded, the total length (in bits) of the result is measured and divided by the number of characters in the original, uncompressed input. If the model is good, it will have predicted the characters with high accuracy, so the bit sequence used for each character will have been short on average, hence the total bits per character will be low.
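Note that you do not have to run an actual coder to report this number. An ideal arithmetic coder spends about -log2 P(c) bits on a character it predicted with probability P(c), so the bits per character of a model on a text of N characters is just the average -log2 P(c_i | c_1 ... c_(i-1)). This is how bpc is usually computed for language models, and it likely answers the original question most directly. A sketch, again reusing the toy model from above:

```python
from math import log2

def model_bits_per_character(text: str, model) -> float:
    """Average -log2 P(char | context) under the model: the per-character
    code length an ideal arithmetic coder would approach."""
    total_bits = 0.0
    for i, ch in enumerate(text):
        p = model.distribution(text[:i])[ch]  # model's probability of the true char
        total_bits += -log2(p)
    return total_bits / len(text)

text = "hello world, hello world, hello winston"
print(model_bits_per_character(text, BigramCharModel(text)))
```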


Regarding ready-to-use implementations

I am not aware of an implementation of arithmetic coding that allows for easy integration of your own generative language model. Most implementations build their own adaptive model on-the-fly, i.e. they adjust character frequency tables as they read input.

One option for you may be to start with arcode. I looked at the code, and it seems as though it may be possible to integrate your own model, although it's not very easy. The self._ranges member represents the language model; basically it's an array of cumulative character frequencies, so self._ranges[ord('d')] is the total relative frequency of all characters that are less than d (i.e. a, b and c if we assume lower-case alphabetic characters only). You would have to modify that array after every input character and map the character probabilities you get from the generative model to character frequency ranges.
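A sketch of that mapping, assuming the model returns a dict of char to probability. The function name and the integer scaling below are my own inventions, and the exact table layout and update hook arcode expects should be verified against its source; this only illustrates the cumulative-frequency idea described above.

```python
def probs_to_cumulative_ranges(dist: dict, scale: int = 1 << 16) -> list:
    """Turn {char: probability} into a cumulative integer frequency table:
    ranges[i] = scaled total probability of all characters with code point < i.
    (Hypothetical sketch; check arcode's source for its actual expectations.)"""
    ranges = [0] * 257                  # one boundary per byte value, plus the end
    for i in range(256):
        p = dist.get(chr(i), 0.0)
        freq = max(1, round(p * scale)) if p > 0 else 0  # no zero-width ranges
        ranges[i + 1] = ranges[i] + freq
    return ranges

# After every input character, rebuild the table from the generative model
# (attribute name as described in the answer; verify against arcode itself):
# coder._ranges = probs_to_cumulative_ranges(model.distribution(text_so_far))
```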

Excellent introduction and explanation, thank you! –  Newmu  Jul 23 '13 at 4:00
  
The model I'm working on generates a probability distribution for the next character given a fixed length of previous characters, so it looks like it'll be some work but I should be able to figure it out. –   Newmu  Jul 23 '13 at 4:14
  
@Newmu It should definitely be possible. But I agree it will be easy to make mistakes. You'll need to test this carefully after implementing it. (Unfortunately I have no insider knowledge about arcode. The code seemed to be relatively easy to understand, which is why I suggested this module. There are others, too, though. If you do a search for "python arithmetic coding" or similar, you may be able to find implementations more suitable.) –   jogojapan  Jul 23 '13 at 4:31

The sys library has a getsizeof() function; this may be helpful? http://docs.python.org/dev/library/sys



Reposted from: https://www.cnblogs.com/huashiyiqike/p/3571687.html
