squid Cache Digest

                                                      Martin Hamilton
                                              JANET Web Cache Service
                                                        Alex Rousskov
                                                                NLANR
                                                        Duane Wessels
                                                                NLANR
                                                        December 1998




                Cache Digest specification - version 5




Status of this memo


This draft document may be updated, replaced, or made obsolete by
other documents at any time. It is inappropriate to use this draft as
reference material or to cite them other than as work in progress.


Cache Digests are experimental and subject to change.  This isn't just
an idle threat - the current specification *will* change soon!  These
changes are likely to affect both the underlying data format and the
digest exchange mechanism.  This document describes the Squid 2.[01]
implementation, specifically that in Squid 2.1PATCH2.




Abstract


This document describes the Cache Digest exchange protocol and data
format, as implemented by version 2 of the Squid WWW cache server.
Cache Digests provide a mechanism for (primarily) WWW cache servers to
cooperate without the latency and congestion problems associated with
previous protocols such as ICP.




1. What are Cache Digests ?


Cache Digests are a response to the problems of latency and congestion
associated with previous inter-cache communications mechanisms such as
the Internet Cache Protocol (ICP) [RFC 2186, RFC 2187] and the
HyperText Cache Protocol [HTCP-ID].  Unlike most of these protocols,
Cache Digests support peering between cache servers without a
request-response exchange taking place.  Instead, a summary of the
contents of the server (the Digest) is fetched by other servers which
peer with it.  Using Cache Digests it is possible to determine with a
relatively high degree of accuracy whether a given URL is cached by a
particular server.  This is done by feeding the URL and the HTTP
method by which it is being requested into a hash function which
returns a list of bits to test against in the Cache Digest [DIGESTS,
BLOOM].


Cache Digests are both a protocol and a data format, in the sense that
the construction of the Cache Digest itself is well defined, and there
is a well defined protocol for fetching Cache Digests over a network -
currently via HTTP.  It's possible that Cache Digests could be
exchanged via other mechanisms, in addition to HTTP, e.g. via FTP.
The Cache Digest is calculated internally by the cache server and can
exist as (for instance) a cached object like any other - subject to
object refresh and expiry rules.


Although Cache Digests as currently conceived are intended primarily
for use in sharing summaries of which URLs are cached by a given
server, this capability can be extended to cover other data sources.
For example, an FTP mirror server might make a Cache Digest available
which indicated matches for all of the URLs by which the resources it
mirrored may be accessed.  This is potentially a very powerful
mechanism for eliminating redundancy and making better use of Internet
server and bandwidth resources.


Cache Digests have been implemented in version 2 of the Squid WWW
Cache Server [SQUID], though they are still considered experimental
and (at the time of writing) must be explicitly enabled when Squid is
compiled.




2. Public keys


Although we've spoken so far about URLs, the keys which are looked up
in Cache Digests are actually formed by performing the MD5 [RFC 1321]
digest function on the concatenation of:


  1. a numeric code for the HTTP method used, and
  2. the URL requested.


Internally within Squid these MD5 values are referred to as 'public
keys', and we'll reuse this terminology here to keep things simple.


Legal values for the method number, and their corresponding HTTP
methods, are:


  GET       1
  POST      2
  PUT       3
  HEAD      4
  CONNECT   5
  TRACE     6
  PURGE     7


The method value is stored as an 8 bit integer - see below for
bit/byte ordering details.  


For example, the public key (MD5 digest value) for the
'http://www.w3.org/' URL fetched via the HTTP 'GET' method is:


  e06a56257d8879d9e968e83f2ded3df7


This is calculated on:


  ASCII:  GET  h  t  t  p  :  /  /  w  w  w  .  w  3  .  o  r  g  /
  Hex:    01   68 74 74 70 3a 2f 2f 77 77 77 2e 77 33 2e 6f 72 67 2f


[ Note that versions of Squid prior to 2.1PATCH2 packed the HTTP
method value incorrectly.  Unfortunately, the Cache Digests version
was not updated in 2.1PATCH2.  Thus, all Cache Digests with versions
0-3 and some with version 4 may contain the endian-ness bug. ]




3. The Cache Digest itself


The Cache Digest is an array of bits.  A hash function is used to
derive the indices into the array which should be set (to register the
presence of a public key) or tested (to look up a given public key in
a given Digest).


In Squid the number of bits in the Digest per public key is currently
set to 5 by default.  The size of the Digest itself depends on both
the number of bits consumed per entry, and the number of public keys
which may be stored in the Digest - referred to as its capacity.


The number of bytes used for the digest can be written as:
  
  digest_size = int((capacity * bits_per_entry + 7) / 8);


Note that these values must be communicated to any entity which is to
make use of the Cache Digest - this is the function of the Cache
Digest header (see below).




4. The Cache Digest hash function


In order to register a given public key in a Cache Digest, a number of
indices into the array (in Squid - four by default, though this may
change) are calculated using a hash function.  This carries out the
following steps:


  1) Calculate the public key from the HTTP method and URL.


  2) Split the resulting 128 bit value into N chunks, say:


       temp_key[0] .. temp-key[N-1]


  3) For each of these N chunks, the corresponding index into the
     cache digest is the digest value modulo the number of bits
     in the digest:


       hash_key[i] = temp_key[i] % (digest_size * 8);


The resulting N indices are the result of the function. N is called
the dimension of a hashing function.


In the 'GET http://www.w3.org/' example above, the public key is:


  e06a56257d8879d9e968e83f2ded3df7


In the default Squid scheme (N is 4) we would split it into the following
four chunks:


  temp_key[0] = 0xe06a5625;
  temp_key[1] = 0x7d8879d9;
  temp_key[2] = 0xe968e83f;
  temp_key[3] = 0x2ded3df7;


And calculate the indices into the cache digest as follows:


  hash_key[0] = temp_key[0] % (digest_size * 8);
  hash_key[1] = temp_key[1] % (digest_size * 8);
  hash_key[2] = temp_key[2] % (digest_size * 8);
  hash_key[3] = temp_key[3] % (digest_size * 8);


Giving us the following four lookup keys (assuming digest_size is 16 
bytes):


  hash_keys[0] = 0x05;
  hash_keys[1] = 0x29;
  hash_keys[2] = 0x5f;
  hash_keys[3] = 0x17;


Note that these are indices into a bit array!




5. Adding a public key to a Cache Digest


Given a public key, the bit in the Cache Digest bit array associated
with each of the indices (as calculated above) should be set.




6. Testing a Cache Digest for a given public key


A given public key is considered to be a match if and only if all of
the bits associated with it (as outlined above) in the Cache Digest are
set to 1.




7. Deleting a public key


You can't delete a single public key from the Digest without
rebuilding it from scratch.  Simply clearing the bits in the Digest
associated with this public key may also clear (some of the) bits
associated with other public keys.




8. Fetching a Cache Digest


To fetch the Cache Digest from the Squid server listening for proxy
HTTP requests on port 3128 of the host 'test.lut.ac.uk',
a proxy HTTP request should be sent to it for the URL:


  http://test.lut.ac.uk:3128/squid-internal-periodic/store_digest


Clients must use 'If-Modified-Since' requests if they have old
Digests, to avoid inadvertently transferring the same Digest more than
once.




The Cache Digest response is formatted as a regular HTTP response
with the Internet Media Type (MIME type):


  application/cache-digest


The normal HTTP status codes should be used to deal with, for
instance, authentication and access controls.  A successful response
would look something like:


  HTTP/1.0 200 OK
  Server: Squid/2.1.PATCH2+Martin
  Mime-Version: 1.0
  Date: Sun, 13 Dec 1998 04:26:46 GMT
  Content-Type: application/cache-digest
  Content-Length: 142
  Expires: Sun, 13 Dec 1998 05:26:46 GMT
  Last-Modified: Sun, 13 Dec 1998 04:26:46 GMT
  X-Cache: HIT from test.lut.ac.uk
  X-Cache-Lookup: HIT from test.lut.ac.uk:3128
  Age: 284
  Proxy-Connection: close


  <followed by body of response>


The body of the response consists of the Cache Digest header, a fixed
length (binary) data structure which specifies various aspects of the
Digest, followed by the Cache Digest itself.  The header and digest
combined come to the total number of bytes specified in the HTTP
Content-Length header - 142 in this case.


Note that all numbers, including the bits of the Cache Digest itself
and the header which precedes it, are transfered in 'network'
(big-endian, most significant byte first) order.




9. Format of the Cache Digest Header


The Cache Digest header is (currently!) a 128 byte data structure
which contains the following fields:


  Field name           Type   Bytes  Function
  --------------------------------------------------------------
  Current version      n      2      Version of this Digest
  Required version     n      2      Minimum version required
  Capacity             N      4      Number of URLs we can fit in
  Count                N      4      Number of URLs actually in digest
  Deletion count       N      4      Number of URL deletion attempts
  Size in bytes        N      4      Size of digest, in bytes
  Bits per entry       C      1      Number of bits per digest entry
  Hash func dimension  C      1      Number of hash function chunks
  Reserved             n      2      Reserved for future expansion
  Reserved             N26    104    Reserved for future expansion


Where:


  C: 8 bit  (1 byte) wide unsigned integer/unsigned char
  n: 16 bit (2 byte) signed integer
  N: 32 bit (4 byte) signed integer


As noted above, "Capacity" is the number of URLs we expected to be in
the digest when we built it.  Squid calculates this value based on the
number of URLs which were cached by the server at the time the digest
was calculated, and recalculates the digest at periodic intervals
-- every hour by default.


"Required version" is the minimum version the decoding algorithm must
support to interpret the digest correctly. If a receiver does not support
"Required version", the entire reply must be ignored. No attempts to
guess a part of the header must be made in the latter case.


The "Reserved" fields are currently unused and must be set to zero.




10. Security considerations


If the contents of the Cache Digest are in any way sensitive, the
Cache Administrator should take care to protect it from access by The
Wrong People.  Note that any methods which would normally be applied
to secure an HTTP connection can be applied to Cache Digests,
e.g. Message Digest Authentication [RFC 2069] or Secure Sockets
Layer/TLS [TLS-ID].


Note that these access control issues also apply to Cache Digests
which are cached alongside regular HTTP objects such as WWW pages - if
not, it may be possible for third parties to download the Cache Digest
of a server without the administrator of the server being aware that
this has happened.  For this reason it may be desirable to have a flag
in the Cache Digest header which indicates whether redistribution is
permitted/encouraged.  Squid does not currently support such a flag,
however.


Since the algorithm is inherently imprecise, the resulting Digest is
likely to contain some false hits for URLs which the cache doesn't
actually contain (as well as some false misses).  If the cache is
configured so as to allow its peers to only request objects which are
already cached (e.g. via Squid's miss_allow option), peers will have
to take care not to assume that a Digest hit means that access to the
server which generated it will be allowed - and should be robust in
the even that access is denied.  This may result in unintentional
denial of service.


Cache A can build a fake peer Digest for cache B and serve it to B's
peers if requested. This way A can direct traffic toward/from B.  The
impact of this problem is minimized by the 'pull' model of
transferring Cache Digests from one server to another, but this
'Trojan horse' attack is possible in a cache mesh.




11. Open issues


The 'application/cache-digest' media type needs to be registered with
IANA, for use with HTTP.


There should be a standard URL for fetching Cache Digests from a
server.  That is, one which doesn't include '/squid-internal-periodic/'.


How to decide the initial capacity?  Squid works on the basis of the
number of cached URLs when it periodically regenerates its Digest, by
dividing the amount of cache disk space in use by the average object
size.


One cache may have more than one digest - e.g. very fresh objects,
somewhat fresh objects, and stale objects.  How to ask for a list of
the digests the cache has?


Rather than exchanging full digests, make "delta" or "diff" style
exchanges of Digests.  For example, when the number of bits changed
since the previous version of the Digest was generated is below a
given threshold.  This implies some degree of synchronization between
Digest aware cache servers, perhaps via DNS style serial numbers in
the Cache Digest headers.


Add info on how to verify that the Digest is not corrupted.  It may be
worth computing an end-to-end checksum for this - e.g. via MD5, since
we already need the MD5 code in order to calculate public keys.


What to do with Cache Digest protocol version numbers in the header.
For example, version 3 implementations should not fetch digests from
version 4 servers, since the set bits corresponding to a given URL
will be different due to the change in HTTP method encoding.


How to add custom info to the header.  Despite the presence of
"reserved" bytes, extending the current format is strongly discouraged.
The next header format will allow for customizations. 


How to calculate the probability of a false hit.


How often to rebuild a Digest.  Squid currently has a hard wired
default of rebuilding the Digest every hour.


How many of the cached object URLs (or percentage of the Digest) to
rebuild in any one chunk - or handle this in its entirety via a
separate thread or (non-blocking) process.




12. Acknowledgments


This work was supported by the Joint Informations Systems Committee
(JISC) of the UK Higher Education funding Councils, as part of the
JANET Web Cache Service, and by the US National Science Foundation
under grants NCR-9521745 and NCR-9616602.




13. References


[RFC 2186] D. Wessels and K. Claffy, "Internet Cache Protocol (ICP),
           version 2", RFC 2186, September 1997.
           <URL:ftp://ftp.ietf.org/rfc/rfc2186.txt>


[RFC 2187] D. Wessels and K. Claffy, "Application of Internet Cache
           Protocol (ICP), version 2", RFC 2187, September 1997.
           <URL:ftp://ftp.ietf.org/rfc/rfc2187.txt>


[HTCP-ID]  P. Vixie and D. Wessels, "Hyper Text Caching Protocol
           (HTCP/0.0)", Internet Draft (work in progress), September
           1998.


[SUMMARY]  L. Fan, P. Cao, J. Almeida, and A. Z. Broder,
  "Summary Cache: A Scalable Wide-Area Web Cache Sharing
  Protocol", Proceedings of SIGCOMM'98,
  <URL:http://www.cs.wisc.edu/%7ecao/papers/summarycache.html>


[DIGESTS]  A. Rousskov and D. Wessels, "Cache Digests",
           Computer Networks and ISDN Systems, 
           Volume 30, Numbers 22-23, November 1998,
           <URL:http://ircache.nlanr.net/Papers/cache-digests.ps.gz>


[BLOOM]    B. Bloom, "Space/time trade-offs in hash coding with
           allowable errors", Communications of the ACM, volume 13,
           pp. 422-426, July 1970.


[SQUID]    D. Wessels, "Squid Homepage", December 1998.
           <URL:http://squid.nlanr.net/>


[RFC 1321] R. Rivest, "The MD5 Message Digest Algorithm", RFC 1321,
           April 1992.  <URL:http://ftp.ietf.org/rfc/rfc132.txt>


[RFC 2069] J. Franks, P. Hallam-Baker, J. Hostetler, P. Leach, A.
           Luotonen, E. Sink and L. Stewart, "An Extension to HTTP:
           Digest Access Authentication", RFC 2069, January 1997.
           <URL:ftp://ftp.ietf.org/rfc/rfc2069.txt>


[TLS-ID]   T. Dierks and C. Allen, "The TLS Protocol, Version 1.0",
           Internet Draft (work in progress), November 1998. 




14. Authors' addresses


Martin Hamilton
JANET Web Cache Service/Department of Computer Studies
Loughborough University
Leics. LE11 3TU, UK
Email: martin@wwwcache.ja.net


Alex Rousskov
National Laboratory for Applied Network Research
NCAR
SCD, Room 22c
1850 Table Mesa Drive
Boulder, CO 80303, USA
Email: rousskov@nlanr.net


Duane Wessels
National Laboratory for Applied Network Research
NCAR
SCD, Room 24D
1850 Table Mesa Drive
Boulder, CO 80303, USA
Email: wessels@nlanr.net


--
$Id: cache-digest-v5.txt,v 1.1 2007/05/13 13:19:07 amosjeffries Exp $

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值