Repost: Cloud Computing Conference Notes

Fall Forecast 2008, Computing Among the Clouds

Just a quick note on the one-day, cloud-computing-focused conference I just attended, Computing Among the Clouds. Of particular interest to this blog was a presentation by Joe Gregorio on Google App Engine (GAE).

The talk basically covered the GAE tutorial, so I personally didn’t get anything out of it, but I serendipitously sat next to him during the previous talk and had a chance to talk to him during the intermission about a hot-button topic: bulk data loading.

A little context: I recently went looking for a way to set a key_name value using the packaged bulk-loading tools. By default, the bulk load tools turn CSV rows into Datastore entities using a sequential numerical ID and not a key_name, even when there is a “key_name” column in the type descriptor. I wanted to create entities with varchar keys without having to create a new entity in a custom handler, in an effort to minimize CPU usage during uploads, which is a known problem. In the package google.appengine.ext.bulkload there is a pair of methods that set a key_name if one is defined in the uploaded data, but these are tied to a cryptic mention of a “version-1” format.
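To make the difference concrete, here is a minimal sketch in plain Python. It does not use the App Engine SDK or google.appengine.ext.bulkload at all; the function load_rows, the PROPERTIES descriptor, the use_key_name flag, and the sample data are all hypothetical names made up for illustration. It only models the two keying behaviours described above: sequential numeric IDs versus keys taken from a “key_name” column.

```python
import csv
import io
from itertools import count

# Hypothetical type descriptor: (column name, converter) pairs in CSV column order.
PROPERTIES = [("key_name", str), ("title", str), ("pages", int)]


def load_rows(csv_text, properties, use_key_name=False):
    """Turn CSV rows into (key, properties) pairs.

    use_key_name=False: each row gets the next sequential integer ID and
    the key_name column is stored as an ordinary property.
    use_key_name=True: the key_name column becomes the entity key and is
    removed from the stored properties.
    """
    next_id = count(1)
    entities = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Convert each cell with the converter declared for its column.
        values = {name: convert(cell)
                  for (name, convert), cell in zip(properties, row)}
        if use_key_name:
            key = values.pop("key_name")
        else:
            key = next(next_id)
        entities.append((key, values))
    return entities


# Made-up sample data, not from any real upload.
SAMPLE = "moby-dick,Moby Dick,635\nulysses,Ulysses,730\n"

if __name__ == "__main__":
    print(load_rows(SAMPLE, PROPERTIES))
    print(load_rows(SAMPLE, PROPERTIES, use_key_name=True))
```

The first call models the default behaviour I was trying to get away from; the second models what I actually wanted out of the stock tools, with no per-row custom handler involved.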

I asked Joe whether he knew what these methods were about, whether Google was working on better tools for data upload / download / sync, or at least whether he knew what “version 1” format data was or might possibly refer to, and he pleaded ignorance on all counts. Reflecting on this conversation, I think I have to call bullshit here, at the risk of going against my “no negative vibes” mantra. I think he knew exactly what I was talking about and for whatever reason was not at liberty to disclose details. Which would have been a fine answer by me, frankly.

Why am I calling Joe out about this? Well, mainly because I had just found the methods the night before the conference while researching a project, and he was the GAE representative at the conference. Sorry, but them’s the breaks.

Why do I think he could have given me a more reasonable answer than pleading complete ignorance? Two reasons: (1) Protocol Buffers and (2) the released protocol buffer version 2 code for the memcached API on the groups list. Version 1, I think, refers to Protocol Buffers version 1, which has just been upgraded to version 2, and GAE has already announced that the V2 specs are going through QA. My thinking is that this is either someone’s 20% project, or that protocol buffer clients and servers are used internally at Google to load data (or both), and somehow these methods have ended up in the HEAD branch by mistake. There is certainly no released client that talks to these server methods, and no documentation elsewhere in the code base, official API reference, or articles that hints at how the PB loads would work or what is required to make them work.

Now, this is all perfectly understandable, since PB V2 is coming soon to all parts of GAE, and it would be confusing, to say the least, to release some uber-complicated stream protocol that is soon to be replaced. But don’t plead complete ignorance; that’s just insulting.