Understanding Strings In COM

By Davide Marcato

System Notes

To replicate the steps described in this article, you'll need Windows 95+ or Windows NT 4.0+ and Visual C++ 5.0 or higher.

ANSI and Unicode, char and wchar_t were not enough: COM introduced several new string data types, and the differences and the process of conversion are not always obvious to the uninitiated. This article clarifies the situation once and for all for the benefit of raw COM, ATL and MFC programmers.

Strings, i.e. vectors of alphanumeric characters, are and have always been a fundamental data type in every programming language and platform. Whereas the computer itself prefers to deal with numbers, human beings prefer messages of text to sequences of binary, hexadecimal or even decimal digits. This implies that whenever a piece of software needs to interact with the user (or signal some notable events) some kind of string treatment is likely to come into play.

Until a few years ago strings were just strings, that is, arrays of single-byte data types (char in C/C++) containing the ASCII number of the character at each element. The biggest problem was distinguishing zero-terminated strings (also known as ASCIIZ) from non-zero-terminated arrays. Then came Unicode, a new character set which extended the size of each character from 8 to 16 bits, thus allowing for 65536 theoretical different characters, enough to contain also Far Eastern symbols such as the Kanji standard set. In C/C++ a brand new standard data type was defined to store Unicode strings, wchar_t, and consequently the APIs of Unicode-aware Win32 operating systems that took strings as parameters had to be duplicated to accept both ANSI and Unicode versions.

Just as the Windows programmer community began to get acquainted with this duplication and got into the habit of not assuming anything about the length of a character a priori, COM jumped to the central stage with its burden of new types and aliases. If you are wondering what is the functional difference between an array of OLECHARs and a pointer to a BSTR, when and how it is necessary to convert a string to another type, and what degree of assistance ATL and MFC offer to the developer, this article is for you.

OLECHARs

The main string data type in COM is named OLECHAR, which is the kind of variable expected by almost all COM library functions and well-educated interfaces' methods. An OLECHAR represents a single OLE-compatible character, therefore you can speak of a string only when you have an array of OLECHARs. It is obvious to everyone who has utilized C++ for some time that there is not an OLECHAR built-in data type in the language, as underlined (among other things) by the upper case of the name. The C and C++ standard specifications dictate the existence of only two character types: char and wchar_t. Hence, OLECHAR must be an alias to one of them, and in fact it is. Its relation is established by the standard Win32 header file wtypes.h, which we will meet again later in this article. The following code snippet, adapted from the header file for clarity, represents the official definition of OLECHAR in C/C++:

#if defined(_WIN32) && !defined(OLE2ANSI)
typedef WCHAR OLECHAR;
#else
typedef char OLECHAR;
#endif

The same file defines also the LPOLESTR and LPCOLESTR types:

#if defined(_WIN32) && !defined(OLE2ANSI)
typedef OLECHAR __RPC_FAR *LPOLESTR;
typedef const OLECHAR __RPC_FAR *LPCOLESTR;
#else
typedef LPSTR     LPOLESTR;
typedef LPCSTR    LPCOLESTR;
#endif

as aliases of OLECHAR* and const OLECHAR* in Win32, but aliases of LPSTR and LPCSTR in Windows 3.1x. The __RPC_FAR symbol can be ignored as it expands to nothing, so for all practical purposes LPOLESTR and OLECHAR* can be used interchangeably.

As you can see, the OLECHAR type does not map to the same built-in type on every platform. If the code is compiled on 32-bit Windows, which can be detected from the definition of the _WIN32 preprocessor symbol, all COM characters are Unicode characters (WCHAR is itself a typedef'ed data type that translates to the built-in wchar_t type). If not, then the build is probably targeting Windows 3.1x, which does not support Unicode strings at all, so all the strings are regular old arrays of char. Note that on Sun Solaris, the main UNIX flavor to benefit from a port of the (D)COM implementation to date, OLECHARs are 16-bit Unicode characters exactly as on Win32.

The original Microsoft engineers who designed COM made a pretty courageous decision: they de facto imposed Unicode on everyone in the 32-bit world at a time when the original version of Windows NT was barely taking shape and the doubled amount of RAM required to hold the same strings could easily become problematic due to the high cost of memory. But the decision proved advantageous, as it saved COM developers from having to implement two variants of each interface (and of the coclasses implementing it) just to deal with every possible type of client.

Now we have seen how to define a COM-compliant character and by extension a COM-compliant string, but we have not revealed yet how one can initialize such a string with a string literal. The following statement:

const OLECHAR* pComStr;
pComStr = "I love VCDJ and COM";

does work in Windows 3.1x because only ANSI strings exist there, but will fail to compile on Win32 and Solaris because we are trying to copy an ANSI string to a Unicode array of characters. The following form:

const OLECHAR* pComStr;
pComStr = L"I love VCDJ and COM";

will give the exact opposite results: working on Win32, incorrect on Windows 3.1. What we really need is a way to define the type of a string irrespective of the platform. Nothing could fit the bill better than a macro, as in the code below:

const OLECHAR* pComStr;
pComStr = OLESTR("I love VCDJ and COM");

The OLESTR() macro is translated differently depending on the target of the build process, so we obtain the correct definition in all cases. Wtypes.h reports it as follows, with some secondary adjustments made to clarify the original code:

#if defined(_WIN32) && !defined(OLE2ANSI)
#define OLESTR(str) L##str
#else
#define OLESTR(str) str
#endif
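
To see the mechanics of the macro in isolation, here is a minimal, portable sketch. The names MY_OLE_ANSI, MyOleChar, MY_OLESTR, MyOleStrLen and Greeting are hypothetical stand-ins invented for this illustration and are not part of wtypes.h; the real header keys off _WIN32 and OLE2ANSI as shown above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical switch standing in for the "ANSI OLE" build configuration.
// Define MY_OLE_ANSI before this point to get the narrow-character variant.
#if !defined(MY_OLE_ANSI)
typedef wchar_t MyOleChar;
#define MY_OLESTR(str) L##str      // pastes the L prefix onto the literal
#else
typedef char MyOleChar;
#define MY_OLESTR(str) str         // leaves the literal untouched
#endif

// Counts characters up to the terminator, whatever MyOleChar happens to be.
inline std::size_t MyOleStrLen(const MyOleChar* s)
{
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

// The same source line compiles unchanged in both configurations.
inline const MyOleChar* Greeting()
{
    return MY_OLESTR("I love VCDJ and COM");
}
```

Only the typedef and the macro change between the two configurations; code written against them, such as Greeting(), stays the same.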

 

Note: In all other Win32 API implementations there is a discrepancy between Windows 95 / Windows 98 and Windows NT's string treatment, since the former employs one-byte ANSI characters and the latter internally works only with two-byte Unicode characters. However, when it comes to COM, both operating systems agree on the use of Unicode strings.

At this point you may be curious as to why the data type was called OLECHAR rather than the more obvious COMCHAR. The answer to this question has its roots partly in history and partly in marketing: until a few years ago OLE2, the main family of technologies relying on the COM foundation, was deemed more important than COM itself, hence the acronym OLE spread everywhere. The later change of marketing orientation could not be reflected in the symbol names to avoid breaking a lot of existing and correctly functioning COM/OLE code. (See my Q&A column in VCDJ print and online for extensive info on this sometimes unclear transition of terms and intents.)

OLECHARs are the standard way to create strings in COM code and by far the most comfortable as long as C and C++ are used on both the client side and the server side. Other languages and tools bring their own special constraints, which open the way to another kind of string. That string type is the topic of the next section.

Copyright © 1999 - Visual C++ Developers Journal

 


BSTRs

B-strings, more properly called Basic strings, are a special kind of string format. Instead of comprising just a classic array of characters followed by a NUL character (code \0) that marks the termination of the array, the structure of the data in memory is a superset of an OLECHAR array. In short, a BSTR is a null-terminated array of OLECHARs prefixed by its length. The string length is determined by the count stored in that prefix, not by the position of the first null character, so a BSTR may legitimately contain embedded nulls.
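
The layout can be made concrete with a small portable emulation. This is only a sketch of the memory structure, assuming a 4-byte prefix that holds the length in bytes and a pointer that refers to the first character rather than to the prefix. The names Char16, FakeAllocStringLen, FakeStringLen and FakeFreeString are hypothetical; real code must use the system string API and never build a BSTR by hand.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

typedef char16_t Char16;   // stand-in for a 16-bit OLECHAR

// Emulated block layout: [4-byte length in bytes][characters...][16-bit terminator]
Char16* FakeAllocStringLen(const Char16* src, std::uint32_t charCount)
{
    assert(src != nullptr);
    const std::uint32_t byteLen =
        static_cast<std::uint32_t>(charCount * sizeof(Char16));
    unsigned char* block = static_cast<unsigned char*>(
        std::malloc(sizeof(std::uint32_t) + byteLen + sizeof(Char16)));
    if (block == nullptr)
        return nullptr;
    std::memcpy(block, &byteLen, sizeof byteLen);      // length prefix
    Char16* text = reinterpret_cast<Char16*>(block + sizeof(std::uint32_t));
    std::memcpy(text, src, byteLen);                   // payload, may hold NULs
    text[charCount] = 0;                               // trailing terminator
    return text;                                       // points past the prefix
}

// Reads the character count back from the prefix; no scan for a terminator.
std::uint32_t FakeStringLen(const Char16* s)
{
    if (s == nullptr)
        return 0;
    std::uint32_t byteLen;
    std::memcpy(&byteLen,
                reinterpret_cast<const unsigned char*>(s) - sizeof byteLen,
                sizeof byteLen);
    return static_cast<std::uint32_t>(byteLen / sizeof(Char16));
}

// Releases the whole block, prefix included.
void FakeFreeString(Char16* s)
{
    if (s != nullptr)
        std::free(reinterpret_cast<unsigned char*>(s) - sizeof(std::uint32_t));
}
```

Because the length comes from the prefix, a string such as u"ab\0cd" allocated with a count of 5 reports a length of 5, whereas a scan for the first null character would stop at 2.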

This presence of the length of the object before the actual array data renders these strings suitable for manipulation in high-level tools like Visual Basic (for which this string format was invented in the first place) and Java on a COM-aware virtual machine like Microsoft's JVM. Actually, there is no other way to exchange string-like data with components written in those languages than to employ BSTRs. While in C and C++ the developer has to understand and use the data type in a rather uncomfortable manner, both Visual Basic and Java encapsulate them into their traditional string types, respectively String and java.lang.String. The final developer is therefore shielded from the subtleties of the organization of the raw bytes in memory. Moreover, the tools take care of allocating and freeing the memory required to contain their content without the programmer needing to know how this process works behind the scenes.

This is the bright side of the coin, of course. You, as the hardcore C/C++ engineer, get the tough part of the work, since you need to learn a completely new set of APIs that carry out the basic operations on Basic strings. The family of functions is amazingly named "system strings management API" and its members can easily be distinguished by the "Sys" prefix in their names.

The following code snippet, borrowed from Oleauto.h (this stuff used to be most useful when coupled with Automation, as Visual Basic's COM support was a lot less powerful then), shows the prototypes of each of the functions in the group:

/*---------------------------------------------------------------------*/
/*                            BSTR API                                 */
/*---------------------------------------------------------------------*/
 
     
WINOLEAUTAPI_(BSTR) SysAllocString(const OLECHAR *);
WINOLEAUTAPI_(INT)  SysReAllocString(BSTR *, const OLECHAR *);
WINOLEAUTAPI_(BSTR) SysAllocStringLen(const OLECHAR *, UINT);
WINOLEAUTAPI_(INT)  SysReAllocStringLen(BSTR *, const OLECHAR *, UINT);
WINOLEAUTAPI_(void) SysFreeString(BSTR);
WINOLEAUTAPI_(UINT) SysStringLen(BSTR);
 
     
#ifdef _WIN32
WINOLEAUTAPI_(UINT) SysStringByteLen(BSTR bstr);
WINOLEAUTAPI_(BSTR) SysAllocStringByteLen(LPCSTR psz, UINT len);
#endif

Don't be unnerved by the probably unfamiliar WINOLEAUTAPI_() word preceding all the functions; it is simply a macro defined in the same header file that expands to a long list of modifiers necessary to adjust the calling convention, exportation details, and return type. You can blissfully ignore it for our purposes.

The following table briefly describes the task of each routine:

| Function name | Description |
|---|---|
| SysAllocString() | Allocates a new BSTR and initializes it with an OLECHAR* |
| SysReAllocString() | Reallocates an existing BSTR and initializes it with an OLECHAR* |
| SysAllocStringLen() | Allocates a new BSTR, copies a specified number of characters from the passed OLECHAR* into it, and then appends a null character |
| SysReAllocStringLen() | Reallocates an existing BSTR, copies a specified number of characters from the passed OLECHAR* into it, and then appends a null character |
| SysFreeString() | Deallocates a BSTR |
| SysStringLen() | Returns the number of characters in a BSTR |
| SysStringByteLen() | Returns the length in bytes of a BSTR (Win32 only) |
| SysAllocStringByteLen() | Allocates a BSTR that contains the ANSI string passed as a parameter. Does not perform any ANSI-to-Unicode translation (Win32 only) |

The succinct descriptions provided above, in conjunction with the official documentation, should be everything you need to know to deal with BSTRs. Note that the expected usage pattern is to prepare an array of OLECHARs first and then copy it into the system string.

Basic strings must be allocated and freed manually. But who has the responsibility of doing so when function calls are involved? This is a general COM question and so the answer does not apply solely to strings. If the parameter is input-only (IDL attribute [in]) the caller is responsible for both the creation and the destruction of the variable. If the parameter is output-only (IDL attribute [out]) then the callee is responsible for the allocation of the string, but the caller is expected to free it after use. If the parameter is both input and output (IDL attribute [in, out]) then the caller allocates the string and after the method invocation frees the memory. The callee though is allowed to reallocate the string if necessary to do so before returning it to the caller.
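
The three ownership rules can be sketched in portable C++. This is an illustration under stated assumptions, not COM code: StrAlloc and StrFree are hypothetical stand-ins for SysAllocString() and SysFreeString() that also count live strings, and CountChars, GetName, MakeLonger and RunCaller are invented names playing the roles of methods with [in], [out] and [in, out] parameters and of their caller.

```cpp
#include <cassert>
#include <cstdlib>
#include <cwchar>

// Number of strings currently allocated and not yet freed.
static int g_liveStrings = 0;

wchar_t* StrAlloc(const wchar_t* src)
{
    const std::size_t n = std::wcslen(src) + 1;
    wchar_t* p = static_cast<wchar_t*>(std::malloc(n * sizeof(wchar_t)));
    if (p != nullptr) {
        std::wmemcpy(p, src, n);
        ++g_liveStrings;
    }
    return p;
}

void StrFree(wchar_t* p)
{
    if (p != nullptr) {
        std::free(p);
        --g_liveStrings;
    }
}

// [in]: the callee only reads; the caller owns the string before and after.
std::size_t CountChars(const wchar_t* in)
{
    return in != nullptr ? std::wcslen(in) : 0;
}

// [out]: the callee allocates; the caller must free the result.
bool GetName(wchar_t** out)
{
    assert(out != nullptr);
    *out = StrAlloc(L"component");
    return *out != nullptr;
}

// [in, out]: the caller allocates and finally frees; the callee may
// replace the string, releasing the old one itself.
bool MakeLonger(wchar_t** inout)
{
    assert(inout != nullptr);
    wchar_t* bigger = StrAlloc(L"a considerably longer string");
    if (bigger == nullptr)
        return false;
    StrFree(*inout);
    *inout = bigger;
    return true;
}

// Caller side: every string is released by the caller exactly once.
int RunCaller()
{
    wchar_t* a = StrAlloc(L"input");      // [in]
    CountChars(a);
    StrFree(a);

    wchar_t* b = nullptr;                 // [out]
    if (GetName(&b))
        StrFree(b);

    wchar_t* c = StrAlloc(L"short");      // [in, out]
    MakeLonger(&c);
    StrFree(c);

    return g_liveStrings;                 // 0 when nothing leaked
}
```

In the [in, out] case the caller frees whatever pointer it holds after the call, which may no longer be the one it passed in.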

Obviously these details interest C/C++ developers only, as Visual Basic will continue to treat strings as usual without any special consideration.

BSTR wrappers

Both ATL and MFC offer particular support for simplified BSTR management. ATL does it by means of a specialized wrapper class, CComBSTR, whose declaration in atlbase.h looks like the following (stripped down as usual for clarity and space constraints):

class CComBSTR
{
public:
                    BSTR m_str;
                    CComBSTR();
                    CComBSTR(int nSize, LPCOLESTR sz = NULL);
                    CComBSTR(LPCOLESTR pSrc);
                    CComBSTR(const CComBSTR& src);
                    CComBSTR& operator=(const CComBSTR& src);
                    CComBSTR& operator=(LPCOLESTR pSrc);
                    ~CComBSTR();
                    unsigned int Length() const;
                    operator BSTR() const;
                    BSTR* operator&();
                    BSTR Copy() const;
                    void Attach(BSTR src);
                    BSTR Detach();
                    void Empty();
#if _MSC_VER>1020
                    bool operator!();
#else
                    BOOL operator!();
#endif
                    void Append(const CComBSTR& bstrSrc);
                    void Append(LPCOLESTR lpsz);
                    void AppendBSTR(BSTR p);
                    void Append(LPCOLESTR lpsz, int nLen);
                    CComBSTR& operator+=(const CComBSTR& bstrSrc);
#ifndef OLE2ANSI
                    CComBSTR(LPCSTR pSrc);
                    CComBSTR(int nSize, LPCSTR sz = NULL);
                    CComBSTR& operator=(LPCSTR pSrc);
                    void Append(LPCSTR);
#endif
                    HRESULT WriteToStream(IStream* pStream);
                    HRESULT ReadFromStream(IStream* pStream);
};

The utilization of the class is very straightforward even for the non-ATL experts. Basically the features offered are:

  • encapsulation of the allocation and deallocation procedures within the constructor and destructor;
  • duplication of the contents (through CComBSTR::Copy());
  • possibility to append almost any kind of string to the wrapped BSTR exploiting the overloading feature of C++;
  • support for readable tests for an empty (NULL) string through the overloaded ! operator;
  • basic I/O operations to store the contents of the string to, and retrieve it from, a structured storage stream.
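
To make the first features in the list tangible, here is a minimal sketch of a wrapper written in the same spirit. It is not ATL code: ScopedStr and DupStr are hypothetical names, and plain malloc()/free() stand in for the system string API so that the example stays portable.

```cpp
#include <cassert>
#include <cstdlib>
#include <cwchar>

// Hypothetical allocator standing in for SysAllocString().
inline wchar_t* DupStr(const wchar_t* src)
{
    if (src == nullptr)
        return nullptr;
    const std::size_t n = std::wcslen(src) + 1;
    wchar_t* p = static_cast<wchar_t*>(std::malloc(n * sizeof(wchar_t)));
    if (p != nullptr)
        std::wmemcpy(p, src, n);
    return p;
}

class ScopedStr
{
public:
    ScopedStr() : m_str(nullptr) {}
    explicit ScopedStr(const wchar_t* src) : m_str(DupStr(src)) {}
    ScopedStr(const ScopedStr& other) : m_str(DupStr(other.m_str)) {}
    ScopedStr& operator=(const ScopedStr& other)
    {
        if (this != &other) {
            wchar_t* copy = DupStr(other.m_str);
            std::free(m_str);
            m_str = copy;
        }
        return *this;
    }
    ~ScopedStr() { std::free(m_str); }        // deallocation is automatic

    bool operator!() const { return m_str == nullptr; }
    std::size_t Length() const { return m_str ? std::wcslen(m_str) : 0; }
    wchar_t* Copy() const { return DupStr(m_str); }   // caller frees the copy

    // Take ownership of an existing string, releasing the current one.
    void Attach(wchar_t* src) { std::free(m_str); m_str = src; }
    // Give up ownership; the caller becomes responsible for freeing.
    wchar_t* Detach() { wchar_t* p = m_str; m_str = nullptr; return p; }

private:
    wchar_t* m_str;
};
```

Attach() and Detach() are the escape hatches that matter at interface boundaries: Detach() hands a string to an [out] parameter without copying it, and Attach() adopts a string received from one.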

On the other hand, MFC does not provide any dedicated wrapper class for system strings. All the support is an integral part of the extremely versatile CString class. As shown in the following code snippet borrowed from the class's declaration in afx.h, there are only a couple of methods that specifically generate COM strings:

// OLE BSTR support (use for OLE automation)
BSTR AllocSysString() const;
BSTR SetSysString(BSTR* pbstr) const;

Internally CString::AllocSysString() allocates a new BSTR using the APIs we examined in an earlier section and copies the CString's contents to the newly created system string, which is finally returned to the caller. There is no such function as CString::FreeSysString(), so to deallocate the memory occupied by the returned BSTR you have to call the global API ::SysFreeString(). CString::SetSysString() instead reallocates the BSTR pointed to by the parameter and copies the CString's contents into it. Both methods throw CMemoryException objects in case of memory allocation problems.

Moreover, if you are using Visual C++ 5.0 or higher, you can exploit the Direct To COM proprietary extension which includes, among many other things, a _bstr_t class. The documentation reports that it is defined inside comdef.h, while in reality its declaration resides in comutil.h. The degree of encapsulation and functionality is similar to ATL's CComBSTR, but remember that using the COM compiler support binds you to Visual C++ even more than ATL would do. Probably the most relevant difference between the two implementations is that _bstr_t raises C++ exceptions and thus requires your code to be prepared to catch them, whereas CComBSTR does not. This detail will likely influence your choice more than all the other possible considerations. The following code listing summarizes the public interface of _bstr_t; the comments should make it easy to understand what the diverse method groups are up to:

class _bstr_t {
public:
                    // Constructors
                    //
                    _bstr_t() throw();
                    _bstr_t(const _bstr_t& s) throw();
                    _bstr_t(const char* s) throw(_com_error);
                    _bstr_t(const wchar_t* s) throw(_com_error);
                    _bstr_t(const _variant_t& var) throw(_com_error);
                    _bstr_t(BSTR bstr, bool fCopy) throw(_com_error);
 
     
                    // Destructor
                    //
                    ~_bstr_t() throw();
 
     
                    // Assignment operators
                    //
                    _bstr_t& operator=(const _bstr_t& s) throw();
                    _bstr_t& operator=(const char* s) throw(_com_error);
                    _bstr_t& operator=(const wchar_t* s) throw(_com_error);
                    _bstr_t& operator=(const _variant_t& var) throw(_com_error);
 
     
                    // Operators
                    //
                    _bstr_t& operator+=(const _bstr_t& s) throw(_com_error);
                    _bstr_t operator+(const _bstr_t& s) const throw(_com_error);
 
     
                    // Friend operators
                    //
                    friend _bstr_t operator+(const char* s1, const _bstr_t& s2);
                    friend _bstr_t operator+(const wchar_t* s1, const _bstr_t& s2);
 
     
                    // Extractors
                    //
                    operator const wchar_t*() const throw();
                    operator wchar_t*() const throw();
                    operator const char*() const throw(_com_error);
                    operator char*() const throw(_com_error);
 
     
                    // Comparison operators
                    //
                    bool operator!() const throw();
                    bool operator==(const _bstr_t& str) const throw();
                    bool operator!=(const _bstr_t& str) const throw();
                    bool operator<(const _bstr_t& str) const throw();
                    bool operator>(const _bstr_t& str) const throw();
                    bool operator<=(const _bstr_t& str) const throw();
                    bool operator>=(const _bstr_t& str) const throw();
 
     
                    // Low-level helper functions
                    //
                    BSTR copy() const throw(_com_error);
                    unsigned int length() const throw();
 
     
private:
                    // [...private stuff omitted...]
};

Frameworks and conversions

Your ideas of OLECHAR and BSTR and your understanding of the way COM handles strings should be much clearer now, but we still have to cope with type conversions between these somewhat special data types and the more traditional TCHAR, WCHAR and char.

ATL and MFC both use the same group of macros to deal with string conversions. These macros' names follow a precise convention: the characters before the "2" indicate the original type of the variable to convert, and the characters after the "2" indicate the destination type after the conversion. The following table lists the valid symbols in a conversion macro name:

| Short name | Data type |
|---|---|
| A | LPSTR, char* |
| OLE | LPOLESTR |
| T | LPTSTR, TCHAR* |
| W | LPWSTR, wchar_t* |
| BSTR | BSTR |
| C | const, combined with another type |
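
The naming convention is regular enough to be decoded mechanically. The following sketch is purely illustrative: SymbolToType, DescribeConversion and SelfCheck are hypothetical helpers, not part of ATL or MFC, and they only spell out what a macro name promises without performing any conversion.

```cpp
#include <cassert>
#include <string>

// Maps one symbol from the table to the type it stands for; empty if unknown.
static std::string SymbolToType(const std::string& sym)
{
    if (sym == "A")    return "char*";
    if (sym == "W")    return "wchar_t*";
    if (sym == "T")    return "TCHAR*";
    if (sym == "OLE")  return "OLECHAR*";
    if (sym == "BSTR") return "BSTR";
    return "";
}

// Explains a macro name such as "A2CT" as "char* -> const TCHAR*".
std::string DescribeConversion(const std::string& macroName)
{
    const std::string::size_type pos = macroName.find('2');
    if (pos == std::string::npos)
        return "";
    const std::string from = macroName.substr(0, pos);
    std::string to = macroName.substr(pos + 1);

    // A leading C on the destination marks the result as const.
    bool isConst = false;
    if (to.size() > 1 && to[0] == 'C') {
        isConst = true;
        to.erase(0, 1);
    }

    const std::string src = SymbolToType(from);
    const std::string dst = SymbolToType(to);
    if (src.empty() || dst.empty())
        return "";
    return src + " -> " + (isConst ? "const " : "") + dst;
}

// Small usage example.
void SelfCheck()
{
    assert(DescribeConversion("T2COLE") == "TCHAR* -> const OLECHAR*");
}
```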

The macros operate intelligently: if for some reason the source and destination types coincide, the code does not waste time in a useless conversion. Internally most of the macros call the _alloca() run-time library function and allocate the storage for the new data on the stack, as this simplifies the deallocation policy by delegating it to the scope rules: the converted string disappears when the enclosing function returns, so it must never be returned to the caller. The macros also rely on a few local variables, which is why a USES_CONVERSION macro must be placed before the first conversion in each function or class method that uses them. The following sample code, taken from the downloadable sample pack available on the Web, will clarify the process:

// Conversions through MFC/ATL's macros
// Requires <stdio.h>, <tchar.h> and Atlconv.h (ATL) or Afxconv.h (MFC)
void Sample2()
{
    USES_CONVERSION;    // declares the locals used by the macros below

    // ANSI
    LPCSTR ansiStr = "This is a sample message";
    printf("BEFORE the string contains: %s\n", ansiStr);

    // ANSI -> const TCHAR
    const TCHAR* pTChar = A2CT(ansiStr);
    _tprintf(_T("MIDWAY the string contains: %s\n"), pTChar);

    // const TCHAR -> const Unicode (T2CW keeps the source const-correct)
    LPCWSTR wStr = T2CW(pTChar);
    wprintf(L"AFTER the string contains: %s\n", wStr);
}

As I stated earlier, the conversion macros are part of MFC and ATL, but surprisingly the header files are not directly shared by the two frameworks. ATL programmers should include Atlconv.h, while MFC developers are supposed to include Afxconv.h in their projects. After digging into the sources I found that in the latest versions of MFC, Afxconv.h does little more than include Atlconv.h itself, so in practice the string conversion code exposed by the two frameworks is the same.

The COM compiler support offers good BSTR conversion code, too. The actual conversion functions are the cast operators that convert a _bstr_t to either an ANSI or a Unicode string, either constant or not, plus the omnipresent class constructors. The following code snippet, taken from the downloadable sample pack available on the Web, shows some common usage patterns of BSTR conversions:

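// BSTR conversions
// Requires MFC (CString, VERIFY), the conversion macros and comdef.h (_bstr_t)
void Sample3()
{
    USES_CONVERSION;

    LPCWSTR wStr = L"This is a sample message";
    wprintf(L"BEFORE the string contains: %s\n", wStr);

    // Unicode -> BSTR (allocates a new system string)
    BSTR bstr1 = W2BSTR(wStr);

    // BSTR -> CString (works in both ANSI and Unicode builds)
    CString mfcStr = bstr1;
    _tprintf(_T("MIDWAY the string contains: %s\n"), (LPCTSTR)mfcStr);

    // CString -> BSTR -> _bstr_t (the _bstr_t holds its own copy)
    BSTR bstr2 = mfcStr.AllocSysString();
    _bstr_t bstr3 = bstr2;
    VERIFY(bstr3 == (_bstr_t)bstr1);

    // _bstr_t -> Unicode through the cast operator
    WCHAR* wStr2 = bstr3;
    wprintf(L"AFTER the string contains: %s\n", wStr2);

    // The raw BSTRs are not wrapped, so they must be freed manually
    ::SysFreeString(bstr1);
    ::SysFreeString(bstr2);
}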

This is music to the ears of those who work with more or less advanced frameworks, but what about those who prefer (or are compelled) to stick to low-level C++ COM development? They can still use the standard library conversion functions, such as mbstowcs() and wcstombs(), or the Win32 MultiByteToWideChar() and WideCharToMultiByte() APIs on which the framework macros ultimately rely. Unfortunately such functions are not aware of BSTRs and OLECHARs, so heavy use of conditional compilation would be required to deal with the various possible combinations, and this is evil from a readability perspective. Power developers who program COM at the raw C++ level will probably work out, at a certain point in their learning curve, how to write a set of conversion macros by themselves. If you are lazy and prefer to use a precooked set of macros, you can freely use Don Box's YACL (the acronym stands for "Yet Another COM Library"), which is extremely efficient and does much more than just string conversion in pure C++ COM. The URL for the download is http://www.develop.com/dbox/yacl.htm.
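
For readers who go the standard library route, the following is a minimal sketch of a pair of helpers built on mbstowcs() and wcstombs(). The function names ToWide, ToNarrow and SelfCheck are hypothetical, the conversion follows the current C locale, and an empty result is used to signal failure, which a production version would want to report more explicitly.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <string>
#include <vector>

// Narrow (multibyte, current C locale) -> wide.
std::wstring ToWide(const char* src)
{
    // A multibyte string never yields more wide characters than it has bytes.
    std::vector<wchar_t> buf(std::strlen(src) + 1);
    const std::size_t n = std::mbstowcs(&buf[0], src, buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();          // invalid multibyte sequence
    return std::wstring(&buf[0], n);
}

// Wide -> narrow (multibyte, current C locale).
std::string ToNarrow(const wchar_t* src)
{
    // Each wide character needs at most MB_LEN_MAX bytes.
    std::vector<char> buf(std::wcslen(src) * MB_LEN_MAX + 1);
    const std::size_t n = std::wcstombs(&buf[0], src, buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::string();           // character not representable
    return std::string(&buf[0], n);
}

// Small usage example: a round trip through both helpers.
void SelfCheck()
{
    assert(ToNarrow(ToWide("COM").c_str()) == "COM");
}
```

Returning std::string and std::wstring by value sidesteps the lifetime restrictions of stack-allocated buffers, at the price of a heap allocation per conversion.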

Conclusion

After some study and direct experimentation, the various string types in COM prove to be much less cryptic and problematic than at first they seemed. Fundamentally, it all boils down to recognizing and memorizing a handful of new data types which may behave differently on different platforms, and getting used to the framework of handy conversion functions provided by ATL, MFC, or the Direct To COM Visual C++ extension. Regardless of which of the mentioned tasks you are going to tackle, I hope this article will serve as a valuable aid in saving precious time working with strings.
