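Many GIS enthusiasts and ECDIS developers rely on ISO8211lib when reading S57 data files ("ISO8211lib is a C++ library for reading ISO8211-formatted files, such as SDTS and S-57 format"). In S57 data, the NATF (national attribute) field is encoded as two-byte Unicode, which means NATF is the only field whose parsing has to deal with two-byte data. NATF is also a variable-length field, and this is where the trouble starts. The ISO8211lib function
"DDFSubfieldDefn::GetDataLength( const char * pachSourceData, int nMaxBytes, int * pnConsumedBytes )"
originally considers only the single-byte delimiters UT = 31 and FT = 30 when measuring data length, and treats two-byte data as a kind of malformed data. See the comment in the library source: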
/* We only check for the field terminator because of some buggy
* datasets with missing format terminators. However, we have found
* the field terminator is a legal character within the fields of
* some extended datasets (such as JP34NC94.000). So we don't check
* for the field terminator if the field appears to be multi-byte
* which we established by the first character being out of the
* ASCII printable range (32-127).
*/
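As a result the measured field length can be wrong: an individual byte inside a two-byte string may well coincide with a delimiter, so the length comes out too short and characters are dropped. S57 specifies the two-byte unit terminator as (0/0)(1/15) and the two-byte field terminator as (0/0)(1/14) (see S57 3.10), that is, the values 00 1F and 00 1E. When comparing bytes, however, you also have to take byte order into account. My environment is Windows, which is little-endian, so in the byte stream the low byte comes first and the terminators appear as 1F 00 and 1E 00. The length-measuring function therefore needs to be modified. Here is the original: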
int DDFSubfieldDefn::GetDataLength( const char * pachSourceData,
                                    int nMaxBytes, int * pnConsumedBytes )
{
    if( !bIsVariable )  // fixed-length subfield
    {
        if( nFormatWidth > nMaxBytes )
        {
            CPLError( CE_Warning, CPLE_AppDefined,
                      "Only %d bytes available for subfield %s with\n"
                      "format string %s ... returning shortened data.",
                      nMaxBytes, pszName, pszFormatString );
            if( pnConsumedBytes != NULL )
                *pnConsumedBytes = nMaxBytes;
            return nMaxBytes;
        }
        else
        {
            if( pnConsumedBytes != NULL )
                *pnConsumedBytes = nFormatWidth;
            return nFormatWidth;
        }
    }
    else  // variable-length subfield
    {
        int nLength = 0;
        int bCheckFieldTerminator = TRUE;
        /* We only check for the field terminator because of some buggy
         * datasets with missing format terminators. However, we have found
         * the field terminator is a legal character within the fields of
         * some extended datasets (such as JP34NC94.000). So we don't check
         * for the field terminator if the field appears to be multi-byte
         * which we established by the first character being out of the
         * ASCII printable range (32-127).
         */
        // If the first byte is not a printable character, assume the data
        // is two-byte text and do not check for the field terminator.
        if( pachSourceData[0] < 32 || pachSourceData[0] >= 127 )
            bCheckFieldTerminator = FALSE;
        while( nLength < nMaxBytes
               && pachSourceData[nLength] != chFormatDelimeter )
        {
            if( bCheckFieldTerminator
                && pachSourceData[nLength] == DDF_FIELD_TERMINATOR )
                break;
            nLength++;
        }
        if( pnConsumedBytes != NULL )
        {
            if( nMaxBytes == 0 )
                *pnConsumedBytes = nLength;
            else
                *pnConsumedBytes = nLength+1;
        }
        return nLength;
    }
}
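To make the dropped-character effect concrete, here is a minimal sketch of what a byte-wise scan does to two-byte text. The names `kSample` and `ScanByteWise` are hypothetical and not part of ISO8211lib, and the sample assumes the low byte of each code unit is stored first.

```cpp
#include <cstddef>

// Three two-byte code units followed by the two-byte unit terminator,
// stored low byte first (an assumption about the data's byte order).
static const unsigned char kSample[] = {
    0x2D, 0x4E,  // U+4E2D
    0x1F, 0x4E,  // U+4E1F: its low byte equals the single-byte UT (0x1F)
    0x87, 0x65,  // U+6587
    0x1F, 0x00   // two-byte unit terminator
};

// Hypothetical helper that mimics a single-byte delimiter scan:
// it stops at the first byte equal to 0x1F, wherever that byte sits.
std::size_t ScanByteWise(const unsigned char* data, std::size_t nBytes)
{
    std::size_t i = 0;
    while (i < nBytes && data[i] != 0x1F)
        ++i;
    return i;
}
```

On this sample the scan stops after 2 bytes, although the text is 6 bytes long, so the second and third characters are lost.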
I believe it can be fixed as follows (the earlier part of the function is unchanged):
        if( pachSourceData[0] < 32 || pachSourceData[0] >= 127 )
            bCheckFieldTerminator = FALSE;  // looks like two-byte data
        while( nLength < nMaxBytes )
        {
            if( bCheckFieldTerminator )
            {
                // Single-byte data: stop at the unit or field terminator.
                if( pachSourceData[nLength] == chFormatDelimeter
                    || pachSourceData[nLength] == DDF_FIELD_TERMINATOR )
                    break;
            }
            else
            {
                // Two-byte data: a terminator byte only counts when the
                // next byte is 0x00. Check the bound before reading it.
                if( nLength + 1 < nMaxBytes
                    && ( pachSourceData[nLength] == chFormatDelimeter
                         || pachSourceData[nLength] == DDF_FIELD_TERMINATOR )
                    && pachSourceData[nLength+1] == 0 )
                    break;
            }
            nLength++;
        }
        if( pnConsumedBytes != NULL )
        {
            if( nMaxBytes == 0 )
                *pnConsumedBytes = nLength;
            else
                *pnConsumedBytes = nLength+1;
        }
        return nLength;
    }
}
In practice this solved the problem well.
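For readers who want to try the idea outside the library, the following is a standalone sketch rather than a patch to ISO8211lib. `ScanResult` and `MeasureVariableSubfield` are hypothetical names. It makes three assumptions of its own: the caller already knows whether the text is two-byte (for example from the NALL subfield of the DSSI field) instead of guessing from the first byte, two-byte text is stored low byte first, and both bytes of a two-byte terminator are reported as consumed. In two-byte mode it also advances two bytes at a time, so the high byte of one character followed by the low byte of the next cannot be mistaken for a terminator.

```cpp
#include <cstddef>

// Result of measuring one variable-length subfield.
struct ScanResult {
    std::size_t dataLength;  // bytes of data before the terminator
    std::size_t consumed;    // bytes of data plus the terminator itself
};

// Hypothetical standalone function, not part of ISO8211lib.
// twoByte == true means the text uses two-byte code units, low byte first.
ScanResult MeasureVariableSubfield(const char* data, std::size_t maxBytes,
                                   bool twoByte)
{
    const char UT = 0x1F;  // unit terminator
    const char FT = 0x1E;  // field terminator
    const std::size_t step = twoByte ? 2 : 1;

    std::size_t n = 0;
    while (n + step <= maxBytes) {
        const bool isTerminator =
            (data[n] == UT || data[n] == FT) &&
            (!twoByte || data[n + 1] == 0);
        if (isTerminator) {
            ScanResult found = { n, n + step };
            return found;
        }
        n += step;
    }

    // No terminator found: treat everything that is available as data.
    ScanResult all = { maxBytes, maxBytes };
    return all;
}
```

With the bytes 2D 4E 1F 4E 1F 00 and `twoByte` set, this reports 4 data bytes and 6 consumed bytes, because the 1F inside the second character is followed by 4E rather than 00.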
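The second problem: garbled Chinese characters
In later processing, Chinese-language operating systems generally represent Chinese text in GBK or its superset GB18030, so the NATF text has to be converted to such an encoding before it can be displayed.
The system conversion function can be used directly: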
int len = WideCharToMultiByte(54936,0,(LPCWSTR )tmpStr,m_strLength/2,tmpStr1,m_strLength,0,0);
Here 54936 is the code page identifier for GB18030.
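A slightly more defensive way to do the same conversion is sketched below. `DecodeUcs2LE` and `ToGB18030` are hypothetical helper names. The first one assembles code units from byte pairs instead of casting the raw buffer, on the assumption that the data is stored low byte first. The second one, compiled only on Windows, calls `WideCharToMultiByte` twice: once to ask for the required size and once to convert, since a GB18030 character can take up to four bytes and a fixed-size output buffer may be too small.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: build wide characters from two-byte code units
// stored low byte first. A trailing odd byte is ignored.
std::wstring DecodeUcs2LE(const char* data, std::size_t nBytes)
{
    std::wstring out;
    for (std::size_t i = 0; i + 1 < nBytes; i += 2) {
        const unsigned lo = static_cast<unsigned char>(data[i]);
        const unsigned hi = static_cast<unsigned char>(data[i + 1]);
        out.push_back(static_cast<wchar_t>(lo | (hi << 8)));
    }
    return out;
}

#ifdef _WIN32
// Hypothetical helper: convert wide text to GB18030 (code page 54936).
// Returns an empty string if the conversion fails.
std::string ToGB18030(const std::wstring& text)
{
    if (text.empty())
        return std::string();
    const int count = static_cast<int>(text.size());
    // First call: query how many bytes the result needs.
    const int needed = WideCharToMultiByte(54936, 0, text.data(), count,
                                           NULL, 0, NULL, NULL);
    if (needed <= 0)
        return std::string();
    std::string out(static_cast<std::size_t>(needed), '\0');
    // Second call: perform the conversion into the sized buffer.
    const int written = WideCharToMultiByte(54936, 0, text.data(), count,
                                            &out[0], needed, NULL, NULL);
    out.resize(written > 0 ? static_cast<std::size_t>(written) : 0);
    return out;
}
#endif
```

A caller would pass the subfield bytes measured earlier to `DecodeUcs2LE` and hand the result to `ToGB18030` for display.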
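In fact, I found the first problem while working on the garbled-character problem. At first I only thought about converting the character encoding, and it was only after the second problem was solved that I noticed the dropped characters, quite by chance. The same problem also exists in some commercial ECDIS products.
These are only my own limited observations, and corrections from the experts are welcome.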