[perl]Wide character in print

binmode DATA, ":utf8";


----------------------------------------

Unicode-processing issues in Perl and how to cope with them

Perl 5.8+ has comprehensive support for Unicode and a wide range of text encodings.  Still, many people run into problems when processing multi-language text.  Here I explain the most common problems and offer solutions.

An older version of this article is available.  It is not as well structured, but provides some additional Unicode-related details for Perl 5.6.1.

You can read this piece and dive into all the technical details and idiosyncrasies of perl and unicode.  Or you can hire me to fix your code.

A bunch of perldoc manpages outline and explain Perl’s Unicode support: perluniintro, perlunicode, the Encode module, the binmode() function.  And the list is not complete.  The major problem with this documentation is its volume.  Most programmers don’t even have to read it all, because to start working with Unicode you just need to know some basic facts and rules.

I have experienced several kinds of trouble with Unicode in Perl, in several projects.  The two main problems I’ve seen are:

  • UTF-8 data getting double-encoded or other-encoding data getting mangled
  • “Wide character in print” warning

These two problems are closely related and are often solved by similar moves.

Reading or at least browsing through the related manpages is still a good way to understand and solve your Unicode problems.  If you don’t have time for that now, read on.

The problem showcase: the example

Imagine two simple variables with Unicode text in them, and you print those variables to standard output.  What could be easier?

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
my $ustring2 = <DATA>;

print "$ustring1$ustring2";
__DATA__
Hello ☺!


Both variables here contain the same data: the string "Hello " followed by the Unicode character WHITE SMILING FACE (U+263A), an exclamation mark and a newline character.  The __DATA__ part ($ustring2) is UTF-8 encoded.

But when we print them, the first one comes out fine and the second one comes out garbled.  This is because perl knows that the first string is a Unicode string and stores it internally in UTF-8.  But it doesn’t know the encoding of the second.  When it builds a bigger string for printing, it re-encodes the second into UTF-8, wrongly.

In addition, it prints a warning: Wide character in print at unitest1.pl line 6, <DATA> line 1.  We’ll look at it later, after we fix our output.

You could apparently fix things by avoiding concatenation:

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
my $ustring2 = <DATA>;

print $ustring1, $ustring2;
__DATA__
Hello ☺!


But this is not a solution.  Sometimes you simply can’t avoid concatenation; it is such a basic operation.  In addition, it is error-prone and not future-proof.

Why the problem happens

First, some basic facts.

There is a distinction between bytes and characters.  Characters are Unicode characters.  One character may be represented by several bytes when stored, printed or sent over the network.  That depends on the particular encoding used.  UTF-8 is just one of the ways to represent Unicode data.
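A minimal sketch of the difference, assuming only the core Encode module (the variable names are mine): the same eight characters take a different number of bytes depending on the encoding chosen.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# A character string: "Hello ", U+263A and "!" make 8 characters.
my $chars = "Hello \x{263A}!";

# The same text as UTF-8 and as UTF-16BE byte strings.
my $utf8_bytes  = encode( 'UTF-8',    $chars );
my $utf16_bytes = encode( 'UTF-16BE', $chars );

my $n_chars = length $chars;          # counts characters
my $n_utf8  = length $utf8_bytes;     # counts bytes
my $n_utf16 = length $utf16_bytes;    # counts bytes

# prints "8 characters, 10 bytes in UTF-8, 16 bytes in UTF-16BE"
printf "%d characters, %d bytes in UTF-8, %d bytes in UTF-16BE\n",
    $n_chars, $n_utf8, $n_utf16;
```

length() counts whatever the scalar holds: characters for a decoded string, bytes for an encoded one.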

Perl has a “utf8” flag for every scalar value, which may be “on” or “off”. “On” state of the flag tells perl to treat the value as a string of Unicode characters.

If you take a string with utf8 flag off and concatenate it with a string that has utf8 flag on, perl converts the first one to Unicode.

This may sound okay and obvious.  But then you think: how?  Perl would need to know the encoding of the string data before converting it.  It doesn’t know, so it guesses.  And this is the usual source of problems.

The algorithm perl uses when guessing is documented (by default it treats every byte as one ISO-8859-1 character, and your locale may come into play), but my firm suggestion is: never let perl do that.  Otherwise, there is a BIG chance that you’ll get double-encoded UTF-8 strings, or otherwise mangled data.
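Here is a small sketch of how the double encoding comes about.  It assumes perl’s default behaviour (no locale or encoding pragmas in effect), and the variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Raw UTF-8 bytes for U+263A, as they would arrive from a file or a socket.
my $bytes = "\xE2\x98\xBA";    # utf8 flag off, 3 bytes
my $chars = "\x{263A}";        # utf8 flag on, 1 character

# Concatenation upgrades $bytes, reading each byte as one Latin-1 character.
my $joined     = $chars . $bytes;
my $joined_len = length $joined;    # 1 + 3 = 4 characters, not 2

# Encoding the result for output gives double-encoded UTF-8:
# 9 bytes instead of the 6 you would expect for two smileys.
my $out     = encode( 'UTF-8', $joined );
my $out_len = length $out;

printf "%d characters, %d bytes on output\n", $joined_len, $out_len;
```

The first smiley survives; the second turns into three unrelated Latin-1 characters, each of which is then encoded as two bytes.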

The solution: always make data encoding explicit, both for your input and output.

Solution #1: Convert string to Unicode

One solution is to tell perl that $ustring2 contains Unicode data in UTF-8 encoding.  There are a couple of ways to do that; the orthodox way is through Encode’s decode_utf8() function:

#!/usr/bin/perl

use Encode;
my $ustring1 = "Hello \x{263A}!\n";  
my $ustring2 = <DATA>;
$ustring2 = decode_utf8( $ustring2 );

print "$ustring1$ustring2";
__DATA__
Hello ☺!


In this simple case either way would do the job, but it may get quite tedious if your inputs are plentiful.  And it still prints the “Wide character” warning.

But this is what you should always do for the international data you get from other modules, for instance from databases.

You should not forget, though, that not every sequence of bytes is valid UTF-8.  So the decode_utf8() operation may fail.  See the Encode perldoc for error handling details.
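One possible way to handle such failures is sketched below.  It uses Encode’s decode() with the FB_CROAK check, so malformed input raises an exception instead of being silently patched up; the helper name decode_utf8_strict is hypothetical, not part of Encode.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded string, or undef if $bytes
# is not well-formed UTF-8.
sub decode_utf8_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $bytes in place.
    my $text = eval {
        Encode::decode( 'UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $text;    # undef when decode() died
}

my $good = decode_utf8_strict("Hello \xE2\x98\xBA!");
my $bad  = decode_utf8_strict("Hello \xE2\x98!");    # truncated sequence

print defined $good ? "good input decoded\n" : "good input rejected\n";
print defined $bad  ? "bad input decoded\n"  : "bad input rejected\n";
```

Whether to reject bad input, as here, or to substitute a replacement character depends on where the data comes from.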

Another way to let perl accept the UTF-8 data as such is the pack "U0C*", unpack "C*" hack.

If you get data in another encoding (not UTF-8), convert it to Unicode explicitly.  Again, use the Encode module and its decode() function:

require Encode;
my $ustring = Encode::decode( 'iso-8859-1', $input );

Another example: UTF-8 data from CGI

In ACIS we produce HTML pages in UTF-8.  We expect the HTML form input to be UTF-8 as well.  To manipulate it, we tell perl about the encoding:

require Encode;
require CGI;
my $query = CGI->new;
my $form_input = {};
foreach my $name ( $query->param ) {
  my @val = $query->param( $name );
  foreach ( @val ) {
    $_ = Encode::decode_utf8( $_ );
  }
  $name = Encode::decode_utf8( $name );
  if ( scalar @val == 1 ) {
    $form_input->{$name} = $val[0];
  } else {
    $form_input->{$name} = \@val;  # save value as an array ref
  }
}

This builds a ready- and safe-to-use hash of input parameters.

Solution #2: Specify IO encoding layers for your filehandles

In Perl 5.8 a filehandle can have an encoding specified for it.  Perl then will convert all input from the file automatically into its internal Unicode encoding.  It will mark the values read from it accordingly with the utf8 flag.  Equally, perl can convert output to a specific encoding for a filehandle.  Additionally, perl checks that the data you output is valid for the filehandle’s encoding.

So, if you read data from a file or another input stream, and you expect UTF-8 data there, tell perl so:

if ( open( FILE, "<:utf8", $fname ) ) {
  . . . 
}

or, in case of our simple test,

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
binmode DATA, ":utf8";
my $ustring2 = <DATA>;

print "$ustring1$ustring2";
__DATA__
Hello ☺!


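This should print two equal lines.  The “Wide character” warning is still there, though, because standard output has no encoding layer yet; we deal with it in the warning section below.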

Similarly, if you open a file as:

open FILE, "<:encoding(iso-8859-7)", $filename;

its content will be assumed to be in ISO-8859-7 encoding.  Perl will use that to interpret the file’s data correctly, i.e. to convert it to the internal UTF-8.
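The following sketch shows both directions at once.  To stay self-contained it uses in-memory filehandles in place of real files, which is my choice for the example; the layers behave the same way on files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two Greek letters (alpha, beta) as ISO-8859-7 bytes, standing in for a file.
my $greek_bytes = "\xE1\xE2\n";

# Reading through the layer yields characters, not bytes.
open( my $in, '<:encoding(iso-8859-7)', \$greek_bytes ) or die "open: $!";
my $line = <$in>;
close $in;

# Writing through a UTF-8 layer turns the characters back into bytes.
my $utf8_buffer = '';
open( my $out, '>:encoding(UTF-8)', \$utf8_buffer ) or die "open: $!";
print {$out} $line;
close $out;

printf "read %d characters, wrote %d bytes\n",
    length $line, length $utf8_buffer;
```

Between the two handles $line is a plain character string (U+03B1, U+03B2 and a newline), so it can be concatenated with other Unicode strings safely.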

Solution #3: Global Unicode setting in Perl

And there is yet another way to approach your coding/encoding problems.  It is to command perl to treat your program’s input and output as UTF-8 by default.  -C is a perl switch which lets you do that.  Just put -CS on the perl command line.

Alternatively, use the PERL_UNICODE environment variable.  It has to be set in the environment where you execute perl, for instance:

god@world:~$ PERL_UNICODE=S perl script.pl

This commands perl to assume UTF-8 on the standard input, output and error filehandles.  To make UTF-8 the default for the filehandles opened in your script and in the modules it uses as well, add D, as in PERL_UNICODE=SD.  (Unfortunately, and contrary to my expectations, this does not have an impact on the special DATA filehandle.  So this is not a solution to our problem showcase script.)

You can also specify UTF-8-ness for just your stdin or just stdout or just stderr.  Read the section on -C in perlrun for full details.
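To check which of these settings are actually in effect, a script can look at the ${^UNICODE} variable, which reflects the -C value.  The sketch below decodes it with the bit values given in perlrun; the helper unicode_flags and the table are mine, and only the most common letters are covered.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Bit values of the -C / PERL_UNICODE letters, as listed in perlrun.
# S is shorthand for I+O+E (7), D for i+o (24).
my @flag_bits = (
    [ 'I', 1 ],     # STDIN is UTF-8
    [ 'O', 2 ],     # STDOUT is UTF-8
    [ 'E', 4 ],     # STDERR is UTF-8
    [ 'i', 8 ],     # UTF-8 is the default layer for input handles
    [ 'o', 16 ],    # UTF-8 is the default layer for output handles
    [ 'A', 32 ],    # @ARGV elements are UTF-8
);

# Hypothetical helper: list the letters set in a -C value.
sub unicode_flags {
    my ($value) = @_;
    return join '', map { $_->[0] } grep { $value & $_->[1] } @flag_bits;
}

printf "active -C flags: %s\n", unicode_flags( ${^UNICODE} ) || '(none)';
```

Run it once plainly and once with PERL_UNICODE=S set to see the difference.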

Wide character in print warning

The warning happens when you output a Unicode string to a non-Unicode filehandle.  What is a “non-Unicode filehandle”, you ask?  It is one with no Unicode-compatible IO layer on it (see the Solution #2 section above).

The right way to fix this is to specify the output encoding explicitly, with the binmode() function or in your open() call.  For example, open your file this way:

open FILE, ">:utf8", $filename;
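A small sketch of the warning and its fix side by side.  It writes to in-memory filehandles instead of real files and collects warnings in an array; both are devices for the example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $smiley = "\x{263A}\n";

# Handle without an encoding layer: perl warns, then writes UTF-8 bytes anyway.
my $plain_buf = '';
open( my $plain, '>', \$plain_buf ) or die "open: $!";
print {$plain} $smiley;
close $plain;

# Handle with a :utf8 layer: the same bytes, and no warning.
my $layered_buf = '';
open( my $layered, '>:utf8', \$layered_buf ) or die "open: $!";
print {$layered} $smiley;
close $layered;

printf "%d warning(s) collected\n", scalar @warnings;
```

Only the first print warns.  The bytes written are the same in both cases, which is why the warning is easy to ignore until data starts getting mangled elsewhere.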

To print UTF-8 to standard output (or standard error), as in our case, we do:

#!/usr/bin/perl

my $ustring1 = "Hello \x{263A}!\n";  
binmode DATA, ":utf8";
my $ustring2 = <DATA>;
binmode STDOUT, ":utf8";
print "$ustring1$ustring2";
__DATA__
Hello ☺!


The wrong way to avoid the warning is to turn off the utf8 flag on your to-be-printed data.  Then the characters will turn into bytes, and perl will push them to a bytes-filehandle smoothly.  But you don’t need that, really.

On the other hand, if you open a file as:

open FILE, ">:encoding(iso-8859-7)", $filename;

the stuff you print will be output in ISO-8859-7 encoding, transcoded automatically.  ISO-8859-7 is not a Unicode-compatible charset: it covers only a small set of characters, so you won’t be able to output arbitrary Unicode characters through it without a warning.

The right strategy: summary

If you can, use a Unicode encoding (such as UTF-8) to store and process your data.  Always make sure perl knows which encoding your data comes in and goes out in.  Make sure all your Unicode-containing scalars have the utf8 flag on.  Then you can safely concatenate strings.  Then you can use Unicode-related regular expressions, which give you great powers for international (multi-language) text processing.

To achieve that, you may need to know all the ways data gets into your program.  As soon as you get some input, mark it as Unicode or convert it to Unicode and sleep well.

Sometimes data comes into your program already in Unicode and you shouldn’t worry.  For instance, XML parsers return string values with the utf8 flag “on”.  (Unless you do something weird, like getting it in original form from the parser, which you shouldn’t do anyway.)  In the above example we explicitly include a Unicode character in a string ($ustring1) and perl knows its encoding.

But when you read data from input streams, from a database or from environment variables (like parameters in CGI), you need to tell perl about its encoding.

Use the PERL_UNICODE environment variable to force UTF-8 IO layers on your input and/or output filehandles.
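As an illustration of the Unicode-related regular expressions mentioned in the summary, here is a minimal sketch.  The sample text is made up; \p{L} and \p{Greek} are standard Unicode properties supported by perl’s regex engine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "caf" + e-acute, three Greek letters and a number, as a character string.
my $text = "caf\x{E9} \x{3B1}\x{3B2}\x{3B3} 42";

# \p{L} matches any Unicode letter, whatever the script.
my @words = $text =~ /(\p{L}+)/g;

# \p{Greek} matches characters of the Greek script only.
my ($greek) = $text =~ /(\p{Greek}+)/;

printf "%d words found, Greek word is %d characters long\n",
    scalar @words, length $greek;
```

This works only on decoded character strings; on raw UTF-8 bytes the same patterns would match individual bytes and give nonsense.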



Reposted from: http://ahinea.com/en/tech/perl-unicode-struggle.html
