
Perl 5.8 and Unicode:
Myths, Facts and Changes

by Dan Kogai

↓"translate"? What the heck is that?

It doesn't make sense!

日本語 2.0

What’s Unicode?

http://www.unicode.org/standard/WhatIsUnicode.html
says

Unicode provides a unique number for every character

no matter what the platform,

no matter what the program,

no matter what the language.

Why Unicode?

Whipuptitude
with
Manipulexity

Because we want easy jobs to be easy without making hard jobs impossible,

no matter what language your jobs are done in.

We wanted perl5 to be a language for getting multilingual jobs done.

English

“Latin” Languages

汉语

にホン語

한굴

漢語

Ελληνικά

Русский

עברית

عربية

You name it.

Perl 5.6 Way

the utf8 pragma was introduced

Regexp OK!
s/小飼\s*弾/Dan Kogai/g

Enough?

No!

You could hardly tell if a scalar was a utf8 string or binary data.

There was no way to transcode from/to other encodings.

Unicode rules as an internal representation (Windows, Mac OS X), while external representations remain "legacy"

Perl 5.6 = a city without bridges

Perl 5.8 Way

The implicit utf8 flag was introduced to scalars

so strings are strings, binaries are binaries

perldoc perluniintro

The Encode module was introduced

(I did)

so you can transcode from/to "legacy" encodings

perldoc Encode

The UTF-8 Flag

On Perl 5.8, only scalars marked as utf8 are treated as such

an unmarked scalar is treated as bytes, even when its contents are valid UTF-8

You can use Encode::is_utf8(), utf8::is_utf8() or Devel::Peek to check the flag.
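
A minimal sketch of checking the flag, assuming Perl 5.8 or later. The helper name describe() is made up for this illustration; utf8::is_utf8() and Encode::decode() are the real calls. The same six octets are reported as bytes before decoding and as two characters afterwards.

```perl
use strict;
use warnings;
use Encode qw(decode);

# describe() is a hypothetical helper: it reports the flag and the length.
sub describe {
    my ($s) = @_;
    return sprintf "flag=%s length=%d",
        (utf8::is_utf8($s) ? "on" : "off"), length $s;
}

# The UTF-8 octets of "小飼", written as escapes so they stay plain bytes.
my $bytes = "\xe5\xb0\x8f\xe9\xa3\xbc";

# decode() turns the octets into characters and sets the flag.
my $chars = decode("UTF-8", $bytes);

print describe($bytes), "\n";    # flag=off length=6
print describe($chars), "\n";    # flag=on length=2
```

Devel::Peek's Dump() shows the same flag as UTF8 in the FLAGS line, along with the internal buffer.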

On perl 5.8, you can use Encode; to turn your data into utf8 and vice versa

Started by Nick Ing-Simmons,

Maintained by Dan Kogai.

decode() then encode()
use strict;
use utf8;
use Encode;
for my $argv (@ARGV){
    open my $fh, "<", $argv or die "$argv : $!";
    while(<$fh>){
        my $utf8 = decode("eucjp", $_);
        $utf8 =~ s{ (?:小飼|こがい|コガイ|Kogai)
                    [\s\x{3000}]* # \s + FULLWIDTH SPACE
                    (?:弾|だん|ダン|Dan)
                  }{Encode Maintainer}gmsx;        
        print encode("eucjp", $utf8);
    }
}
PerlIO and open()
use strict;
use utf8;
use Encode;
for my $argv (@ARGV){
    # the :encoding layer decodes EUC-JP input as it is read
    open my $fh, "<:encoding(eucjp)", $argv
      or die "$argv : $!";
    while(<$fh>){
        # $_ already holds characters, so no decode() is needed
        s{ (?:小飼|こがい|コガイ|Kogai)
           [\s\x{3000}]* # \s + FULLWIDTH SPACE
           (?:弾|だん|ダン|Dan)
         }{空気嫁}gmsx;
        print encode("eucjp", $_);
    }
}
binmode()
use strict;
use utf8;
use Encode;
# the :encoding layer on STDOUT encodes every print back to EUC-JP
binmode(STDOUT, ":encoding(eucjp)");
for my $argv (@ARGV){
    open my $fh, "<:encoding(eucjp)", $argv
      or die "$argv : $!";
    while(<$fh>){
        s{ (?:小飼|こがい|コガイ|Kogai)
           [\s\x{3000}]* # \s + FULLWIDTH SPACE
           (?:弾|だん|ダン|Dan)
         }{404 Replacement Not Found}gmsx;
        print;
    }
}
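
The same layer approach can be tried without any files on disk. This is only a sketch: rewrite_eucjp() is a hypothetical helper, and it assumes in-memory filehandles (Perl 5.8 and later) and a script saved as UTF-8. Both handles carry an :encoding layer, so the loop body sees only characters.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# rewrite_eucjp() is a hypothetical helper: EUC-JP bytes in, EUC-JP bytes out.
sub rewrite_eucjp {
    my ($in_bytes) = @_;
    my $out_bytes = '';
    # Reading decodes from EUC-JP; writing encodes back to EUC-JP.
    open my $in,  "<:encoding(euc-jp)", \$in_bytes  or die "in: $!";
    open my $out, ">:encoding(euc-jp)", \$out_bytes or die "out: $!";
    while (my $line = <$in>) {
        $line =~ s{小飼[\s\x{3000}]*弾}{Dan Kogai}g;
        print {$out} $line;
    }
    close $out or die "close: $!";
    close $in;
    return $out_bytes;
}

my $sample   = encode("euc-jp", "作者は小飼　弾です\n");
my $expected = encode("euc-jp", "作者はDan Kogaiです\n");
print rewrite_eucjp($sample) eq $expected ? "ok\n" : "not ok\n";
```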

Do's and Don'ts

Do not assume your files are in UTF-8.

decode() them before you use them.

encode() back when you are done.

Write your script in UTF-8.

If you can't, store your string literals elsewhere and let Encode handle it.

Or let open(..., "<:encoding(foo)") and binmode() take care of the transcoding.
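
One way to act on "do not assume" is to decode strictly and see whether it fails. A sketch under assumptions: decode_strict() is a hypothetical helper, and it relies on Encode's FB_CROAK check value, which makes decode() die on malformed input instead of substituting a replacement character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# decode_strict() is a hypothetical helper: it returns the decoded characters,
# or undef when the bytes are not valid in the named encoding.
sub decode_strict {
    my ($encoding, $bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    return eval { decode($encoding, $copy, Encode::FB_CROAK()) };
}

my $utf8_bytes = "\xe5\xb0\x8f";    # "小" as UTF-8 octets
my $not_utf8   = "\xbe\xae";        # starts with a continuation byte

print defined decode_strict("UTF-8", $utf8_bytes) ? "valid\n" : "invalid\n";
print defined decode_strict("UTF-8", $not_utf8)   ? "valid\n" : "invalid\n";
```

A successful strict decode does not prove which encoding was intended, since many byte sequences are valid in more than one legacy encoding.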

Remember ASCII is just a part of the world.

So are Unicode and UTF-8.

There is more than one way to do it.

more than one way to spell
駱駝 (camel)

Thank You!

Oh, one more thing.
(à la Steve Jobs)

Jcode and Encode

Jcode.pm was the de facto standard for handling Japanese encodings.

Encode.pm is the de jure standard for handling any encoding.

Encode.pm can handle everything Jcode.pm could and much more.

I thought that was enough.

But their APIs were different.
print encode('euc-jp', decode('shiftjis', $bytes));
# vs
print Jcode->new($bytes, 'sjis')->euc;

People kept using Jcode.pm in spite of Encode.pm

What do we need?

Good Wrapper!

But I was procrastinating.

Then came JEncode.pm.

Sorry, I was not impatient enough.

But I still had my laziness to come to the rescue.

Jcode 2.0

Jcode 0.8x for Perl 5.6.x and below

Encode wrapper for Perl 5.8 and above.

You can keep doing this, as you always have.
print Jcode->new($bytes, 'sjis')->euc;
But now you can do this, too!
print Jcode->new($bytes, 'big5')
      ->fallback(Encode::FB_XMLCREF)->eucjp;
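
What "an Encode wrapper" with a Jcode-style chained API might look like, as a rough sketch only. MiniJcode and its as() method are made up for this illustration and are not how Jcode 2.0 is actually implemented; only Encode::decode(), Encode::encode() and the FB_* check values are real.

```perl
package MiniJcode;    # hypothetical name, for illustration only
use strict;
use warnings;
use Encode ();

# new($bytes, $from): decode the input once and keep characters inside.
sub new {
    my ($class, $bytes, $from) = @_;
    return bless {
        chars    => Encode::decode($from, $bytes),
        fallback => Encode::FB_DEFAULT(),
    }, $class;
}

# fallback($check): choose how unmappable characters are handled; chainable.
sub fallback {
    my ($self, $check) = @_;
    $self->{fallback} = $check;
    return $self;
}

# as($to): encode the stored characters into any encoding Encode knows.
sub as {
    my ($self, $to) = @_;
    my $copy = $self->{chars};    # encode() with a CHECK value may modify its argument
    return Encode::encode($to, $copy, $self->{fallback});
}

# Jcode-style shortcuts for the common Japanese encodings.
sub euc  { $_[0]->as('euc-jp') }
sub sjis { $_[0]->as('shiftjis') }

package main;

my $sjis = Encode::encode('shiftjis', "\x{5c0f}\x{98fc}");    # "小飼" as Shift_JIS bytes
my $euc  = MiniJcode->new($sjis, 'shiftjis')->euc;
print $euc eq Encode::encode('euc-jp', "\x{5c0f}\x{98fc}") ? "ok\n" : "not ok\n";
```

Keeping characters inside the object and encoding only on the way out is the decode() then encode() rule, hidden behind the old interface.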

And they lived happily ever after.

By the way.

In Perl 6, everything is an object.
"There is more than one way to do it".say;
Then why not this?
"やり方は一つじゃない".shiftjis.say;
Or this?
qq(Do what I mean!)
  .translate(from => 'english', 
             to   => 'japanese',
             via  => 'http://translate.livedoor.com/'
            ).eucjp(fallback => PERLQQ).eval;

Well, let me think about it.

Thank You!

Questions?