by Dan Kogai
↓"translate"? What the heck is that?
It doesn't make sense!
日本語 2.0 (Japanese 2.0)
Unicode provides a unique number for every character
no matter what the platform,
no matter what the program,
no matter what the language.
Whipuptitude
with
Manipulexity
Because easy jobs should be easy, without making hard jobs impossible,
no matter what language your jobs are done in.
We wanted perl5 to be a language for getting multilingual jobs done.
English
“Latin” Languages
汉语
にホン語
한글
漢語
Ελληνικά
Русский
עברית
عربية
You name it.
In Perl 5.6, the utf8 pragma was introduced
Regexp OK!
s/小飼\s*弾/Dan Kogai/g
Enough?
No!
You could hardly tell if a scalar was a utf8 string or binary data.
There was no way to transcode from/to other encodings.
Unicode rules -- as an internal representation (Windows, MacOS X) while external representations remain "legacy"
Perl 5.6 = a city without bridges
The implicit utf8 flag was introduced to scalars
so strings are strings, binaries are binaries
perldoc perluniintro
The Encode module was introduced
(I did)
so you can transcode from/to "legacy" encodings
perldoc Encode
On perl 5.8, only scalars marked as utf8 are treated as such;
an unmarked scalar is treated as bytes even when it holds valid UTF-8.
You can use Encode::is_utf8(), utf8::is_utf8() or Devel::Peek to check the flag.
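A minimal sketch of checking the flag, assuming a pure-ASCII script (no "use utf8") so that the byte string stays unflagged. The variable names are illustrative only. The same three bytes of valid UTF-8 count as three bytes before decode() and as one character after it.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# "\xE5\xB0\x8F" is the UTF-8 byte sequence for U+5C0F.
my $bytes = "\xE5\xB0\x8F";
my $chars = decode("UTF-8", $bytes);    # character string

# Same data, with and without the utf8 flag.
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;
my $chars_flag = is_utf8($chars)       ? 1 : 0;    # Encode::is_utf8

printf "bytes: flag=%d length=%d\n", $bytes_flag, length $bytes;
printf "chars: flag=%d length=%d\n", $chars_flag, length $chars;
```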
On perl 5.8, you can use Encode; to turn your data into utf8 and vice versa
Started by Nick Ing-Simmons,
Maintained by Dan Kogai.
use strict;
use utf8;    # this script itself is written in UTF-8
use Encode;
for my $argv (@ARGV){
    open my $fh, "<", $argv or die "$argv : $!";
    while(<$fh>){
        # bytes in, characters inside, bytes out
        my $utf8 = decode("eucjp", $_);
        $utf8 =~ s{ (?:小飼|こがい|コガイ|Kogai)
                    [\s\x{3000}]* # \s + U+3000 IDEOGRAPHIC SPACE
                    (?:弾|だん|ダン|Dan)
                  }{Encode Maintainer}gmsx;
        print encode("eucjp", $utf8);
    }
}
use strict;
use utf8;
use Encode;
for my $argv (@ARGV){
    open my $fh, "<:encoding(eucjp)", $argv
        or die "$argv : $!";
    while(<$fh>){
        # $_ is already decoded by the :encoding layer
        s{ (?:小飼|こがい|コガイ|Kogai)
           [\s\x{3000}]* # \s + U+3000 IDEOGRAPHIC SPACE
           (?:弾|だん|ダン|Dan)
         }{空気嫁}gmsx;
        print encode("eucjp", $_);
    }
}
use strict;
use utf8;
use Encode;
binmode(STDOUT, ":encoding(eucjp)");
for my $argv (@ARGV){
    open my $fh, "<:encoding(eucjp)", $argv
        or die "$argv : $!";
    while(<$fh>){
        # layers decode on input and encode on output
        s{ (?:小飼|こがい|コガイ|Kogai)
           [\s\x{3000}]* # \s + U+3000 IDEOGRAPHIC SPACE
           (?:弾|だん|ダン|Dan)
         }{404 Replacement Not Found}gmsx;
        print;
    }
}
Do not assume your files are in UTF-8.
decode() before you use it.
encode() back when you are done.
Write your script in UTF-8.
If you can't, store your string literals elsewhere and let Encode handle it.
Or let open(..., "<:encoding(foo)") and binmode() take care of the transcoding.
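The advice above can be sketched as a round trip through a temporary file. This is only an illustration: the temp file stands in for a real "legacy" file, the name is spelled with \x{...} escapes so the script stays pure ASCII, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Three kanji and a space, written as escapes (no "use utf8" needed).
my $name = "\x{5C0F}\x{98FC} \x{5F3E}";

# Write it out as EUC-JP through a PerlIO layer.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ":encoding(euc-jp)");
print {$out} "$name\n";
close $out or die "close: $!";

# Way 1: read raw bytes, then decode() by hand.
open my $raw, "<:raw", $path or die "$path : $!";
my $octets = <$raw>;
close $raw;
my $by_hand = decode("euc-jp", $octets);

# Way 2: let the :encoding layer decode while reading.
open my $in, "<:encoding(euc-jp)", $path or die "$path : $!";
my $by_layer = <$in>;
close $in;

chomp($by_hand, $by_layer);
my $same = ($by_hand eq $name && $by_layer eq $name) ? 1 : 0;
print "round trip ok: $same\n";
```

Either way, the program only ever works on characters; bytes exist only at the edges.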
Remember ASCII is just a part of the world.
So are Unicode and UTF-8.
There is more than one way to do it.
more than one way to spell
駱駝 (camel)
Thank You!
Oh, one more thing.
(à la Steve Jobs)
Jcode.pm was the de facto standard for handling Japanese encodings.
Encode.pm is the de jure standard for handling any encoding.
Encode.pm can handle everything Jcode.pm could and much more.
I thought that was enough.
print encode('euc-jp', decode('shiftjis', $bytes));
# vs
print Jcode->new($bytes, 'sjis')->euc;
People kept using Jcode.pm in spite of Encode.pm
What do we need?
Good Wrapper!
But I was procrastinating.
Then came JEncode.pm.
Sorry, I was not impatient enough.
But I still had my laziness to come to the rescue.
Jcode 2.0
Jcode 0.8x for Perl 5.6.x and below
Encode wrapper for Perl 5.8 and above.
print Jcode->new($bytes, 'sjis')->euc;
print Jcode->new($bytes, 'big5')
->fallback(Encode::FB_XMLCREF)->eucjp;
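For comparison, a rough sketch of what such a fallback does, written against Encode directly rather than through Jcode (so it runs without Jcode installed). The Hangul sample text and variable names are assumptions made for the example. Characters that EUC-JP cannot represent are replaced by an escape instead of being lost.

```perl
use strict;
use warnings;
use Encode ();

# Hangul (U+D55C U+AE00) has no mapping in EUC-JP.
my $text = "Hangul: \x{D55C}\x{AE00}";

# Pass copies: with a CHECK argument, encode() may modify its source in place.
my $xml = Encode::encode("euc-jp", my $copy1 = $text, Encode::FB_XMLCREF());
my $pqq = Encode::encode("euc-jp", my $copy2 = $text, Encode::FB_PERLQQ());

print "$xml\n";    # unmappable characters become XML character references
print "$pqq\n";    # ... or Perl-style \x{...} escapes
```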
And they lived happily ever after.
By the way.
"There is more than one way to do it".say;
"やり方は一つじゃない".shiftjis.say;
qq(Do what I mean!)
.translate(from => 'english',
to => 'japanese',
via => 'http://translate.livedoor.com/'
).eucjp(fallback => PERLQQ).eval;
Well, let me think about it.
Thank You!
Questions?