Perl, Spreadsheet::ParseExcel, Parsing an XLS file with mixed encodings
February 3rd, 2010
No comments
We are using the Perl module, Spreadsheet::ParseExcel->new to parse XLS files and insert the results into a UTF8 MySQL database.
I came across an interesting situation with a Microsoft Excel file, where only 3 fields of the 12 are encoded as UTF-16BE, the rest are latin1.
Parsing a file with mixed encoding returns unexpected results, since the data in the spreadsheet looks correct, but parsing in it one receives incorrect characters.
Here is some code using Encode::Guess which will resolve the mixed encoding issue.
my $enc = guess_encoding($text); # use Encode::Guess;
if (ref $enc) {
$descr = decode($enc->name, $text);
}
else { # convert to utf8, does not turn on utf8 flag
from_to($text, "iso-8859-1", "utf8");
}
or, if you are already expecting UTF8 data, or more specifically, data with the utf8 flag turned on (perl 5.8.* - see perldoc Encode)
use Encode qw/decode from_to/;
use Encode::Guess;
if (! utf8::is_utf8($val)) { # ok, its not utf8, what is it?
my $enc = guess_encoding($val); # use Encode::Guess;
if (ref $enc) {
$val = decode($enc->name, $val); #turns on utf8 flag
}
}
(you would have to use these on a field-by-field/val-by-val basis obviously)
