Home > cpan, perl > Perl, Spreadsheet::ParseExcel, Parsing an XLS file with mixed encodings

Perl, Spreadsheet::ParseExcel, Parsing an XLS file with mixed encodings

We are using the Perl module, Spreadsheet::ParseExcel->new to parse XLS files and insert the results into a UTF8 MySQL database.

I came across an interesting situation with a Microsoft Excel file, where only 3 fields of the 12 are encoded as UTF-16BE, the rest are latin1.

Parsing a file with mixed encoding returns unexpected results, since the data in the spreadsheet looks correct, but parsing in it one receives incorrect characters.

Here is some code using Encode::Guess which will resolve the mixed encoding issue.

my $enc = guess_encoding($text); # use Encode::Guess;
if (ref $enc) {
    $descr = decode($enc->name, $text);
}
else { # convert to utf8, does not turn on utf8 flag
  from_to($text, "iso-8859-1", "utf8");
}

or, if you are already expecting UTF8 data, or more specifically, data with the utf8 flag turned on (perl 5.8.* - see perldoc Encode)

use Encode qw/decode from_to/;
use Encode::Guess;
if (! utf8::is_utf8($val)) { # ok, its not utf8, what is it?
    my $enc = guess_encoding($val); # use Encode::Guess;
    if (ref $enc) {
        $val = decode($enc->name, $val); #turns on utf8 flag
    }
}

(you would have to use these on a field-by-field/val-by-val basis obviously)

  1. No comments yet.