How to Convert an UTF-16 File to an UTF-8 file using PHP

Taking Andrew Walker’s previously mentioned handy little UTF-16 to UTF-8 string converter function, we now have in our means a particularly easy way in which to craft a simple UTF-16 to UTF-8 file converter, useful as I have found in the past for those silly little cases like when someone is spitting out Microsoft SQL Server generated CSV files (which are by default encoded in UTF-16) at you for example.

So let’s put down the code then shall we?

function utf16_to_utf8($str) {
    $c0 = ord($str[0]);
    $c1 = ord($str[1]);
 
    if ($c0 == 0xFE && $c1 == 0xFF) {
        $be = true;
    } else if ($c0 == 0xFF && $c1 == 0xFE) {
        $be = false;
    } else {
        return $str;
    }
 
    $str = substr($str, 2);
    $len = strlen($str);
    $dec = '';
    for ($i = 0; $i < $len; $i += 2) {
        $c = ($be) ? ord($str[$i]) << 8 | ord($str[$i + 1]) :
                ord($str[$i + 1]) << 8 | ord($str[$i]);
        if ($c >= 0x0001 && $c <= 0x007F) {
            $dec .= chr($c);
        } else if ($c > 0x07FF) {
            $dec .= chr(0xE0 | (($c >> 12) & 0x0F));
            $dec .= chr(0x80 | (($c >>  6) & 0x3F));
            $dec .= chr(0x80 | (($c >>  0) & 0x3F));
        } else {
            $dec .= chr(0xC0 | (($c >>  6) & 0x1F));
            $dec .= chr(0x80 | (($c >>  0) & 0x3F));
        }
    }
    return $dec;
}
 
function convert_file_to_utf8($csvfile) {
    $utfcheck = file_get_contents($csvfile);
    $utfcheck = utf16_to_utf8($utfcheck);
    file_put_contents($csvfile,$utfcheck);
}

To convert a file simply call the convert_file_to_utf8() function and pass to it the file path of the file you wish to convert. The function then uses the PHP function file_get_contents() to pack the input file’s contents into a string variable which is then passed to the main converter function which converts the string from UTF-16 to UTF-8 encoding if necessary. Finally, we use file_put_contents() to stuff the resulting string back into the original file, overwriting the original file contents.

Nice and simple really.

About Craig Lotter

Craig Lotter is a 29-ish year old software and web developer by trade (currently working for Touchwork), who also just happens to never have been able to shake off that pesky inner child within. Call him a fanboy, geek, nerd or whatever you want, just so long as you enjoy what he writes. His main personal site can be found at http://www.craiglotter.co.za and his webcomic, House of C can be found at http://www.houseofc.codeunit.co.za/
This entry was posted in Technology & Code, Tutorials and tagged , , , , , , , , . Bookmark the permalink.