Changing the File Encoding in Unix
From time to time, Unix users have to deal with files from colleagues that arrive in an unexpected encoding, and this can cause some trouble.
You may find yourself trying to write a shell script around a file, but cat only spits out gibberish. Or, if you try something a bit more elaborate like grep or sed, it does not find any match. Huh? Chances are that the file is not encoded in ASCII (or an ASCII-compatible encoding), so the standard Unix tools cannot make sense of its content.
$ less myfile.txt
"myfile.txt" may be a binary file. See it anyway?
$ file myfile.txt
myfile.txt: Little-endian UTF-16 Unicode text, with CR, LF line terminators
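Inside a script, a similar check can be approximated by looking at the byte-order mark (BOM) at the start of the file. The following is only a minimal sketch and covers far less than file does: detect_bom is a hypothetical helper name, it assumes head -c and od are available, and it reports any file without a BOM as unknown. A UTF-32LE file would be misreported as UTF-16LE, because both start with the bytes ff fe.

```shell
#!/bin/sh
# detect_bom FILE  (hypothetical helper, for illustration only)
# Prints an encoding guess based on the first bytes of FILE.
detect_bom() {
    # Dump the first three bytes as hex and strip spaces and newlines.
    bom=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case $bom in
        efbbbf*) echo UTF-8 ;;
        fffe*)   echo UTF-16LE ;;
        feff*)   echo UTF-16BE ;;
        *)       echo unknown ;;
    esac
}
```

Calling detect_bom myfile.txt on a file like the one above should print UTF-16LE, provided the file really starts with a BOM.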
At this point, there are two options. You can write a small script in your favourite language, or use the standard tools on your system instead. The final result should be the same, because both approaches typically rely on the same iconv conversion functions.
If you want to program a bit, the following languages offer iconv bindings for character set conversion.
- C: http://www.kernel.org/doc/man-pages/online/pages/man3/iconv.3.html
- PHP: http://php.net/manual/en/book.iconv.php
- Perl: http://search.cpan.org/dist/Text-Iconv/Iconv.pm
- Python: http://pypi.python.org/pypi/iconv/1.0
- Ruby: http://ruby-doc.org/stdlib-1.9.2/libdoc/iconv/rdoc/Iconv.html
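If you are writing a shell script, you can always use the iconv program. It takes the input and output encodings and the source file, and sends the result to standard output. Keep in mind that the conversion typically stops with an error if the input contains a character that the target encoding cannot represent, which can easily happen when the target is ASCII.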
$ iconv -f utf16 -t ascii oldfile > newfile
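In a script, it can be safer to write the result to a temporary file first, so that a failed conversion never leaves a truncated output file behind. The following is a minimal sketch under a few assumptions: convert_file is a hypothetical helper name, mktemp is available, and the local iconv knows the encoding names passed to it (iconv -l lists them).

```shell
#!/bin/sh
# convert_file FROM TO INPUT OUTPUT  (hypothetical helper, not part of iconv)
# Converts INPUT from encoding FROM to encoding TO. OUTPUT is only written
# if iconv succeeds, so a failed conversion leaves no partial file.
convert_file() {
    from=$1 to=$2 input=$3 output=$4
    tmp=$(mktemp) || return 1
    if iconv -f "$from" -t "$to" "$input" > "$tmp"; then
        mv "$tmp" "$output"
    else
        rm -f "$tmp"
        echo "convert_file: cannot convert $input from $from to $to" >&2
        return 1
    fi
}
```

For example, convert_file UTF-16LE UTF-8 oldfile newfile converts to UTF-8 instead of ASCII, which avoids errors on characters that ASCII cannot represent.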