UTF-8 conversion
If upstream documents use legacy encodings, it is a good idea to convert them to UTF-8.
- Use iconv(1) to convert the encoding of plain text files.
    iconv -f latin1 -t utf8 foo_in.txt > foo_out.txt
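The iconv step can be wrapped so that files which are already valid UTF-8 are left untouched. This is a minimal sketch, not a definitive implementation: the function name to_utf8 and its argument order are hypothetical, and it assumes an iconv that exits with a non-zero status on invalid input (as the glibc one does).

```shell
# Hypothetical helper: to_utf8 SRC_ENCODING INFILE OUTFILE
# Converts INFILE to UTF-8, or copies it unchanged if it is already valid UTF-8.
to_utf8() {
    src_enc=$1
    in=$2
    out=$3
    # Converting UTF-8 to UTF-8 fails on any byte sequence that is not valid
    # UTF-8, so the exit status doubles as a validity check.
    if iconv -f UTF-8 -t UTF-8 "$in" >/dev/null 2>&1; then
        cp "$in" "$out"
    else
        iconv -f "$src_enc" -t UTF-8 "$in" >"$out"
    fi
}
```

For example, "to_utf8 latin1 foo_in.txt foo_out.txt" mirrors the command above. The check is a heuristic: a legacy-encoded file whose bytes happen to form valid UTF-8 would be copied as is, so the result is still worth a visual check.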
- Use w3m(1) to convert HTML files to UTF-8 plain text files. Make sure to run it under a UTF-8 locale.
    LC_ALL=C.UTF-8 w3m -o display_charset=UTF-8 \
        -cols 70 -dump -no-graph -T text/html \
        < foo_in.html > foo_out.txt
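When upstream ships several HTML files, the same w3m invocation can be applied to a whole directory. The following is a sketch under stated assumptions: the function name html_dir_to_text is hypothetical, w3m is assumed to be installed, and each output file is written next to its input with a .txt suffix.

```shell
# Hypothetical helper: html_dir_to_text DIR
# Dumps every DIR/*.html file to a UTF-8 plain text file DIR/*.txt,
# using the same w3m options as the single-file command above.
html_dir_to_text() {
    dir=$1
    for html in "$dir"/*.html; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$html" ] || continue
        LC_ALL=C.UTF-8 w3m -o display_charset=UTF-8 \
            -cols 70 -dump -no-graph -T text/html \
            <"$html" >"${html%.html}.txt"
    done
}
```

Calling "html_dir_to_text doc" would then produce doc/foo_in.txt from doc/foo_in.html. Existing .txt files with the same base name are overwritten, so it may be safer to run this on a copy of the upstream tree.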