unicode - Python 3 chokes on CP-1252/ANSI reading


I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:

    File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
      return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>

The files are opened with open() with no extra arguments. Can I pass extra arguments to open(), or use something in the codecs module, to open these differently?

This came up with code that was written in Python 2 and converted to Python 3 with the 2to3 tool.

Update: it turns out this is the result of feeding a zipfile into the parser. The unit test actually expects this to happen: the parser should recognize the input as something that can't be parsed. So I need to change my exception handling, and I'm in the process of doing that now.
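A minimal sketch of that kind of exception handling, assuming the parser reads each file as text and should treat a decoding failure as "not parseable". The names `UnparseableInput` and `read_text` are hypothetical and do not come from the original code:

```python
import os
import tempfile


class UnparseableInput(Exception):
    """Hypothetical exception for input the parser cannot handle."""


def read_text(path, encoding="cp1252"):
    """Return the file's text, or raise UnparseableInput if it does not decode."""
    try:
        with open(path, encoding=encoding) as handle:
            return handle.read()
    except UnicodeDecodeError as exc:
        # Re-raise as a parser-level error so callers need not know about codecs.
        raise UnparseableInput("{0}: {1}".format(path, exc)) from exc


# Usage: a few bytes including 0x81 stand in for binary input such as a zipfile.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(b"PK\x03\x04\x81\n")

try:
    read_text(sample_path)
    rejected = False
except UnparseableInput:
    rejected = True
finally:
    os.remove(sample_path)

print(rejected)
```

Note that this only catches binary input that happens to contain a byte the chosen codec cannot decode; it is not a general binary-file detector.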

Position 0x81 is unassigned in Windows-1252 (aka cp1252). In Latin-1 (aka ISO 8859-1) it is assigned to the U+0081 HIGH OCTET PRESET (HOP) control character. I can reproduce your error in Python 3.1 like this:

    >>> b'\x81'.decode('cp1252')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>

Or with an actual file:

    >>> open('test.txt', 'wb').write(b'\x81\n')
    2
    >>> open('test.txt').read()
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte

Now, to treat this file as Latin-1, you pass the encoding argument, as codeape suggested:

    >>> open('test.txt', encoding='latin-1').read()
    '\x81\n'

Beware that there are differences between the Windows-1252 and Latin-1 encodings; for example, Latin-1 doesn't have "smart quotes". If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
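To make the difference concrete, here is a small sketch that decodes the same bytes with both codecs and then lists the byte values cp1252 refuses to decode. The sample bytes and variable names are only illustrative:

```python
# 0x93 and 0x94 are the curly double quotes in Windows-1252.
data = b"\x93quoted\x94"

as_cp1252 = data.decode("cp1252")   # curly quotes U+201C / U+201D
as_latin1 = data.decode("latin-1")  # C1 control characters U+0093 / U+0094

# Collect the byte values that cp1252 leaves undefined.
# Latin-1 decodes every byte value, so it never raises here.
undefined_in_cp1252 = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(hex(value))

print(ascii(as_cp1252))
print(ascii(as_latin1))
print(undefined_in_cp1252)
```

Because Latin-1 accepts any byte, switching to it makes the error go away even for files that are not really Latin-1 text, so the decoded result may still be wrong.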

