Utf 8 python ошибка

Доброго дня суток.В инете искал не нашел ответ.
В чем проблема , ну с вижуалкой игрался , вроде никак не связано с Python.
Ну вот потом поставил питон в вижуалку пробую и заметил что кириллицу не выводит.
Выдает что с UTF-8 проблема.
Зашел в другую среду Eclipse опять та же проблема(хотя до этого все работало).
Нашел строку : # -*- coding: cp1251 -*-
Если ее писать в начале то работает.Честно хз почему.
Хотелось бы узнать из-за чего могло сломаться , и как починить что бы не писать каждый раз эту строку.


  • Вопрос задан

    более трёх лет назад

  • 4300 просмотров

Here is a little tmp.py with a non ASCII character:

if __name__ == "__main__":
    s = 'ß'
    print(s)

Running it I get the following error:

Traceback (most recent call last):
  File ".tmp.py", line 3, in <module>
    print(s)
  File "C:Python32libencodingscp866.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character 'xdf' in position 0: character maps to <undefined>

The Python docs says:

By default, Python source files are treated as encoded in UTF-8…

My way of checking the encoding is to use Firefox (maybe someone would suggest something more obvious). I open tmp.py in Firefox and if I select View->Character Encoding->Unicode (UTF-8) it looks ok, that is the way it looks above in this question (wth ß
symbol).

If I put:

# -*- encoding: utf-8 -*-

as the first string in tmp.py it does not change anything—the error persists.

Could someone help me to figure out what am I doing wrong?

PEP: Python3 and UnicodeDecodeError

This is a PEP describing the behaviour of Python3 on UnicodeDecodeError. It’s a draft, don’t hesitate to comment it. This document suppose that my patch to allow bytes filenames is accepted which is not the case today.

While I was writing this document I found poential problems in Python3. So here is a TODO list (things to be checked):

  • FIXME: When bytearray is accepted or not?
  • FIXME: Allow bytes/str mix for shutil.copy*()? The ignore callback will get bytes or unicode?

Can anyone write a section about bytes encoding in Unicode using escape sequence?

What is the best tool to work on a PEP? I hate email threads, and I would prefer SVN / Mercurial / anything else.


Python3 and UnicodeDecodeError for the command line, environment variables and filenames

Introduction

Python3 does its best to give you texts encoded as a valid unicode characters strings. When it hits an invalid bytes sequence (according to the used charset), it has two choices: drops the value or raises an UnicodeDecodeError. This document present the behaviour of Python3 for the command line, environment variables and filenames.

Example of an invalid bytes sequence: ::

>>> str(b'xff', 'utf8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff (...)

whereas the same byte sequence is valid in another charset like ISO-8859-1: ::

>>> str(b'xff', 'iso-8859-1')
'ÿ'

Default encoding

Python uses «UTF-8» as the default Unicode encoding. You can read the default charset using sys.getdefaultencoding(). The «default encoding» is used by PyUnicode_FromStringAndSize().

A function sys.setdefaultencoding() exists, but it raises a ValueError for charset different than UTF-8 since the charset is hardcoded in PyUnicode_FromStringAndSize().

Command line

Python creates a nice unicode table for sys.argv using mbstowcs(): ::

$ ./python -c 'import sys; print(sys.argv)' 'Ho hé !'
['-c', 'Ho hé !']

On Linux, mbstowcs() uses LC_CTYPE environement variable to choose the encoding. On an invalid bytes sequence, Python quits directly with an exit code 1. Example with UTF-8 locale:

$ python3.0 $(echo -e 'invalid:xff')
Could not convert argument 1 to string

Environment variables

Python uses «_wenviron» on Windows which are contains unicode (UTF-16-LE) strings.  On other OS, it uses «environ» variable and the UTF-8 charset. It drops a variable if its key or value is not convertible to unicode. Example:

env -i HOME=/home/my PATH=$(echo -e "xff") python
>>> import os; list(os.environ.items())
[('HOME', '/home/my')]

Both key and values are unicode strings. Empty key and/or value are allowed.

Python ignores invalid variables, but values still exist in memory. If you run a child process (eg. using os.system()), the «invalid» variables will also be copied.

Filenames

Introduction

Python2 uses byte filenames everywhere, but it was also possible to use unicode filenames. Examples:

  • os.getcwd() gives bytes whereas os.getcwdu() always returns unicode
  • os.listdir(unicode) creates bytes or unicode filenames (fallback to bytes on UnicodeDecodeError), os.readlink() has the same behaviour

  • glob.glob() converts the unicode pattern to bytes, and so create bytes filenames
  • open() supports bytes and unicode

Since listdir() mix bytes and unicode, you are not able to manipulate easily filenames:

>>> path=u'.'
>>> for name in os.listdir(path):
...  print repr(name)
...  print repr(os.path.join(path, name))
...
u'valid'
u'./valid'
'invalidxff'
Traceback (most recent call last):
...
File "/usr/lib/python2.5/posixpath.py", line 65, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff (...)

Python3 supports both types, bytes and unicode, but disallow mixing them. If you ask for unicode, you will always get unicode or an exception is raised.

You should only use unicode filenames, except if you are writing a program fixing file system encoding, a backup tool or you users are unable to fix their broken system.

Windows

Microsoft Windows since Windows 95 only uses Unicode (UTF-16-LE) filenames. So you should only use unicode filenames.

Non Windows (POSIX)

POSIX OS like Linux uses bytes for historical reasons. In the best case, all filenames will be encoded as valid UTF-8 strings and Python creates valid unicode strings. But since system calls uses bytes, the file system may returns an invalid filename, or a program can creates a file with an invalid filename.

An invalid filename is a string which can not be decoded to unicode using the default file system encoding (which is UTF-8 most of the time).

A robust program will have to use only the bytes type to make sure that it can open / copy / remove any file or directory.

Filename encoding

Python use:

  • «mbcs» on Windows
  • or «utf-8» on Mac OS X
  • or nl_langinfo(CODESET) on OS supporting this function
  • or UTF-8 by default

«mbcs» is not a valid charset name, it’s an internal charset saying that Python will use the function MultiByteToWideChar() to decode bytes to unicode. This function uses the current codepage to decode bytes string.

You can read the charset using sys.getfilesystemencoding(). The function may returns None if Python is unable to determine the default encoding.

PyUnicode_DecodeFSDefaultAndSize() uses the default file system encoding, or UTF-8 if it is not set.

On UNIX (and other operating systems), it’s possible to mount different file systems using different charsets. sys.getdefaultencoding() will be the same for the different file systems since this encoding is only used between Python and the Linux kernel, not between the kernel and the file system which may uses a different charset.

Display a filename

Example of a function formatting a filename to display it to human eyes: ::

from sys import getfilesystemencoding
def format_filename(filename):
    return str(filename, getfilesystemencoding(), 'replace')

Example: format_filename(‘rxffport.doc’) gives ‘r�port.doc’ with the UTF-8 encoding.

Functions producing filenames

Policy: for unicode arguments: drop invalid bytes filenames; for bytes arguments: return bytes

  • os.listdir()
  • glob.glob()

This behaviour (drop silently invalid filenames) is motivated by the fact to if a directory of 1000 files only contains one invalid file, listdir() fails for the whole directory. Or if your directory contains 1000 python scripts (.py) and just one another document with an invalid filename (eg. r�port.doc), glob.glob(‘*.py’) fails whereas all .py scripts have valid filename.

Policy: for an unicode argument: raise an UnicodeDecodeError on invalid filename; for an bytes argument: return bytes

  • os.readlink()

Policy: create unicode directory or raise an UnicodeDecodeError

  • os.getcwd()

Policy: always returns bytes

  • os.getcwdb()

Functions for filename manipulation

Policy: raise TypeError on bytes/str mix

  • os.path.*(), eg. os.path.join()
  • fnmatch.*()

Functions accessing files

Policy: accept both bytes and str

  • io.open()
  • os.open()
  • os.chdir()
  • os.stat(), os.lstat()
  • os.rename()
  • os.unlink()
  • shutil.*()

os.rename(), shutil.copy*(), shutil.move() allow to use bytes for an argment, and unicode for the other argument

bytearray

In most cases, bytearray() can be used as bytes for a filename.

Unicode normalisation

Unicode characters can be normalized in 4 forms: NFC, NFD, NFKC or NFKD. Python does never normalize strings (nor filenames). No operating system does normalize filenames. So the users using different norms will be unable to retrieve their file. Don’t panic! All users use the same norm.

Use unicodedata.normalize() to normalize an unicode string.

Back to top

Toggle table of contents sidebar

Ошибки при конвертации#

При конвертации между строками и байтами очень важно точно знать, какая
кодировка используется, а также знать о возможностях разных кодировок.

Например, кодировка ASCII не может преобразовать в байты кириллицу:

In [32]: hi_unicode = 'привет'

In [33]: hi_unicode.encode('ascii')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-33-ec69c9fd2dae> in <module>()
----> 1 hi_unicode.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

Аналогично, если строка «привет» преобразована в байты, и попробовать
преобразовать ее в строку с помощью ascii, тоже получим ошибку:

In [34]: hi_unicode = 'привет'

In [35]: hi_bytes = hi_unicode.encode('utf-8')

In [36]: hi_bytes.decode('ascii')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-36-aa0ada5e44e9> in <module>()
----> 1 hi_bytes.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

Еще один вариант ошибки, когда используются разные кодировки для
преобразований:

In [37]: de_hi_unicode = 'grüezi'

In [38]: utf_16 = de_hi_unicode.encode('utf-16')

In [39]: utf_16.decode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-39-4b4c731e69e4> in <module>()
----> 1 utf_16.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Наличие ошибок — это хорошо. Они явно говорят, в чем проблема.
Хуже, когда получается так:

In [40]: hi_unicode = 'привет'

In [41]: hi_bytes = hi_unicode.encode('utf-8')

In [42]: hi_bytes
Out[42]: b'xd0xbfxd1x80xd0xb8xd0xb2xd0xb5xd1x82'

In [43]: hi_bytes.decode('utf-16')
Out[43]: '뿐胑룐닐뗐苑'

Обработка ошибок#

У методов encode и decode есть режимы обработки ошибок, которые
указывают, как реагировать на ошибку преобразования.

Параметр errors в encode#

По умолчанию encode использует режим strict — при возникновении ошибок
кодировки генерируется исключение UnicodeError. Примеры такого поведения
были выше.

Вместо этого режима можно использовать replace, чтобы заменить символ
знаком вопроса:

In [44]: de_hi_unicode = 'grüezi'

In [45]: de_hi_unicode.encode('ascii', 'replace')
Out[45]: b'gr?ezi'

Или namereplace, чтобы заменить символ именем:

In [46]: de_hi_unicode = 'grüezi'

In [47]: de_hi_unicode.encode('ascii', 'namereplace')
Out[47]: b'gr\N{LATIN SMALL LETTER U WITH DIAERESIS}ezi'

Кроме того, можно полностью игнорировать символы, которые нельзя
закодировать:

In [48]: de_hi_unicode = 'grüezi'

In [49]: de_hi_unicode.encode('ascii', 'ignore')
Out[49]: b'grezi'

Параметр errors в decode#

В методе decode по умолчанию тоже используется режим strict и
генерируется исключение UnicodeDecodeError.

Если изменить режим на ignore, как и в encode, символы будут просто
игнорироваться:

In [50]: de_hi_unicode = 'grüezi'

In [51]: de_hi_utf8 = de_hi_unicode.encode('utf-8')

In [52]: de_hi_utf8
Out[52]: b'grxc3xbcezi'

In [53]: de_hi_utf8.decode('ascii', 'ignore')
Out[53]: 'grezi'

Режим replace заменит символы:

In [54]: de_hi_unicode = 'grüezi'

In [55]: de_hi_utf8 = de_hi_unicode.encode('utf-8')

In [56]: de_hi_utf8.decode('ascii', 'replace')
Out[56]: 'gr��ezi'
Table of Contents
Hide
  1. What is UnicodeDecodeError ‘utf8’ codec can’t decode byte?
  2. Solution for Importing and Reading CSV files using Pandas
  3. Solution for Loading and Parsing JSON files
  4. Solution for Loading and Parsing any other file formats
  5. Solution for decoding the string contents efficiently

The UnicodeDecodeError occurs mainly while importing and reading the CSV or JSON files in your Python code. If the provided file has some special characters, Python will throw an UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xa5 in position 0: invalid start byte.

The UnicodeDecodeError normally happens when decoding a string from a certain coding. Since codings map only a limited number of str strings to Unicode characters, an illegal sequence of str characters (non-ASCII) will cause the coding-specific decode() to fail.

When importing and reading a CSV file, Python tries to convert a byte-array (bytes which it assumes to be a utf-8-encoded string) to a Unicode string (str). It is a decoding process according to UTF-8 rules. When it tries this, it encounters a byte sequence that is not allowed in utf-8-encoded strings (namely this 0xff at position 0).

Example

import pandas as pd
a = pd.read_csv("filename.csv")

Output

Traceback (most recent call last):
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 2: invalid start byte

There are multiple solutions to resolve this issue, and it depends on the different use cases. Let’s look at the most common occurrences, and the solution to each of these use cases.

Solution for Importing and Reading CSV files using Pandas

If you are using pandas to import and read the CSV files, then you need to use the proper encoding type or set it to unicode_escape to resolve the UnicodeDecodeError as shown below.

import pandas as pd
data=pd.read_csv("C:\Employess.csv",encoding=''unicode_escape')
print(data.head())

Solution for Loading and Parsing JSON files

If you are getting UnicodeDecodeError while reading and parsing JSON file content, it means you are trying to parse the JSON file, which is not in UTF-8 format. Most likely, it might be encoded in ISO-8859-1. Hence try the following encoding while loading the JSON file, which should resolve the issue.

json.loads(unicode(opener.open(...), "ISO-8859-1"))

Solution for Loading and Parsing any other file formats

In case of any other file formats such as logs, you could open the file in binary mode and then continue the file read operation. If you just specify only read mode, it opens the file and reads the file content as a string, and it doesn’t decode properly.

You could do the same even for the CSV, log, txt, or excel files also.

with open(path, 'rb') as f:
  text = f.read()

Alternatively, you can use decode() method on the file content and specify errors=’replace’ to resolve UnicodeDecodeError 

with open(path, 'rb') as f:
  text = f.read().decode(errors='replace')

When you call .decode() an a unicode string, Python 2 tries to be helpful and decides to encode the Unicode string back to bytes (using the default encoding), so that you have something that you can really decode. This implicit encoding step doesn’t use errors='replace', so if there are any characters in the Unicode string that aren’t in the default encoding (probably ASCII) you’ll get a UnicodeEncodeError.

(Python 3 no longer does this as it is terribly confusing.)

Check the type of message and assuming it is indeed Unicode, works back from there to find where it was decoded (possibly implicitly) to replace that with the correct decoding.

Solution for decoding the string contents efficiently

If you encounter UnicodeDecodeError while reading a string variable, then you could simply use the encode method and encode into a utf-8 format which inturns resolve the error.

str.encode('utf-8').strip()

Avatar Of Srinivas Ramakrishna

Srinivas Ramakrishna is a Solution Architect and has 14+ Years of Experience in the Software Industry. He has published many articles on Medium, Hackernoon, dev.to and solved many problems in StackOverflow. He has core expertise in various technologies such as Microsoft .NET Core, Python, Node.JS, JavaScript, Cloud (Azure), RDBMS (MSSQL), React, Powershell, etc.

Sign Up for Our Newsletters

Subscribe to get notified of the latest articles. We will never spam you. Be a part of our ever-growing community.

By checking this box, you confirm that you have read and are agreeing to our terms of use regarding the storage of the data submitted through this form.

Понравилась статья? Поделить с друзьями:
  • Uss500 билайн ошибка
  • Usrtbased dll ошибка
  • Using tensorflow backend ошибка
  • Using render selected with empty selection ошибка
  • Using namespace system c ошибка