Python encoding errors - Only those who do nothing never make mistakes!

Why this article?

Natural language processing is familiar to everyone these days. Almost everyone has asked Siri or Alice a question, used Grammarly (this is not an ad), tried poem and text generators... or simply typed a query into Google. Yes, it is that simple: Google understands what you want from it thanks to tools that process and analyze the natural language in your query.

When analyzing text, we may run into cases where it contains special characters that have to be analyzed along with the "plain text" (take the much-loved French passages in "War and Peace"), or formulas, for example. In such cases text processing can become more complicated.

You may notice that if you type a query containing accented characters (the so-called combining acute accent), for example "ó", the search engine may show results that contain the words from your query, but with the accented characters appearing as ordinary characters.

Consider the following query:

The query contains a character with a combining acute accent, yet in the second result the matched word from the query is highlighted even though it does not contain that character, just the plain letter "o".

Of course, there are plenty of ready-made tools that handle text processing quite well and can do all sorts of cool things, but that is not what I want to talk about. I will not cover nltk, stemming, lemmatization and the like. I want to go a few steps lower and discuss some of the finer points of encodings, bytes and how they are processed.

Where did this article come from?

Natural language processing is one of the important components of AI. While studying the subject I kept asking myself questions that eventually led me to encodings: how text is represented in memory, how it is converted, and how it is brought to a normal form. I understood the topic poorly at first, and it took a good deal of time and mental effort to understand, accept and remember certain things. With this article I want to make life easier for people who need to read and process text in Python, and to consolidate what I have learned. I will try to share the most useful takeaways along the way.

An important remark: I am not a text processing specialist. The material presented here is the result of purely amateur study.

The problem of reading files

Suppose we have a file with some text, and we need to read it. It would seem that we can just write a script like this and be happy:

with open("some_text.txt", "r") as file:
    content = file.read()

print(content)

The file contains this word:

pitón

which is Spanish for "python". However, the Windows 10 console shows a slightly different result:

C:\myhabr\TextsInPython> python .\script1.py
pitÃ³n

Let's figure out exactly what went wrong and why.

Encoding

It will hardly be a surprise that any character placed in computer memory is stored as a number rather than as a literal glyph. This number is called the character's identifier, or code point. An encoding defines which number, and ultimately which bytes, are associated with each character.
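As a small illustration of the difference between a code point and its bytes, here is a sketch; the character and the three encodings are arbitrary choices made for this example.

```python
# One character has one code point, but different bytes in different encodings.
char = "ó"

code_point = ord(char)                   # the number Unicode assigns to the character
utf8_bytes = char.encode("utf-8")        # two bytes in UTF-8
latin1_bytes = char.encode("latin-1")    # a single byte in Latin-1
utf16_bytes = char.encode("utf-16-le")   # two bytes in little-endian UTF-16, no BOM

print(hex(code_point), utf8_bytes, latin1_bytes, utf16_bytes)
```

The code point stays the same in every case; only the byte sequence chosen by the encoding changes.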

Suppose we have a file with unknown contents that we need to read, but we do not know its encoding. Let's try to decode the file's contents.

with open("simple_text.txt", "r") as file:
    text = file.read()
print(text)

Let's look at the result:

C:\myhabr\TextsInPython> python .\script2.py
ÿþ<♦8♦@♦

Very interesting, and completely unintelligible. When no encoding is given, open() uses the locale's preferred encoding (evidently cp1252 on this Windows machine, UTF-8 on most Linux and macOS systems), and the file was apparently written with a different one. This is where an extra parameter of open comes in: encoding, which specifies the encoding to use when reading the file (or writing to it). Let's try a few encodings and find the right one.

codecs = ["cp1252", "cp437", "utf-16be", "utf-16"]

for codec in codecs:
    with open("simple_text.txt", "r", encoding=codec) as file:
        text = file.read()
    print(codec.rjust(12), "|", text)

Result:

C:\myhabr\TextsInPython> python .\script3.py
      cp1252 | ÿþ<8@
       cp437 |  ■<8@
    utf-16be | 㰄㠄䀄
      utf-16 | мир

Different encodings interpret the bytes of the file differently: the same byte sequence can correspond to different characters. The example is simple, and it is easy to see that the file's actual encoding is utf-16.
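Both garbled outputs above can be reproduced in memory without any files. The sketch below assumes the UTF-16 file started with a little-endian byte order mark, which is what the "ÿþ" prefix suggests.

```python
import codecs

# The presumed bytes of the UTF-16 file: a little-endian BOM followed by "мир".
data = codecs.BOM_UTF16_LE + "мир".encode("utf-16-le")

as_utf16 = data.decode("utf-16")    # the BOM tells the codec the byte order
as_cp1252 = data.decode("cp1252")   # same bytes, wrong codec: no error, just garbage

# The "pitón" mojibake arises the same way: UTF-8 bytes read as cp1252.
mojibake = "pitón".encode("utf-8").decode("cp1252")

print(repr(as_utf16), repr(as_cp1252), repr(mojibake))
```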

Important point: always specify the encoding explicitly when reading from and writing to files; this avoids confusion later.
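To see which defaults apply on a particular machine, something like the following can be used. It is only a sketch: the value reported for open() depends on the operating system, the locale and the Python version.

```python
import locale
import sys

# Encoding used by str.encode() and bytes.decode() when none is given.
method_default = sys.getdefaultencoding()

# Encoding that open() typically falls back to in text mode when `encoding`
# is omitted; it depends on the OS and locale settings.
open_default = locale.getpreferredencoding(False)

print(method_default, open_default)
```

If the two values differ, a file written by one program and read by another without an explicit encoding is likely to come out garbled.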

Errors related to encodings

When an encoding-related error occurs, the interpreter raises one of the following exceptions:

UnicodeError. The general exception for encoding errors.
UnicodeDecodeError. Raised when the bytes being decoded contain a sequence that is not valid in the given encoding.
UnicodeEncodeError. Raised when a character that has to be encoded cannot be represented in the given encoding.

Trying to run the following code (the file still contains the Spanish python):

with open("some_text.txt", "r", encoding="ascii") as file:
    file.read()

gives us the following result:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

ASCII supports no alphabet other than English, so decoding "ó" is a problem for it. However, Python is almighty and has a mechanism for handling encoding errors: the additional errors parameter of the encode and decode methods. It can take the following values:

For both methods:

Value	Meaning
`strict`	The default. Characters or bytes that do not fit the encoding raise `UnicodeError` or one of its subclasses.
`ignore`	Offending characters or bytes are skipped without raising an exception.
`replace`	Offending data is replaced with a marker: `?` when encoding, U+FFFD (`�`) when decoding.
`backslashreplace`	Offending data is replaced with backslash escape sequences.

For the encode method only:

Value	Meaning
`xmlcharrefreplace`	Unencodable characters are replaced with XML numeric character references.
`namereplace`	Unencodable characters are replaced with `\N{...}` escapes built from their names in the Unicode database.

The values surrogatepass and surrogateescape are listed separately.

Here is an example of using these handlers:

>>> text = "pitón"
>>> text.encode("ascii", errors="ignore")
b'pitn'
>>> text.encode("ascii", errors="replace")
b'pit?n'
>>> text.encode("ascii", errors="xmlcharrefreplace")
b'pit&#243;n'
>>> text.encode("ascii", errors="backslashreplace")
b'pit\\xf3n'
>>> text.encode("ascii", errors="namereplace")
b'pit\\N{LATIN SMALL LETTER O WITH ACUTE}n'

Important point: if the text may contain characters the encoding does not expect, you can use error handlers to avoid exceptions.
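The handlers are also useful in the other direction, when decoding bytes. The following sketch uses a made-up byte string (Latin-1 bytes read as UTF-8) and shows three of the handlers, including surrogateescape, which keeps the bad byte recoverable.

```python
# Latin-1 bytes for "pitón"; the byte 0xf3 on its own is not valid UTF-8.
raw = b"pit\xf3n"

replaced = raw.decode("utf-8", errors="replace")           # U+FFFD marks the bad byte
escaped = raw.decode("utf-8", errors="backslashreplace")   # keeps the byte value visible
smuggled = raw.decode("utf-8", errors="surrogateescape")   # bad byte becomes a lone surrogate

# surrogateescape is reversible: encoding with the same handler restores the bytes.
restored = smuggled.encode("utf-8", errors="surrogateescape")

print(repr(replaced), repr(escaped), repr(smuggled), restored)
```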

Case folding

Case folding is an attempt to bring text in any form to a canonical form, for example by converting all of it to lowercase. Some additional transformations are also applied (for example, the German eszett "ß" becomes "ss"). Python 3.3 introduced the str.casefold() method, which performs case folding. For text that contains only latin1 characters, the result matches str.lower() with two exceptions: "ß" becomes "ss" and the micro sign "µ" becomes the Greek letter "μ".

And, as usual, an example:

>>> text = "Die größte Stadt der Welt liegt in China"
>>> text.casefold()
'die grösste stadt der welt liegt in china'

As a result, the method not only converted the whole text to lowercase but also transformed the special German character.

Important point: text can be brought to a common case not only with str.lower() but also with str.casefold(), which may apply additional transformations.
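A short sketch of where the two methods diverge; the helper name same_ignoring_case is made up for this example.

```python
# lower() keeps the eszett, casefold() folds it to "ss".
lowered = "Straße".lower()
folded = "Straße".casefold()
print(lowered, folded)


def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison based on casefold() (hypothetical helper)."""
    return a.casefold() == b.casefold()


print(same_ignoring_case("STRASSE", "Straße"))
```

For display purposes lower() is usually what is wanted; casefold() is meant for comparing strings.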

Normalization

Normalization is the full conversion of text to a single, uniform representation.

To show why normalization matters, here is a simple example:

letter1 = "µ"
letter2 = "μ"

Outwardly the two characters look exactly the same. However, if we print their names as the Python interpreter sees them, the result is rather surprising.

Python has an excellent built-in module with data about Unicode characters: their names, whether they are digits, and so on (methods such as str.isdigit() take their information from this data). Let's use the module to print the characters' names as recorded in the Unicode database.

import unicodedata

letter1 = "µ"
letter2 = "μ"
print(unicodedata.name(letter1))
print(unicodedata.name(letter2))

The result of running this code:

C:\myhabr\TextsInPython> python .\script7.py
MICRO SIGN
GREEK SMALL LETTER MU

So the Python interpreter sees these as two different characters, although they are rendered identically. Such characters are called compatibility equivalents: they carry the same meaning, and many applications treat them as interchangeable, but the interpreter compares them as different.

Let's look at one more example:

>>> s1 = 'café'
>>> s2 = 'cafe\u0301'
>>> s1, s2
('café', 'café')
>>> s1 == s2
False
>>> len(s1), len(s2)
(4, 5)

These two strings are canonical equivalents. The example shows that "é" can be represented in Unicode in two ways, which even differ in length: as one code point (U+00E9) or as two (the letter "e" followed by the combining acute accent U+0301).

Normalization resolves such conflicts. In Python it is implemented by the unicodedata.normalize function. Its first argument is the so-called normalization form; normalization forms make it possible to decide whether two Unicode strings are equivalent to each other. There are four forms:

Form	Description
Normalization Form D (NFD)	Canonical Decomposition
Normalization Form C (NFC)	Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD)	Compatibility Decomposition
Normalization Form KC (NFKC)	Compatibility Decomposition, followed by Canonical Composition

Let's look at each form in a little more detail.

NFC

With this form, code points are canonically composed (as the name suggests) to produce the shortest equivalent string.

>>> unicodedata.normalize("NFC", s1), unicodedata.normalize("NFC", s2)
('café', 'café')
>>> len(unicodedata.normalize("NFC", s1)), len(unicodedata.normalize("NFC", s2))
(4, 4)
>>> unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)
True
>>> len(unicodedata.normalize("NFC", s1)) == len(unicodedata.normalize("NFC", s2))
True

So normalizing the two strings did not change how they look, but the length of s2 became 4 (one code point fewer). The sequence e\u0301, which rendered as "é", was composed and replaced by the shortest representation of the character, the same one used in s1. As a result, the normalized strings have equal length and compare equal.

NFD

This form works the same way, except that code points are decomposed: a composed character is split into a base character and combining marks.

>>> unicodedata.normalize("NFD", s1), unicodedata.normalize("NFD", s2)
('café', 'café')
>>> len(unicodedata.normalize("NFD", s1)), len(unicodedata.normalize("NFD", s2))
(5, 5)
>>> unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)
True
>>> len(unicodedata.normalize("NFD", s1)) == len(unicodedata.normalize("NFD", s2))
True

Here the length of s1 grew by one code point. By now it should be easy to guess why.

This is the right moment to introduce the notion of a compatibility character. Compatibility characters were added to Unicode for compatibility with other standards, in particular standards that predate Unicode. This means that some characters may occur in the standard more than once. We already saw this at the beginning of this section with the micro sign "µ", which is considered a compatibility character.

NFKC and NFKD

With these normalization forms, compatibility characters are replaced with their preferred representation, which is called compatibility decomposition. However, formatting information may be lost with these forms.
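What "lost formatting" means is easiest to see on a couple of characters. The sample strings below are arbitrary and serve only as an illustration.

```python
import unicodedata

# Compatibility normalization can change the meaning, not just the representation.
half = unicodedata.normalize("NFKC", "½")       # "1", FRACTION SLASH (U+2044), "2"
squared = unicodedata.normalize("NFKC", "4²")   # the superscript becomes a plain digit

print(half, squared)
```

Since "4²" turns into "42", these forms suit searching and indexing better than storing text permanently.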

Let's slightly modify the example from the beginning of the section and print the characters' code points before and after normalization:

import unicodedata

letter1 = "µ"
letter2 = "μ"
print("Before normalizing:", ord(letter1), ord(letter2))
letter1 = unicodedata.normalize("NFKC", letter1)
letter2 = unicodedata.normalize("NFKC", letter2)
print("After normalizing:", ord(letter1), ord(letter2))

And the result of running the code:

Before normalizing: 181 956
After normalizing: 956 956

So the first character (the micro sign) was replaced with the Greek letter mu, the preferred representation of the character. If you need, say, a frequency analysis of a text, the normalization forms that affect compatibility characters can help by bringing such characters to a single representation.
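In practice, normalization is usually applied right before comparing strings. Below is a minimal sketch of two comparison helpers; the function names are made up, and the second one is only an approximation of full caseless matching.

```python
from unicodedata import normalize


def nfc_equal(a: str, b: str) -> bool:
    """Equality up to canonical equivalence (hypothetical helper)."""
    return normalize("NFC", a) == normalize("NFC", b)


def nfkc_fold_equal(a: str, b: str) -> bool:
    """Looser equality: compatibility normalization plus case folding."""
    return normalize("NFKC", a).casefold() == normalize("NFKC", b).casefold()


print(nfc_equal("café", "cafe\u0301"))         # composed vs. decomposed "é"
print(nfc_equal("\u00b5", "\u03bc"))           # micro sign vs. Greek mu
print(nfkc_fold_equal("\u00b5", "\u03bc"))
```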

Important point: normalization can help a great deal when searching for relevant documents or indexing text. If you develop such systems, do not write normalization algorithms off.
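The search behaviour described at the start of the article, where "ó" matches a plain "o", can be approximated by removing combining marks after decomposition. This is a sketch of one possible approach, not a description of what any search engine actually does; the function name strip_marks is made up.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove combining marks such as accents (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining characters.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


print(strip_marks("pitón"), strip_marks("café"))
```

Note that this also changes letters in other scripts (for example, the Cyrillic "й" becomes "и"), so it is better applied to a search key than to the stored text.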

Further material: what was used in the article and what to read on the topic

"Fluent Python" by Luciano Ramalho

A whole chapter of this book is devoted to strings, bytes and Unicode (Chapter 4, "Text versus Bytes"). It is available in Russian and English, but the Russian translation contains quite a few errors, so open it at your own risk. This article relies largely on that book, and some of the examples are taken from it.

The Unicode documentation on the official Python site

Where would we be without it. It also has plenty of useful information if you need to work with text and do more than just read it from a file. Although in some cases even that can trip you up.

Unicode® Standard Annex

These are parts of the Unicode standard published openly as separate documents. You can read them here.

Source

Conversion errors

When converting between strings and bytes it is very important to know exactly which encoding is used, and to know what different encodings are capable of.

For example, ASCII cannot convert Cyrillic to bytes:

In [32]: hi_unicode = 'привет'

In [33]: hi_unicode.encode('ascii')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-33-ec69c9fd2dae> in <module>()
----> 1 hi_unicode.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

Similarly, if the string "привет" has been converted to bytes and we try to convert it back to a string using ascii, we also get an error:

In [34]: hi_unicode = 'привет'

In [35]: hi_bytes = hi_unicode.encode('utf-8')

In [36]: hi_bytes.decode('ascii')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-36-aa0ada5e44e9> in <module>()
----> 1 hi_bytes.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

Another kind of error occurs when different encodings are used for the two conversions:

In [37]: de_hi_unicode = 'grüezi'

In [38]: utf_16 = de_hi_unicode.encode('utf-16')

In [39]: utf_16.decode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-39-4b4c731e69e4> in <module>()
----> 1 utf_16.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Errors are a good thing: they state the problem explicitly.
It is worse when this happens:

In [40]: hi_unicode = 'привет'

In [41]: hi_bytes = hi_unicode.encode('utf-8')

In [42]: hi_bytes
Out[42]: b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

In [43]: hi_bytes.decode('utf-16')
Out[43]: '뿐胑룐닐뗐苑'
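The practical consequence is that a successful decode proves very little. The sketch below uses a made-up helper, decodes_cleanly, to check which of a few arbitrarily chosen codecs accept the same bytes without complaint.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if the bytes decode without an exception (hypothetical helper)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


hi_bytes = "привет".encode("utf-8")
candidates = ["ascii", "utf-8", "utf-16", "cp1251"]
clean = [name for name in candidates if decodes_cleanly(hi_bytes, name)]

# Several codecs accept the bytes, but only one gives back the original text.
print(clean)
print(repr(hi_bytes.decode("cp1251")))
```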

Error handling

The encode and decode methods have error handling modes that specify how to react to a conversion error.

The errors parameter in encode

By default encode uses the strict mode: when an encoding error occurs, a UnicodeEncodeError (a subclass of UnicodeError) is raised. Examples of this behavior were shown above.

Instead of this mode you can use replace, which substitutes a question mark for the character:

In [44]: de_hi_unicode = 'grüezi'

In [45]: de_hi_unicode.encode('ascii', 'replace')
Out[45]: b'gr?ezi'

Or namereplace, to replace the character with its name:

In [46]: de_hi_unicode = 'grüezi'

In [47]: de_hi_unicode.encode('ascii', 'namereplace')
Out[47]: b'gr\\N{LATIN SMALL LETTER U WITH DIAERESIS}ezi'

You can also completely ignore characters that cannot be encoded:

In [48]: de_hi_unicode = 'grüezi'

In [49]: de_hi_unicode.encode('ascii', 'ignore')
Out[49]: b'grezi'

The errors parameter in decode

The decode method also uses the strict mode by default and raises UnicodeDecodeError.

If you change the mode to ignore, then, as in encode, the offending bytes are simply skipped:

In [50]: de_hi_unicode = 'grüezi'

In [51]: de_hi_utf8 = de_hi_unicode.encode('utf-8')

In [52]: de_hi_utf8
Out[52]: b'gr\xc3\xbcezi'

In [53]: de_hi_utf8.decode('ascii', 'ignore')
Out[53]: 'grezi'

The replace mode substitutes the replacement character U+FFFD for each undecodable byte:

In [54]: de_hi_unicode = 'grüezi'

In [55]: de_hi_utf8 = de_hi_unicode.encode('utf-8')

In [56]: de_hi_utf8.decode('ascii', 'replace')
Out[56]: 'gr��ezi'
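When neither skipping nor replacing is acceptable, the exception itself can be inspected to report exactly which bytes caused the problem. This is a sketch that reuses the "grüezi" example.

```python
data = "grüezi".encode("utf-8")
details = None

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    # The exception records the codec, the offending slice and the reason.
    details = (exc.encoding, exc.start, exc.end, exc.object[exc.start:exc.end], exc.reason)

print(details)
```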

Source

I’m reading and parsing an Amazon XML file and while the XML file shows a ‘ , when I try to print it I get the following error:

'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128)

From what I’ve read online thus far, the error is coming from the fact that the XML file is in UTF-8, but Python wants to handle it as an ASCII encoded character. Is there a simple way to make the error go away and have my program print the XML as it reads?

asked Jul 11, 2010 at 19:00

Most likely you parsed it fine, and the problem appears when you try to print the contents of the XML, because it contains some non-ASCII Unicode characters. Try encoding your unicode string as ascii first:

unicodeData.encode('ascii', 'ignore')

the ‘ignore’ part will tell it to just skip those characters. From the python docs:

>>> # Python 2: u = unichr(40960) + u'abcd' + unichr(1972)
>>> u = chr(40960) + u'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'

You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what’s going on. After the read, you’ll stop feeling like you’re just guessing what commands to use (or at least that happened to me).

Mike T

answered Jul 11, 2010 at 19:10

Scott Stafford

A better solution (Python 2):

if type(value) == str:
    # Ignore errors even if the string is not proper UTF-8 or has
    # broken marker bytes.
    # Python built-in function unicode() can do this.
    value = unicode(value, "utf-8", errors="ignore")
else:
    # Assume the value object has proper __unicode__() method
    value = unicode(value)

If you would like to read more about why:

http://docs.plone.org/manage/troubleshooting/unicode.html#id1

twasbrillig

answered Jan 9, 2014 at 20:24

Paxwell

Don’t hardcode the character encoding of your environment inside your script; print Unicode text directly instead:

assert isinstance(text, unicode) # or str on Python 3
print(text)

If your output is redirected to a file (or a pipe), you could use the PYTHONIOENCODING environment variable to specify the character encoding:

$ PYTHONIOENCODING=utf-8 python your_script.py >output.utf8

Otherwise, python your_script.py should work as is — your locale settings are used to encode the text (on POSIX check: LC_ALL, LC_CTYPE, LANG envvars — set LANG to a utf-8 locale if necessary).

To print Unicode on Windows, see this answer that shows how to print Unicode to Windows console, to a file, or using IDLE.
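To see what the stream encoding does to printed text without touching the real console, the output can be captured in memory. This is a sketch for Python 3; the helper name render is made up.

```python
import io


def render(text: str, encoding: str, errors: str = "strict") -> bytes:
    """Return the bytes print() writes to a stream with this encoding (hypothetical helper)."""
    raw = io.BytesIO()
    # newline="\n" disables platform-specific newline translation.
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


print(render("it\u2019s", "utf-8"))
print(render("it\u2019s", "ascii", errors="backslashreplace"))
```

With encoding "ascii" and the default strict handler, the same call raises UnicodeEncodeError, which is the error from the question. On Python 3.7 and later, sys.stdout.reconfigure(encoding=..., errors=...) changes these settings for the real standard output.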

answered Jun 29, 2015 at 7:46

jfs

Excellent post : http://www.carlosble.com/2010/12/understanding-python-and-unicode/

# -*- coding: utf-8 -*-

def __if_number_get_string(number):
    converted_str = number
    if isinstance(number, (int, float)):
        converted_str = str(number)
    return converted_str


def get_unicode(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')


def get_string(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode

answered Sep 13, 2016 at 18:31

Ranvijay Sachan

You can use something of the form

s.decode('utf-8')

which will convert a UTF-8 encoded bytestring into a Python Unicode string. But the exact procedure to use depends on exactly how you load and parse the XML file, e.g. if you don’t ever access the XML string directly, you might have to use a decoder object from the codecs module.
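For the case where the data arrives in pieces, here is a sketch of such a decoder object in Python 3. The chunks are invented for the example and deliberately split one character in the middle.

```python
import codecs

# "pitón" in UTF-8, cut between the two bytes that make up "ó".
chunks = [b"pit\xc3", b"\xb3n"]

decoder = codecs.getincrementaldecoder("utf-8")()
parts = [decoder.decode(chunk) for chunk in chunks]
parts.append(decoder.decode(b"", final=True))   # flush; errors out on a dangling byte

text = "".join(parts)
print(parts, text)
```

Calling chunk.decode('utf-8') on each piece separately would fail on the first chunk, because it ends with an incomplete sequence.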

answered Jul 11, 2010 at 19:04

David Z

I wrote the following to fix the nuisance non-ascii quotes and force conversion to something usable.

unicodeToAsciiMap = {u'\u2019':"'", u'\u2018':"`", }

def unicodeToAscii(inStr):
    try:
        return str(inStr)
    except:
        pass
    outStr = ""
    for i in inStr:
        try:
            outStr = outStr + str(i)
        except:
            if unicodeToAsciiMap.has_key(i):
                outStr = outStr + unicodeToAsciiMap[i]
            else:
                try:
                    print "unicodeToAscii: add to map:", i, repr(i), "(encoded as _)"
                except:
                    print "unicodeToAscii: unknown code (encoded as _)", repr(i)
                outStr = outStr + "_"
    return outStr
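The function above is Python 2 code. A rough Python 3 counterpart of the same idea might look like this; it is a sketch, not the answer author's code, and the names QUOTE_MAP and to_ascii are made up.

```python
# Map typographic quotes to ASCII look-alikes, as in the dictionary above.
QUOTE_MAP = str.maketrans({"\u2019": "'", "\u2018": "`"})


def to_ascii(text: str, placeholder: str = "_") -> str:
    """Replace known characters via the map and any other non-ASCII one with a placeholder."""
    translated = text.translate(QUOTE_MAP)
    return "".join(ch if ord(ch) < 128 else placeholder for ch in translated)


print(to_ascii("it\u2019s a caf\u00e9"))
```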

answered Sep 10, 2015 at 11:31

Try adding the following line at the top of your python script.

# _*_ coding:utf-8 _*_

answered Jan 20, 2016 at 5:08

abnvanand

Python 3.5, 2018

If you don't know what the encoding is but the Unicode parser is having issues, you can open the file in Notepad++ and select Encoding->Convert to ANSI in the top bar. Then you can write your Python like this:

with open('filepath', 'r', encoding='ANSI') as file:
    for word in file.read().split():
        print(word)

answered Oct 9, 2018 at 21:56

Atomar94

Source

The UnicodeEncodeError normally happens when encoding a unicode string into a certain coding. Since codings map only a limited number of unicode characters to str strings, a character that is not represented will cause the coding-specific encode() to fail.

Encoding from unicode to str.

>>> u"a".encode("iso-8859-15")
'a'
>>> u"\u0411".encode("iso-8859-15")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "encodings/iso8859_15.py", line 12, in encode
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0411' in position 0: character maps to <undefined>

>>> u"a\u0411b".encode("iso-8859-15", "replace")
'a?b'
>>> u"a\u0411b".encode("iso-8859-15", "backslashreplace")
'a\\u0411b'
>>> u"a\u0411b".encode("iso-8859-15", "xmlcharrefreplace")
'a&#1041;b'

Paradoxically, a UnicodeEncodeError may happen when _decoding_. The cause of it seems to be the coding-specific decode() functions that normally expect a parameter of type str. It appears that on seeing a unicode parameter, the decode() functions «down-convert» it into str, then decode the result assuming it to be of their own coding. It also appears that the «down-conversion» is performed using the ASCII encoder. Hence an encoding failure inside a decoder.

The choice of the ASCII encoder for «down-conversion» might be considered wise because it is an intersection of all codings. The subsequent decoding may only accept a coding-specific str.

However, unlike a similar issue with UnicodeDecodeError while encoding, there would be no ambiguity if decode() simply returned the unicode argument unmodified. There seems to be no such shortcut in decode() functions as of Python 2.5.

Alternatively, a TypeError exception could always be thrown on receiving a unicode argument in decode() functions. (This would require stream.read() to produce only str for StreamReader.read(). The latter would only produce unicode).

Decoding from str to unicode.

>>> "a".decode("utf-8")
u'a'
>>> "\xd0\x91".decode("utf-8")
u'\u0411'
>>> u"a".decode("utf-8")      # Unexpected argument type.
u'a'
>>> u"\u0411".decode("utf-8") # Unexpected argument type.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "encodings/utf_8.py", line 16, in decode
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0411' in position 0: ordinal not in range(128)

Python 3000 will prohibit decoding of Unicode strings, according to PEP 3137: «encoding always takes a Unicode string and returns a bytes sequence, and decoding always takes a bytes sequence and returns a Unicode string».
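In Python 3 this separation can be checked directly. The snippet is only an illustration of the rule quoted above.

```python
text = "\u0411"                # str: a sequence of code points
data = text.encode("utf-8")    # bytes: the encoded form

# Each type offers only one direction of conversion.
print(hasattr(text, "decode"), hasattr(data, "encode"))
print(data.decode("utf-8") == text)
```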

Source

In Python 2 there are two types for working with text: unicode and str. A unicode object stores characters as Unicode, while a str object is a sequence of bytes in which Python holds text in any other encoding (utf8, cp1251, cp866, koi8-r and so on).

Unicode can be regarded as Python's working encoding, since it is meant to be used inside the script itself, for all kinds of string operations.

The external encoding (a str object) is meant for storing and transmitting text outside the script, for example when saving to a file or sending over a network; that is why this article calls it external. The most widely used encoding in the world is utf8, and the number of applications switching to it grows every day, which is turning it into a de facto standard.

This encoding is good because it stores text compactly and can encode every Unicode character, and therefore almost all of the world's languages (unlike cp1251 and similar single-byte encodings). It is therefore recommended to use utf8 everywhere, including when writing scripts.

Usage

In a Python script, declare the file's encoding at the very top and save the file in that encoding:

# coding: utf8

or

# -*- coding: utf-8 -*-

so that the Python interpreter knows which encoding the file is in.

Strings in a script

Strings in a script are stored as bytes, from quote to quote:

print 'Привет'

= 6 bytes in cp1251

= 12 bytes in utf8

If you prefix the string with u, then when the script runs this byte string is decoded to unicode from the encoding declared at the top of the file:

# coding:utf8
print u'Привет'

and if the actual encoding of the file differs from the declared one, the string may contain "broken characters".

Loading and saving a file

# coding: utf8
# Load a file encoded in utf8
text = open('file.txt','r').read()
# Decode from utf8 to unicode: from the external encoding to the working one
text = text.decode('utf8')
# Work with the text
text += text
# Encode the text from unicode to utf8: from the working encoding to the external one
text = text.encode('utf8')
# Save to a file in utf8
open('file.txt','w').write(text)
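For comparison, the same load, modify and save cycle in Python 3 might look as follows. This is a sketch: the temporary directory and the sample text are assumptions made so that the example is self-contained.

```python
import os
import tempfile

# A scratch file so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "file.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("Привет")

# open() decodes on read, so `text` is already a str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

text += text

# ...and encodes on write; no manual encode()/decode() calls are needed.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

size = os.path.getsize(path)
print(text, size)
```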

Text in a script

# coding: utf8
a = 'Текст в utf8'
b = u'Текст в unicode'
# Equivalent to: b = 'Текст в unicode'.decode('utf8')
# because the script itself is stored in utf8
print 'a =',type(a),a
# decode from utf-8 to unicode, then encode unicode to cp866 (the console encoding of Russian Windows XP)
print 'a2 =',type(a),a.decode('utf8').encode('cp866')
print 'b =',type(b),b

It is best to pass text to print in the working encoding (unicode), or to encode it into the encoding of the OS.

Output of the script when run from the Windows XP console:

a = ╨в╨╡╨║╤Б╤В ╨▓ utf8

a2 = Текст в utf8

b = Текст в unicode

In the last line, print converted unicode to cp866 automatically; see the next section.

Automatic encoding conversion

In some cases Python converts encodings by itself to simplify development; the print example in the previous section is one such case.

In the example below Python itself converts utf8 to unicode, bringing both operands to one encoding so that the strings can be concatenated.

# coding: utf8
# Set the default external encoding to utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
a = 'Текст в utf8'
b = u'Текст в unicode'
c = a + b
print 'a =',type(a),a
print 'b =',type(b),b
print 'c =',type(c),c

Result

a = Текст в utf8

b = Текст в unicode

c = Текст в utf8Текст в unicode

As we can see, the resulting string "c" is unicode. If both strings were of the same kind, no automatic conversion would take place and the result would keep the encoding of its operands.

Automatic conversion usually kicks in when strings of different kinds interact.

An example of automatic encoding conversion in comparisons

# coding: utf8
# Set the default external encoding to utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
print '1. utf8 and unicode', 'true' if u'Слово'.encode('utf8') == u'Слово' else 'false'
print '2. utf8 and cp1251', 'true' if u'Слово'.encode('utf8') == u'Слово'.encode('cp1251') else 'false'
print '3. cp1251 and unicode', 'true' if u'Слово'.encode('cp1251') == u'Слово' else 'false'

Result

1. utf8 and unicode true

2. utf8 and cp1251 false

script.py:10: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

print '3. cp1251 and unicode', 'true' if u'Слово'.encode('cp1251') == u'Слово' else 'false'

3. cp1251 and unicode false

In comparison 1 the utf8 string was converted to unicode and the comparison worked correctly.

In comparison 2 both operands are of the same kind (both external), and since they are encoded in different encodings the condition reports that they are not equal.

In comparison 3 a warning is issued because operands of different kinds are compared (working and external), and the automatic decoding failed: the default external encoding is utf8, and Python could not decode a cp1251 string as utf8.
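Python 3 dropped this automatic conversion altogether. The sketch below shows what happens with the same kind of mixed operations there.

```python
word = "Слово"
encoded = word.encode("utf-8")

# bytes and str never compare equal, whatever the encoding.
equal = (encoded == word)

# Mixing them in concatenation is an error instead of a silent conversion.
concat_error = None
try:
    encoded + word
except TypeError as exc:
    concat_error = str(exc)

print(equal, concat_error)
print(encoded.decode("utf-8") == word)   # explicit decoding is the only way
```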

Printing lists

# coding: utf8
d = ['Тест','списка']
print '1',d
print '2',d.__repr__()
print '3',','.join(d)

Result:

1 ['\xd0\xa2\xd0\xb5\xd1\x81\xd1\x82', '\xd1\x81\xd0\xbf\xd0\xb8\xd1\x81\xd0\xba\xd0\xb0']

2 ['\xd0\xa2\xd0\xb5\xd1\x81\xd1\x82', '\xd1\x81\xd0\xbf\xd0\xb8\xd1\x81\xd0\xba\xd0\xb0']

3 Тест,списка

When a list is printed, __repr__() is called and returns the internal representation of the list, so print 1 and print 2 are equivalent. To print a list correctly, convert it to a string first, as in print 3.
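In Python 3 the situation is different: the repr of a str keeps printable non-ASCII characters, and escapes appear only when they are requested through ascii(). A small sketch:

```python
d = ["Тест", "списка"]

shown = repr(d)      # what print(d) displays in Python 3
escaped = ascii(d)   # ASCII-only form with \u escapes

print(shown)
print(escaped)
print(",".join(d))
```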

Setting the external encoding at startup

PYTHONIOENCODING=utf8 python 1.py

Source

Summary: A UnicodeEncodeError generally occurs while encoding a Unicode string into a specific encoding. Each encoding can represent only a limited set of Unicode characters, so any character that has no mapping in the target encoding causes the encoding to fail and raise UnicodeEncodeError. To avoid this error, encode and decode explicitly with an encoding that covers your data, such as encode('utf-8') and decode('utf-8').

You might be handling application code that deals with multilingual data or web content full of emojis and special symbols. In such situations, you are likely to run into problems related to Unicode data. Python has well-defined options for dealing with Unicode characters, and we discuss them in this article.

What is Unicode?

Unicode is a standard that assigns a unique number (a code point) to every character; encodings such as UTF-8 then store those code points using a variable number of bytes. If you are into the world of computer programming, you have surely heard of ASCII. ASCII represents 128 characters, while the Unicode code space has room for 1,114,112 code points (U+0000 to U+10FFFF). The first 128 Unicode code points are identical to ASCII, so Unicode can be regarded as a superset of ASCII.
For example, U+1F40D is the snake emoji; try chr(0x1F40D) in Python 3 to see it.

What is a UnicodeEncodeError?

The best way to grasp any concept is to visualize it with an example. So let us have a look at an example of the UnicodeEncodeError.

u = 'é'
print("Integer value for é: ", ord(u))
print("Character for code point 233: ", chr(233))
print("UTF-8 representation of é: ", u.encode('utf-8'))
print("ASCII representation of é: ", u.encode('ascii'))

Output

Integer value for é:  233
Character for code point 233:  é
UTF-8 representation of é:  b'\xc3\xa9'
Traceback (most recent call last):
  File "main.py", line 5, in <module>
    print("ASCII representation of é: ", u.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

In the above code, encoding the character é to UTF-8 succeeded, but encoding it to ASCII raised an error. The error occurred because ASCII is a 7-bit encoding and cannot represent characters outside the range 0 to 127, while é has code point 233.
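
The exception object itself carries the details of the failure. The sketch below assumes Python 3; describe_encode_failure is a hypothetical helper name introduced only for illustration.

```python
def describe_encode_failure(text, encoding):
    """Return a short description of why text cannot be encoded, or None."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        # exc.object is the input string; start/end mark the offending slice.
        bad = exc.object[exc.start:exc.end]
        return "{!r} at position {}: {}".format(bad, exc.start, exc.reason)
    return None


ascii_problem = describe_encode_failure('\xe9', 'ascii')
utf8_problem = describe_encode_failure('\xe9', 'utf-8')
print(ascii_problem)
print(utf8_problem)
```

Inspecting start, end, and reason is often more useful in logs than the raw traceback, because it points at the exact character that has no mapping.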

You now have an idea of what the UnicodeEncodeError looks like. Before discussing how to avoid such errors, we need to cover the following concepts:

Encoding and Decoding

Encoding is the process of converting text (a sequence of Unicode characters) into a sequence of bytes in a specified format so that it can be stored or transmitted. Decoding is the opposite: converting the encoded bytes back into text (human-readable form).

In Python,

encode() is a built-in str method used for encoding. If no encoding is specified, UTF-8 is used as the default.
decode() is a built-in bytes method used for decoding; it also defaults to UTF-8.
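
As a minimal sketch of the round trip, assuming Python 3 (the variable names are illustrative):

```python
text = 'Πύθωνος'

# str -> bytes
data = text.encode('utf-8')

# bytes -> str, using the same encoding
restored = data.decode('utf-8')

print(type(data).__name__, len(text), len(data))
print(restored == text)
```

The byte count is larger than the character count because each Greek letter takes two bytes in UTF-8.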

Example:

u = 'Πύθωνος'
print("UTF-8 representation of Πύθωνος: ", u.encode('utf-8'))

Output:

UTF-8 representation of Πύθωνος:  b'\xce\xa0\xcf\x8d\xce\xb8\xcf\x89\xce\xbd\xce\xbf\xcf\x82'

Codepoint

Unicode maps each code point to its respective character. So, what do we mean by a code point?

Code points are numerical values (integers) used to represent characters.
The Unicode code point for é is U+00E9, which is the integer 233. When you encode a character and print the result, Python shows the bytes as hexadecimal escapes rather than in binary (as seen in the examples above).
The byte sequence of a code point differs between encoding schemes. For example, the byte sequence for é in UTF-8 is \xc3\xa9, while in UTF-16 it is \xff\xfe\xe9\x00 (a two-byte byte order mark followed by the character in little-endian order).

Please have a look at the following program to get a better grip on this concept:

u = 'é'
print("INTEGER value for é: ", ord(u))
print("ENCODED Representation of é in UTF-8: ", u.encode('utf-8'))
print("ENCODED Representation of é in UTF-16: ", u.encode('utf-16'))

Output

INTEGER value for é:  233
ENCODED Representation of é in UTF-8:  b'\xc3\xa9'
ENCODED Representation of é in UTF-16:  b'\xff\xfe\xe9\x00'

Now that we have an overview of Unicode and UnicodeEncodeError, let us discuss how we can deal with the error and avoid it in our program.

➥ Problem: Given a string to be written to a text file, how do we avoid the UnicodeEncodeError and write the text successfully?

Example:

f = open('demo.txt', 'w')
f.write('να έχεις μια όμορφη μέρα')
f.close()

Output:

Traceback (most recent call last):
  File "uniError.py", line 2, in <module>
    f.write('να έχεις μια όμορφη μέρα')
  File "C:\Users\Shubham-PC\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>

✨ Solution 1: Encode String Before Writing To File And Decode While Reading

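When a file is opened in text mode without an explicit encoding, Python encodes everything you write using the platform's default encoding (cp1252 in the traceback above). That encoding cannot represent the Greek letters, so a UnicodeEncodeError is raised. One way to avoid this is to encode the string yourself with the encode() method and write the resulting bytes to a file opened in binary mode, as shown in the program below: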

text = u'να έχεις μια όμορφη μέρα'
# write in binary mode, because encode() returns bytes
with open('demo.txt', 'wb') as f:
    f.write(text.encode('utf8'))
# read the raw bytes back and decode them with the same encoding
with open('demo.txt', 'rb') as f:
    print(f.read().decode('utf8'))

Output:

να έχεις μια όμορφη μέρα

✨ Solution 2: Open File In utf-8

If you are using Python 3, all you need to do is pass encoding="utf-8" when opening the file; the text layer then encodes the string for you.

text = 'να έχεις μια όμορφη μέρα'
f = open('demo2.txt', 'w', encoding="utf-8")
f.write(text)
f.close()

Output (contents of demo2.txt):

να έχεις μια όμορφη μέρα
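
A slightly more defensive variant is sketched below. It assumes Python 3 and uses a temporary directory so that the example does not leave files behind; demo_path is a hypothetical location chosen for illustration.

```python
import os
import tempfile

text = 'να έχεις μια όμορφη μέρα'

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo2.txt')

    # The with-statement closes the file even if write() raises.
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Read back with the same explicit encoding.
    with open(demo_path, 'r', encoding='utf-8') as f:
        read_back = f.read()

print(read_back == text)
```

Passing the encoding on both the write and the read keeps the result independent of the platform default.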

✨ Solution 3: Using The Codecs Module

Another approach to deal with the UnicodeEncodeError is using the codecs module.

Let us have a look at the following code to understand how we can use the codecs module:

import codecs

f = codecs.open("demo3.txt", "w", encoding='utf-8')
f.write("να έχεις μια όμορφη μέρα")
f.close()

Output (contents of demo3.txt):

να έχεις μια όμορφη μέρα

✨ Solution 4: Using Python’s unicodecsv Module

If you are dealing with Unicode data and using a csv file for managing your data, then the unicodecsv module can be really helpful. It is a drop-in replacement for Python 2's csv module that handles Unicode strings without any hassle.

Since the unicodecsv module is not a part of Python’s standard library, you have to install it before using it. Use the following command to install this module:

$ pip install unicodecsv

Let us have a look at the following example to get a better grip on the unicodecsv module:

import unicodecsv as csv

with open('example.csv', 'wb') as f:
    writer = csv.writer(f, encoding='utf-8')
    writer.writerow(('English', 'Japanese'))
    writer.writerow((u'Hello', u'こんにちは'))

Output (contents of example.csv):

English,Japanese
Hello,こんにちは
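
On Python 3 the standard csv module works with str directly, so a third-party package may not be needed. The sketch below assumes Python 3; csv_path is a hypothetical temporary location used for illustration.

```python
import csv
import os
import tempfile

rows = [['English', 'Japanese'], ['Hello', 'こんにちは']]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'example.csv')

    # newline='' lets the csv module control line endings itself.
    with open(csv_path, 'w', encoding='utf-8', newline='') as f:
        csv.writer(f).writerows(rows)

    with open(csv_path, 'r', encoding='utf-8', newline='') as f:
        rows_read = list(csv.reader(f))

print(rows_read == rows)
```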

Conclusion

In this article, we discussed some important concepts regarding Unicode characters, then learned about the UnicodeEncodeError, and finally discussed the methods we can use to avoid it. I hope that by the end of this article you can handle Unicode characters in your Python code with ease.

An unknown encoding error might occur when the current console encoding doesn't support the characters in your Python script or when you specify an invalid encoding name. There can be other causes, depending on the details attached to the error message.

This article walks through the common causes and gives several ways to fix the error.

Contents

Why Is the Error Unknown Encoding Occurring?
- The Current Console Encoding Doesn’t Support Your Script’s Characters
- Your Script’s Encoding Type Is Invalid
- The Black Hat Python Example for Python 2
- The ML Kit Fails To Scan
How To Fix the Unknown Encoding Error?
- Choose an Encoding from the Standard List
- Set PYTHONIOENCODING=UTF-8
- Leverage the UTF-8 Support Option in Windows 10 and Onwards
- Let the win-unicode-console Package Serve You
- Replace a Few Lines in the Black Hat Python 2 Example
- Try Using the Newest Version of the ML Kit
Conclusion

Why Is the Error Unknown Encoding Occurring?

The error occurs because the current console encoding doesn't allow you to print your Python script's characters. An invalid encoding name and running a Python 2 example with Python 3 are other possible causes.

– The Current Console Encoding Doesn’t Support Your Script’s Characters

Do you know about the code page system of Windows? It is used to support multiple character sets and languages in the Windows console. Windows versions before 10 have limited support for UTF-8 in the console. Therefore, if you are using one of those versions, the LookupError: unknown encoding error can occur while running a Python script that uses UTF-8 encoding.

– Your Script’s Encoding Type Is Invalid

If you set your script’s encoding type to a type that doesn’t exist in the list of standard encodings, you’ll have the same error posing a hurdle in your program’s execution. For example, if you set the encoding to something like “utff-8” or “ascii-32,” it’ll surely throw an error.
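
A quick way to check whether a name is recognized is to ask Python's codec registry. The sketch below assumes Python 3; is_known_encoding is a hypothetical helper name used only for illustration.

```python
import codecs


def is_known_encoding(name):
    """Return True if Python's codec registry recognizes the encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        # This is the same error type reported as "unknown encoding".
        return False
    return True


for candidate in ('utf-8', 'utff-8', 'ascii-32', 'cp1251'):
    print(candidate, is_known_encoding(candidate))
```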

– The Black Hat Python Example for Python 2

Are you trying to run the Black Hat Python example for Python 2 while using Python 3? You might get the unknown encoding charmap error due to using a different Python version. However, some amendments to the code will do the work. You can read the complete solution below.

– The ML Kit Fails To Scan

The official German medication plan data matrix (BMP) expects the ISO-8859-1 encoded data. Now, if the data contains German umlaut, the ML kit will fail to scan it, throwing an unknown encoding QR code error.

How To Fix the Unknown Encoding Error?

You can fix the given error by confirming the correctness of the encoding type. Other ways that you can try to fix this error include setting the PYTHONIOENCODING to UTF-8, enabling UTF-8 support in Windows version >= 10, using the win-unicode-console package etc.

We have discussed all of the solutions in detail so, stick with us till the end.

– Choose an Encoding from the Standard List

Before you go for more solutions, check if your script’s encoding type is valid and included in the standard list of encodings. Some of the popular encoding types have been added to the following list. Go through it to quickly check the correctness of your script’s encoding type.

646 or us-ascii (English)
IBM037 or IBM039 (English)
437 or IBM437 (English)
utf-8 or cp65001 (All languages)
utf-16 (All languages)
utf-32 (All languages)
utf-7 (All languages)

People also ask whether this error can be fixed online. Online decoding tools can show how a piece of encoded text decodes under a variety of encodings, which may help you identify the right one, but the fix itself still has to be made in your script or environment.

– Set PYTHONIOENCODING=UTF-8

A quick fix is to set PYTHONIOENCODING to UTF-8 in the current terminal session before running your script. If you are using the command prompt, execute set PYTHONIOENCODING=UTF-8. In PowerShell, run $env:PYTHONIOENCODING = "UTF-8" instead. Once you see that the error has disappeared, consider a permanent solution: adding an environment variable.

One option is to create a .env file inside your project's root directory and add PYTHONIOENCODING=UTF-8 to it. Python itself does not read .env files, so this only works if your editor or launcher loads them. Once the variable is set, it overrides the encoding Python uses for stdin, stdout, and stderr.
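
To confirm that the variable takes effect, you can start a child interpreter with it set and ask what encoding it reports for stdout. This is a minimal sketch assuming Python 3.5 or later; child_code and child_env are illustrative names.

```python
import codecs
import os
import subprocess
import sys

# The child process simply reports the encoding of its stdout.
child_code = "import sys; print(sys.stdout.encoding)"

# Copy the current environment and add the variable for the child only.
child_env = dict(os.environ, PYTHONIOENCODING="UTF-8")

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=child_env,
    stdout=subprocess.PIPE,
    check=True,
)
reported = result.stdout.decode("ascii").strip()

# Normalize spelling differences such as "UTF-8" versus "utf-8".
normalized = codecs.lookup(reported).name
print(reported, normalized)
```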

If you don’t want to create a .env file, here is how you can add an environment variable in Windows 10:

Right-click on the Windows icon displayed on your taskbar.
Choose System.
Go for the Advanced tab.
Hit Environment Variables.
Press the button that reads new to add “PYTHONIOENCODING” or hit edit to change its value if it already exists.
Set the value of PYTHONIOENCODING to UTF-8.
Press the button labelled Apply.
Hit OK to end the process.

Note that cp65001 is the Windows code page identifier for UTF-8. The same solution applies to the following errors:

Unknown encoding: cp65001
Unknown encoding UTF 8
LookupError: unknown encoding: cp0

– Leverage the UTF-8 Support Option in Windows 10 and Onwards

The option to enable UTF-8 support in Windows 10 and later versions can help eliminate the error. It will set the locale code page to UTF-8. Hence, once you use it, you won’t fall victim to the encoding issues. Here are the steps that’ll lead you to the stated option and help you leverage it:

Press the Windows and R keys at once.
Open intl.cpl.
Click the tab that says Administrative.
Hit the button that reads Change system locale.
Look for the checkbox labelled Beta: Use Unicode UTF-8 for worldwide language support and put a check on it.
Press OK to save the settings.
Restart your system.

– Let the win-unicode-console Package Serve You

The win-unicode-console package is designed to solve your encoding issues while you run Python from the Windows console. Here is the command to install the said package in your system:

pip install win_unicode_console

Now you can call win_unicode_console.enable() to fix the encoding issues, and win_unicode_console.disable() later if you want to revert the changes made by the package.

Note that the package is meant to be installed and enabled by the end user; library authors should not add it as a dependency of their own packages.

– Replace a Few Lines in the Black Hat Python 2 Example

To run the given example in Python 3 without facing the encoding problem, you’ll have to make a few changes to the code. Please go through the below instructions to modify the code and make it work for you:

Remove charmap from the /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/OpenSSL/_util.py file.
In the get_file_contents() function, replace tree = branch.commit.commit.tree.recurse() with tree = branch.commit.commit.tree.to_tree().recurse().
In the store_module_result() function, replace repo.create_file(remote_path, "Commit message", base64.b64encode(data)) with repo.create_file(remote_path, "Commit message", base64.b64encode(data.encode())).
In the load_module() method, replace exec(self.current_module_code in module.__dict__) with exec(self.current_module_code, module.__dict__).
Replace import Queue with import queue.
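
The data.encode() change in the list above is needed because base64.b64encode() accepts only bytes in Python 3. A minimal sketch, assuming Python 3 and an illustrative payload string:

```python
import base64

data = 'r\xe9sum\xe9'  # illustrative text payload

# Passing str directly fails in Python 3.
try:
    base64.b64encode(data)
    str_rejected = False
except TypeError:
    str_rejected = True

# Encode the text to bytes first, then apply Base64.
encoded = base64.b64encode(data.encode('utf-8'))
decoded = base64.b64decode(encoded).decode('utf-8')

print(str_rejected, decoded == data)
```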

– Try Using the Newest Version of the ML Kit

The failed QR code scanning issue in the ML Kit can only be solved if the library accepts an encoding other than ASCII and UTF-8. If it can't accept any other encoding, then it should at least provide access to the scanned data as a byte array.

As the new versions of libraries come with new features, the latest version of the ML kit might have one or both of the stated features. If it has, it will help you scan the QR codes successfully without throwing any errors.

Conclusion

After reading this post, you can say that the unknown encoding Python error is not as complicated as it seems. The short list below summarizes the solutions:

Ensure that the encoding type is valid before running the command.
Run the command: set PYTHONIOENCODING=UTF-8 in the command prompt without reloading it.
Install the win-unicode-console package to get rid of the encoding problems.
Use the updated version of the ML kit to scan the barcodes or QR codes without facing any errors.
Replace a few lines of code in the Black Hat Python example to make it work with Python 3.

The variety in the solutions is based on different situations in which the encoding issue occurs. If you ask about the most popular working solution, the one setting the environment variable PYTHONIOENCODING wins over others.

Introduction to Python Unicode Error

Python defines Unicode as a string type that facilitates the representation of characters, enabling Python programs to handle a vast array of different characters. A common place where things go wrong is a directory path or link address written as a string literal. When we pass such a string to a function, an error may occur; this is known as a Unicode error in Python. It arises because the characters following a Unicode escape sequence ("\u" or "\U") must form a valid escape, which is a typical problem with Windows paths such as "C:\Users\name".
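
The sketch below reproduces the problem and shows two common ways around it, assuming Python 3. The path is hypothetical, and compile() is used only so that the failing literal can be demonstrated without breaking the example itself.

```python
# Source text containing an unescaped Windows path inside a normal literal.
bad_source = 'path = "C:\\Users\\name\\data.txt"'

try:
    compile(bad_source, '<demo>', 'exec')
    syntax_error = None
except SyntaxError as exc:
    # "\U" starts a Unicode escape, and "sers\n..." is not a valid one.
    syntax_error = exc

# Fix 1: a raw string keeps backslashes as they are.
raw_path = r"C:\Users\name\data.txt"

# Fix 2: double every backslash.
escaped_path = "C:\\Users\\name\\data.txt"

print(syntax_error is not None, raw_path == escaped_path)
```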

Working of Unicode Error in Python with Examples

The Unicode standard represents every character as a code point. The standard exists to avoid ambiguity between characters that could otherwise be confused. For example, consider the Roman numeral one, "Ⅰ". It looks the same as the capital letter "I", but they are two different characters with different meanings, and Unicode gives each its own code point (U+2160 and U+0049).

Python has two error types related to Unicode: UnicodeEncodeError and UnicodeDecodeError. It also includes the concept of Unicode error handlers, which are invoked whenever a problem occurs while encoding or decoding a string. To include Unicode characters in a Python program, we can put the prefix u before a string literal (which makes it a Unicode-type value) or use the \u escape sequence inside the literal.

Syntax:

Unicode literals in Python programs can be written as follows:

u"dfskgfkdsg"

U"sakjhdxhj"

"\u1232hgdsa"

The above shows three ways of writing Unicode literals. We can write a literal with the prefix "u" or "U" followed by a string containing letters and digits, as in the first two examples. As the last example shows, we can also use the "\u" escape sequence, which takes four hex digits, to specify a character by its code point. To specify a character by a hex value below 256, we use the "\x" escape sequence, which takes two hex digits; an octal escape takes up to three octal digits, with a maximum of \777.
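
As a small illustration of these escape forms, the sketch below writes the same character, é (code point 233), in several ways. It assumes Python 3, and the variable names are illustrative.

```python
by_hex = '\xe9'            # two hex digits
by_u = '\u00e9'            # four hex digits
by_big_u = '\U000000e9'    # eight hex digits
by_octal = '\351'          # up to three octal digits (0o351 == 233)
by_name = '\N{LATIN SMALL LETTER E WITH ACUTE}'

variants = [by_hex, by_u, by_big_u, by_octal, by_name]
all_same = all(v == chr(233) for v in variants)
print(all_same)
```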

Example #1

Now let us see an example below for declaring Unicode characters in the program.

Code:

#!/usr/bin/env python
# -*- coding: latin-1 -*-
a = u'dfsf\xac\u1234'
print("The value of the above unicode literal is as follows:")
print(ord(a[-1]))

Output:

The value of the above unicode literal is as follows:
4660

In the above program, we can see a sample Unicode literal. The last character of the literal is \u1234, so ord() prints 4660 (hex 1234). Before that, we declare the source encoding in the first two lines of the program; the default source encoding differs between Python versions (ASCII in Python 2, UTF-8 in Python 3).

Now we'll see that Unicode errors, such as encoding and decoding failures, are raised as soon as the problem occurs. Python has three commonly used Unicode error handlers: strict, ignore, and replace.

The strict handler, which is the default, raises UnicodeEncodeError and UnicodeDecodeError for encoding and decoding failures, respectively.
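
The difference between the handlers can be sketched as follows, assuming Python 3 and the errors parameter of str.encode() and bytes.decode(); the variable names are illustrative.

```python
text = 'caf\xe9'  # 'café'

# strict (the default) raises on the first character that cannot be encoded.
try:
    text.encode('ascii', errors='strict')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# ignore drops the character; replace substitutes a question mark.
ignored = text.encode('ascii', errors='ignore')
replaced = text.encode('ascii', errors='replace')

# When decoding, replace inserts U+FFFD for each undecodable byte.
decoded = b'caf\xc3\xa9'.decode('ascii', errors='replace')

print(strict_failed, ignored, replaced, decoded)
```

Both ignore and replace lose information, so they suit display or logging better than data that must round-trip.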

Example #2

UnicodeEncodeError demonstration and its example.

In Python 2, str() cannot handle non-ASCII Unicode characters on its own, so it raises an encoding error when it cannot encode the given Unicode string.

Code:

str(u'éducba')

Output:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

In the above program, we passed a Unicode string as the argument to the str() function. In Python 2, this function implicitly encodes its argument with the default encoding, ASCII. ASCII is a 7-bit encoding and cannot represent characters beyond the range 0 to 127, so the call fails with the UnicodeEncodeError shown above.

The above program can be fixed by manually encoding the Unicode string, for example with .encode('utf8'), before providing it to the str() function.

Example #3

In this program, we encode the Unicode string before calling the str() function explicitly, which avoids the UnicodeEncodeError.

Code:

a = u'café'
b = a.encode('utf8')
r = str(b)
print("The unicode string after fixing the UnicodeEncodeError is as follows:")
print(r)

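Output (Python 2, on a UTF-8 terminal):

The unicode string after fixing the UnicodeEncodeError is as follows:
café

The above shows how to avoid the UnicodeEncodeError manually by applying .encode('utf8') to the Unicode string.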

Example #4

Now we will see the UnicodeDecodeError demonstration and its example and how to avoid it.

Code:

a = u'éducba'
b = a.encode('utf8')
unicode(b)

Output:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

In the above program, we first encode the Unicode string to UTF-8 bytes and then try to convert the encoded string back into Unicode with unicode(). In Python 2, unicode() decodes with the default ASCII codec, so we get a UnicodeDecodeError when we run the program. To avoid this error, we have to decode the byte string b manually with the correct encoding:

b.decode('utf8')
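
The same failure and fix can be sketched in Python 3, where unicode() no longer exists and decoding is always explicit. The variable names are illustrative.

```python
data = '\xe9ducba'.encode('utf-8')  # UTF-8 bytes for 'éducba'

# Decoding with the wrong codec raises UnicodeDecodeError.
try:
    data.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = (exc.start, exc.reason)

# Decoding with the codec that produced the bytes restores the text.
restored = data.decode('utf-8')

print(failure, restored)
```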

Conclusion

In this article, we saw that Unicode literals in Python are a string type for representing text in any script. We also saw how to fix Unicode encode and decode errors manually by encoding or decoding the string explicitly before passing it to a function.
