I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use unicode, as per the job specs, but I am baffled. (And quite possibly doing it all wrong.)
I open the CSV using:
ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')
Then, I attempt to encode it with:
name=school_name.encode('utf-8'), street=row[9].encode('utf-8'), city=row[10].encode('utf-8'), state=row[11].encode('utf-8'), zip5=row[12], zip4=row[13],county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
I’m encoding everything except the lat and lng because those need to be sent out to an API. When I run the program to parse the dataset into what I can use, I get the following Traceback.
Traceback (most recent call last):
  File "push_into_db.py", line 80, in <module>
    main()
  File "push_into_db.py", line 74, in main
    district_map = buildDistrictSchoolMap()
  File "push_into_db.py", line 32, in buildDistrictSchoolMap
    county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)
I think I should tell you that I’m using Python 2.7.2, and this is part of an app built on Django 1.4. I’ve read several posts on this topic, but none of them seems to apply directly. Any help will be greatly appreciated.
You might also want to know that some of the non-standard characters causing the issue are Ñ and possibly É.
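One possible reading of the traceback: in Python 2, csv.reader yields byte strings, and calling .encode('utf-8') on a byte string first decodes it implicitly with the ascii codec, which fails on byte 0xd1. A minimal sketch of the usual remedy, decoding the bytes to text with an explicit codec, follows. It is written for Python 3 so that it runs as-is. The helper name to_text, the sample row, and the fallback to latin-1 (where 0xd1 is Ñ) are assumptions; the question does not say how the file was actually encoded.

```python
def to_text(cell, encodings=("utf-8", "latin-1")):
    """Decode a bytes cell to text, trying each codec in order (hypothetical helper)."""
    if isinstance(cell, str):
        return cell  # already text, nothing to decode
    for encoding in encodings:
        try:
            return cell.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next codec
    raise ValueError("none of %r could decode %r" % (encodings, cell))


# Byte 0xd1 in position 2, as in the traceback: invalid as UTF-8 here, 'Ñ' in Latin-1.
row = [b"CA\xd1ON CITY", b"CO", "81212"]
decoded = [to_text(cell) for cell in row]
print(decoded)
```

Once every field is text, it only needs to be encoded again at the point where bytes are actually required, for example when writing to a file or a socket.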
Conversion errors
When converting between strings and bytes, it is very important to know exactly which encoding is in use and what each encoding is able to represent.
For example, the ASCII encoding cannot convert Cyrillic text to bytes:
In [32]: hi_unicode = 'привет'

In [33]: hi_unicode.encode('ascii')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-33-ec69c9fd2dae> in <module>()
----> 1 hi_unicode.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
Similarly, if the string «привет» has been converted to bytes and we try to convert those bytes back to a string using ascii, we also get an error:
In [34]: hi_unicode = 'привет'

In [35]: hi_bytes = hi_unicode.encode('utf-8')

In [36]: hi_bytes.decode('ascii')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-36-aa0ada5e44e9> in <module>()
----> 1 hi_bytes.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
Another kind of error occurs when different encodings are used for the two conversions:
In [37]: de_hi_unicode = 'grüezi'

In [38]: utf_16 = de_hi_unicode.encode('utf-16')

In [39]: utf_16.decode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-39-4b4c731e69e4> in <module>()
----> 1 utf_16.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Getting an error is a good thing: it says explicitly what the problem is.
It is worse when the result looks like this:
In [40]: hi_unicode = 'привет'

In [41]: hi_bytes = hi_unicode.encode('utf-8')

In [42]: hi_bytes
Out[42]: b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

In [43]: hi_bytes.decode('utf-16')
Out[43]: '뿐胑룐닐뗐苑'
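Silent mis-decoding of this kind is not specific to utf-16. The sketch below, with illustrative variable names, decodes UTF-8 bytes as Latin-1: because Latin-1 assigns a character to every byte value, no exception is raised and the result is simply garbled. In this particular case the damage happens to be reversible, which is an assumption that does not hold for every pair of codecs.

```python
original = "привет"
utf8_bytes = original.encode("utf-8")

# Latin-1 maps every byte to a character, so this never raises; it just yields mojibake.
garbled = utf8_bytes.decode("latin-1")
print(repr(garbled))

# Latin-1 is lossless byte-for-byte, so here the original text can be recovered.
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered)
```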
Error handling
The encode and decode methods have error-handling modes that specify how to react to a conversion error.
The errors parameter in encode
By default, encode uses the strict mode: when an encoding error occurs, a UnicodeError exception is raised. Examples of this behavior were shown above.
Instead of this mode, you can use replace to substitute a question mark for the character:
In [44]: de_hi_unicode = 'grüezi'

In [45]: de_hi_unicode.encode('ascii', 'replace')
Out[45]: b'gr?ezi'
Or namereplace, to substitute the character’s name:
In [46]: de_hi_unicode = 'grüezi'

In [47]: de_hi_unicode.encode('ascii', 'namereplace')
Out[47]: b'gr\\N{LATIN SMALL LETTER U WITH DIAERESIS}ezi'
In addition, you can completely ignore characters that cannot be encoded:
In [48]: de_hi_unicode = 'grüezi'

In [49]: de_hi_unicode.encode('ascii', 'ignore')
Out[49]: b'grezi'
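The three modes shown above are not the only ones. As a small sketch, the loop below runs the same string through the standard error handlers accepted by str.encode in Python 3, including xmlcharrefreplace and backslashreplace, which the text does not mention. The names text and results are illustrative.

```python
text = "grüezi"
handlers = ("replace", "ignore", "namereplace", "xmlcharrefreplace", "backslashreplace")

# Encode the same text to ASCII once per handler and collect the results.
results = {handler: text.encode("ascii", handler) for handler in handlers}

for handler in handlers:
    print(handler, results[handler])
```

Handlers such as xmlcharrefreplace and backslashreplace keep enough information to reconstruct the original character later, whereas replace and ignore do not.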
The errors parameter in decode
The decode method also uses the strict mode by default and raises a UnicodeDecodeError exception.
If you change the mode to ignore, then, as with encode, the bytes that cannot be decoded are simply skipped:
In [50]: de_hi_unicode = 'grüezi'

In [51]: de_hi_utf8 = de_hi_unicode.encode('utf-8')

In [52]: de_hi_utf8
Out[52]: b'gr\xc3\xbcezi'

In [53]: de_hi_utf8.decode('ascii', 'ignore')
Out[53]: 'grezi'
The replace mode substitutes the replacement character (U+FFFD) for each of them:
In [54]: de_hi_unicode = 'grüezi'

In [55]: de_hi_utf8 = de_hi_unicode.encode('utf-8')

In [56]: de_hi_utf8.decode('ascii', 'replace')
Out[56]: 'gr��ezi'
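For comparison, here is a short sketch that applies three handlers to the same bytes. The backslashreplace handler is not covered in the text; it is supported for decoding in Python 3.5 and later and keeps the offending bytes visible as escape sequences. Variable names are illustrative.

```python
data = "grüezi".encode("utf-8")  # b'gr\xc3\xbcezi'

ignored = data.decode("ascii", "ignore")             # undecodable bytes are dropped
replaced = data.decode("ascii", "replace")           # each bad byte becomes U+FFFD
escaped = data.decode("ascii", "backslashreplace")   # each bad byte becomes a \xNN escape

print(repr(ignored), repr(replaced), repr(escaped))
```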
The following error occasionally shows up on our server:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u200e' in position 13: ordinal not in range(128)
Error: ordinal not in range(128)
Cause: this is an encoding problem in Python that typically shows up with non-ASCII text such as Chinese; here it is triggered by the character \u200e.
\u200e is a control character, the left-to-right mark. It is not a space: it is a completely invisible, zero-width character that we normally do not see on web pages.
It belongs to the Unicode format control characters, such as the right-to-left mark (\u200F), the left-to-right mark (\u200E), the zero-width joiner (\u200D) and the zero-width no-break space (\uFEFF). These control the visual presentation of text, which matters for displaying some non-English text correctly.
Solution (Python 2): add the following block of statements at the top of the Python file in which the error occurs.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
If adding the block above causes the print function to stop working, replace it with the following block:
import sys  # this only binds a reference to sys; the module is reloaded only by reload()
stdi, stdo, stde = sys.stdin, sys.stdout, sys.stderr
reload(sys)  # setdefaultencoding is removed from sys at interpreter startup, so sys must be reloaded once to get it back
sys.stdin, sys.stdout, sys.stderr = stdi, stdo, stde
sys.setdefaultencoding('utf-8')
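Changing the default encoding affects every implicit conversion in the process. A different approach, offered only as a sketch and not as the author's solution, is to remove invisible format-control characters before the text is encoded to ASCII. The example is written for Python 3; strip_format_chars is a hypothetical helper name and the sample string is made up.

```python
import unicodedata


def strip_format_chars(text):
    """Drop characters in Unicode category Cf, i.e. format controls (hypothetical helper)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# U+200E, U+200F, U+200D and U+FEFF are all invisible format-control characters.
sample = "price\u200e: 10\u200f\u200d\ufeff"
cleaned = strip_format_chars(sample)
print(repr(cleaned))
```

This only helps when the invisible characters carry no meaning for the application; text that relies on them for correct display should be kept intact and encoded as UTF-8 instead.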
Overview
Example errors:
Traceback (most recent call last):
  File "unicode_ex.py", line 3, in <module>
    print str(a) # this throws an exception
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa1' in position 0: ordinal not in range(128)
This issue happens when Python can’t correctly work with a string variable.
Strings can contain any sequence of bytes, but when Python is asked to work with the string, it may decide that the string contains invalid bytes.
In these situations, an error is often thrown that mentions ordinal not in range, or codec can't encode character, or codec can't decode byte.
Here’s a bit of code that may reproduce the error in Python 2:
a = '\xa1'
print(a + ' <= problem')
unicode(a)  # raises UnicodeDecodeError under the default ascii codec
Initial Steps Overview
- Check Python version
- Determine codec and character
- Determine desired codec
Detailed Steps
1) Check Python version
The Python version you are using is significant.
You can determine the Python version by running:
python --version
or, if you have access to the running code, by logging it:
import sys
print(sys.version)
The major number (2 or 3) is the number you are interested in.
It is expected that you are using Python 2.
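If the check needs to happen inside the program rather than in a log line, a minimal sketch such as the following reads the major version directly. The messages printed are illustrative only.

```python
import sys

major = sys.version_info.major  # 2 or 3

if major == 2:
    print("Python 2: str holds bytes, unicode holds text")
else:
    print("Python %d: bytes holds bytes, str holds text" % major)
```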
2) Determine interpreting codec and character
Get this from the error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'xa1' in position 0: ordinal not in range(128)
In this case, the codec is ascii and the character is the hex character A1.
What is happening here is that Python is trying to interpret a string, and expects that the bytes in that string are legal for the format it’s expecting. In this case, it’s expecting a string composed of ASCII bytes. These bytes are in the range 0-127 (ie 7 bits). The hex byte A1 is 161 in decimal, and is therefore out of range.
When Python comes to interpret this string in a context that requires a codec (for example, when calling the unicode function), it tries to encode or decode it with that codec, and can hit this problem.
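To see which bytes fall outside the ASCII range before any codec is involved, a small sketch like this can help. The function name non_ascii_bytes is hypothetical; bytearray is used so the same code iterates over integers on both Python 2 and Python 3.

```python
def non_ascii_bytes(data):
    """Return (position, hex value) for every byte above 127 (hypothetical helper)."""
    return [(pos, hex(byte)) for pos, byte in enumerate(bytearray(data)) if byte > 127]


# 0xa1 is 161 in decimal, which is outside the 0-127 range that ascii accepts.
print(non_ascii_bytes(b"\xa1hola"))
```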
3) Determine desired codec
You need to figure out how the bytes should be interpreted.
Most often in everyday use (eg web scraping or document ingestion), this is utf-8.
Once you have determined the desired codec, solution A may help you.
Solutions List
A) Decode the string
Solutions Detail
A) Decode the string
If you have a string s that you want to interpret as utf-8 data, you can try:
s = s.decode('utf-8')
to decode its bytes into a unicode string with the appropriate codec.
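The choice of codec matters because the same character can be stored as different bytes. The sketch below, written for Python 3 with made-up sample data, decodes the character ¡ from both its UTF-8 and its Latin-1 form, and shows that picking the wrong codec for the Latin-1 bytes raises an error.

```python
raw_utf8 = b"\xc2\xa1hola"   # the text '¡hola' encoded as UTF-8
raw_latin1 = b"\xa1hola"     # the same text encoded as Latin-1

text_a = raw_utf8.decode("utf-8")
text_b = raw_latin1.decode("latin-1")

# Decoding the Latin-1 bytes as UTF-8 fails: 0xa1 cannot start a UTF-8 sequence.
try:
    raw_latin1.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(text_a == text_b, wrong_codec_failed)
```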
Further Information
Owner
Ian Miell
In this article, we will learn how to resolve the UnicodeDecodeError that occurs during the execution of the code. We will look at the different reasons that cause this error.
We will also find ways to resolve this error in Python. Let’s begin with what the UnicodeDecodeError is in Python.
Unicode Decode Error in Python
If you are facing a recurring UnicodeDecodeError and are unsure of why it is happening or how to resolve it, this is the article for you.
In this article, we go in-depth about why this error comes up and a simple approach to resolving it.
Causes of Unicode Decode Error in Python
In Python, the UnicodeDecodeError comes up when we use one kind of codec to try to decode bytes that weren’t encoded using that codec. To be more specific, let’s understand this problem with the help of a lock and key analogy.
Suppose we created a lock that can only be opened using a unique key made specifically for that lock.
What happens when you try to open this lock with a key that wasn’t made for it? It wouldn’t fit. In the same way, bytes written with one codec cannot be reliably read back with a different one.
Let’s create the file example.txt with the following contents.
𝘈Ḇ𝖢𝕯٤ḞԍНǏ
hello world
Let’s attempt to decode this file with the ascii codec using the following code.
Example 1:
with open('example.txt', 'r', encoding='ascii') as f:
    lines = f.readlines()
    print(lines)
The output of the code:
Traceback (most recent call last):
  File "/home/fatina/PycharmProjects/examples/main.py", line 2, in <module>
    lines = f.readlines()
  File "/usr/lib/python3.10/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)
Let’s look at another more straightforward example of what happens when you encode a string using one codec and decode using a different one.
Example 2:
string = '𝘈Ḇ𝖢𝕯٤ḞԍНǏ'
encoded_string = string.encode('utf-8')
decoded_string = encoded_string.decode('ascii')
print(decoded_string)
In this example, we have a string encoded using the utf-8 codec, and in the following line, we try to decode the resulting bytes using the ascii codec.
The output of the code:
Traceback (most recent call last):
  File "/home/fatina/PycharmProjects/examples/main.py", line 4, in <module>
    decoded_string = encoded_string.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)
This happens because the contents of the file in example 1 and the string in example 2 were not encoded using the ascii codec, but we tried decoding these bytes using it. This results in the UnicodeDecodeError.
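The exception object itself carries the details printed in the traceback, which can be useful when logging. A minimal sketch, assuming Python 3 and using an illustrative string whose first character needs four bytes in UTF-8:

```python
data = "\U0001D608 hello".encode("utf-8")  # first character is encoded as four bytes

try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    # encoding: codec that failed; start: index of the bad byte;
    # object: the bytes being decoded; reason: the message after the colon.
    details = (err.encoding, err.start, hex(err.object[err.start]), err.reason)
    print(details)
```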
How to Solve the Unicode Decode Error in Python
Resolving this issue is rather straightforward. If we explore Python’s documentation, we will see several standard codecs available to help you decode bytes.
So if we replace ascii with the utf-8 codec in the example code above, it successfully decodes the bytes in example.txt.
Example code:
with open('example.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    print(lines)
The output of the code:
['𝘈Ḇ𝖢𝕯٤ḞԍНǏ\n', 'hello world']
As for the second example, you need only to do the same thing.
Example code:
string = '𝘈Ḇ𝖢𝕯٤ḞԍНǏ'
encoded_string = string.encode('utf-8')
decoded_string = encoded_string.decode('utf-8')
print(decoded_string)
The output of the code:
𝘈Ḇ𝖢𝕯٤ḞԍНǏ
It is important to mention that sometimes a byte sequence cannot be completely decoded using one codec.
So if the need arises, you can have your program ignore any bytes that it cannot decode by adding the errors='ignore' argument like this:
with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()
    print(lines)
While this will skip any errors the decoder encounters while decoding some characters, it is important to mention that this can result in data loss.
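To make the data loss concrete, here is a small sketch that writes Latin-1 bytes to a temporary file and reads them back as UTF-8 with two different error handlers. The file and its contents are made up for illustration, and errors='replace' is shown as an alternative not discussed above.

```python
import os
import tempfile

# Write Latin-1 bytes, then read them back as UTF-8 with different error handlers.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("café\n".encode("latin-1"))  # b'caf\xe9\n', which is not valid UTF-8
    path = tmp.name

try:
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        ignored = f.read()   # the undecodable byte silently vanishes
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        replaced = f.read()  # the byte becomes U+FFFD, so the loss stays visible
finally:
    os.remove(path)

print(repr(ignored), repr(replaced))
```

With ignore the reader has no sign that a character is missing, while replace leaves a marker at the position where decoding failed.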
We hope you find this article helpful in understanding how to resolve the UnicodeDecodeError in Python.