Why is the below item failing? Why does it succeed with «latin-1» codec?
o = "a test of \xe9 char"  # I want this to remain a string, as this is what I am receiving
v = o.decode("utf-8")
Which results in:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte
asked Apr 5, 2011 at 13:23
I had the same error when I tried to open a CSV file with the pandas.read_csv method.
The solution was to change the encoding to latin-1:
pd.read_csv('ml-100k/u.item', sep='|', names=m_cols , encoding='latin-1')
answered Jul 18, 2015 at 15:33
Mazen Aly
In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:
>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'
But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in Latin-1. You can see how UTF-8 and Latin-1 look different:
>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'
(Note, I’m using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
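The bit-pattern argument above can be checked directly in Python 3. The sketch below is only an illustration: the helper names utf8_expected_length and is_continuation are made up for this example, and the byte string is the one from the question.

```python
def utf8_expected_length(lead_byte: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte (0 if it cannot start one)."""
    if lead_byte < 0x80:            # 0xxxxxxx: single-byte ASCII
        return 1
    if 0xC2 <= lead_byte <= 0xDF:   # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead_byte <= 0xEF:   # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead_byte <= 0xF4:   # 11110xxx: four-byte sequence
        return 4
    return 0                        # continuation byte or byte never valid in UTF-8


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


data = b"a test of \xe9 char"
lead, following = data[10], data[11]

print(format(lead, "08b"))           # bit pattern of 0xE9
print(utf8_expected_length(lead))    # how many bytes UTF-8 expects here
print(is_continuation(following))    # the next byte is a space, not 10xxxxxx

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc                      # keep the exception for inspection

print(error.start, error.reason)
print(data.decode("latin-1"))        # Latin-1 maps 0xE9 straight to é
```

The decoder stops at the lead byte because the byte after it does not have the continuation form, which is exactly what the error message reports.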
answered Apr 5, 2011 at 13:29
Josh Lee
It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.
If you don’t know the codeset you’re receiving strings in, you’re in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) would be chosen for your protocol/application and then you’d just reject ones that didn’t decode.
If you can’t do that, you’ll need heuristics.
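One possible heuristic, sketched here under the assumption that the data is either UTF-8 or a Western European single-byte encoding, is to try strict decoders in order of preference. The function name decode_with_fallback and the candidate list are illustrative choices, not part of any library.

```python
def decode_with_fallback(data: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order and return (text, encoding_used).

    UTF-8 goes first because it rejects most non-UTF-8 input. Latin-1 goes
    last because it accepts every byte and therefore never fails.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback(b"a test of \xe9 char")
print(used, text)
```

A successful decode only shows that the bytes are valid in that encoding, not that the encoding is the one the sender used, so the order of the candidates matters.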
answered Apr 5, 2011 at 13:26
Because UTF-8 is a multibyte encoding, and there is no character corresponding to the combination of \xe9 plus the following space.
Why should it succeed in both utf-8 and latin-1?
Here is how the same sentence should look in utf-8:
>>> o.decode('latin-1').encode("utf-8")
'a test of \xc3\xa9 char'
answered Apr 5, 2011 at 13:28
neurino
If this error arises when manipulating a file that was just opened, check whether you opened it in 'rb' mode.
answered Jul 4, 2018 at 23:09
Use this if pandas raises a UTF-8 decoding error:
pd.read_csv('File_name.csv',encoding='latin-1')
answered Apr 14, 2020 at 7:21
A UTF-8 decode error can only be triggered by bytes with values above 127; bytes 0 to 127 are plain ASCII and always decode.
The rules below describe the ASCII encoding and show why values beyond that range need another encoding:
1) If the code point is < 128, each byte is the same as the value of the code point.
2) If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
To go beyond this range there are other encodings; one of the most widely used is Latin-1, also known as ISO-8859-1.
Unicode code points 0–255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can’t be encoded into Latin-1.
When this exception occurs while you are loading a data set, try passing the encoding explicitly:
df=pd.read_csv("top50.csv",encoding='ISO-8859-1')
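Adding the encoding argument at the end of the call lets the data set load. Latin-1 maps every byte to a character, so it never raises, but the text is only correct if the file really is Latin-1.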
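The relationship between Latin-1 bytes and code points described above can be verified with a few lines. This is a small illustrative check; the variable names are arbitrary.

```python
# Every byte value 0..255 decodes in Latin-1 to the code point with the same number.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in text]
round_trip = text.encode("latin-1")

# A code point above 255 cannot be encoded into Latin-1.
try:
    "€".encode("latin-1")   # U+20AC
    euro_error = None
except UnicodeEncodeError as exc:
    euro_error = exc

# Decoding UTF-8 bytes as Latin-1 never fails, but it yields the wrong characters.
mojibake = "é".encode("utf-8").decode("latin-1")
print(mojibake)
```

The last line is the reason to treat encoding='latin-1' as a fix only when the data really is Latin-1: the call succeeds either way.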
answered Jan 18, 2020 at 14:37
surya
This type of error appears when you load a file into pandas, for example:
data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv')
The error is then displayed like this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte
You can avoid this error by adding an encoding argument:
data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')
answered Jun 26, 2020 at 17:59
This happened to me too, while I was reading text containing Hebrew from a .txt file.
I clicked File -> Save As and saved the file with UTF-8 encoding.
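The same "save as UTF-8" step can also be done in code once the original encoding is known. The sketch below assumes, purely for illustration, that the Hebrew file was saved as Windows-1255 (cp1255); the function name convert_to_utf8 and the file names are hypothetical.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, source_encoding):
    """Read src_path using source_encoding and write the same text to dst_path as UTF-8."""
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


with tempfile.TemporaryDirectory() as tmp:
    legacy = os.path.join(tmp, "legacy.txt")
    converted = os.path.join(tmp, "converted.txt")

    # Simulate a file saved by an editor that used a legacy Hebrew code page.
    with open(legacy, "wb") as f:
        f.write("שלום".encode("cp1255"))

    convert_to_utf8(legacy, converted, "cp1255")

    with open(converted, encoding="utf-8") as f:
        result = f.read()

print(result)
```

If the source encoding is guessed wrong, the conversion may still succeed and write garbled text, so it is worth checking the output by eye.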
answered Feb 21, 2019 at 7:53
TLDR: I would recommend investigating the source of the problem in depth before switching encoders to silence the error.
I got this error as I was processing a large number of zip files with additional zip files in them.
My workflow was the following:
- Read zip
- Read child zip
- Read text from child zip
At some point I hit the encoding error above. Upon closer inspection, it turned out that some child zips erroneously contained further zips. Reading these zips as text led to some funky character representation that I could silence with encoding="latin-1", but which in turn caused issues further down the line. Since I was working with international data, it was not completely foolish to assume it was an encoding problem (I had problems with 0xc2: Â), but in the end that was not the actual issue.
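A guard against this situation might look like the sketch below, which refuses to decode a payload that is itself a ZIP archive. The function name decode_text_member is hypothetical, and the nested archive is built in memory only to make the example self-contained.

```python
import io
import zipfile


def decode_text_member(data: bytes, encoding: str = "utf-8") -> str:
    """Decode data as text, but refuse payloads that are really ZIP archives."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        raise ValueError("payload is a ZIP archive, not text")
    return data.decode(encoding)


# Build a small archive in memory to stand in for an unexpected nested zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("notes.txt", "héllo")
nested_zip_bytes = buffer.getvalue()

print(decode_text_member("héllo".encode("utf-8")))

try:
    decode_text_member(nested_zip_bytes)
    outcome = "decoded"
except ValueError as exc:
    outcome = str(exc)
print(outcome)
```

Failing loudly here keeps binary data from being turned into plausible-looking text by a permissive encoding such as latin-1.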
answered Apr 17, 2022 at 10:32
In my case, I was running a .py script that executes a path/file.sql.
My solution was to change the encoding of file.sql to «UTF-8 without BOM», and it works!
You can do it with Notepad++.
I will leave a part of my code.
import sys
import psycopg2

con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2], dbname=sys.argv[3],
                       user=sys.argv[4], password=sys.argv[5])
cursor = con.cursor()
sqlfile = open(path, 'r')  # path: location of the .sql file (defined elsewhere)
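If re-saving the file is not an option, Python's utf-8-sig codec can tolerate a leading byte order mark when reading. The sketch below only covers the BOM part of the problem and assumes the rest of the file is valid UTF-8; the SQL text is a placeholder.

```python
import codecs

# Bytes as an editor would write them when saving "UTF-8 with BOM".
raw = codecs.BOM_UTF8 + "SELECT 1;".encode("utf-8")

with_bom = raw.decode("utf-8")          # the BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")   # the BOM is stripped if present

print(repr(with_bom[0]))
print(without_bom)
```

The same codec name can be passed to open(), for example open(path, 'r', encoding='utf-8-sig'), and it also reads files that have no BOM.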
answered Jun 19, 2019 at 21:26
I encountered this problem, and it turned out that I had saved my CSV directly from a Google Sheets file. In other words, I was in a Google Sheets file, chose "save a copy", and when my browser downloaded it, I chose Open. Then I saved the CSV directly from there. This was the wrong move.
What fixed it for me was first saving the sheet as an .xlsx file on my local computer, and from there exporting a single sheet as .csv. Then the error went away for pd.read_csv('myfile.csv')
answered Sep 26, 2022 at 19:21
Nesha25
The solution was to change the encoding to «UTF-8 without BOM» («UTF-8 sin BOM» in the Spanish menu).
answered Jun 2, 2021 at 21:06
One error that you might encounter when working with Python is:
UnicodeDecodeError: invalid continuation byte
This error occurs when you try to decode a bytes object with an encoding that cannot interpret one of its byte sequences, typically because the data was written with a different encoding.
This tutorial shows an example that causes this error and how to fix it.
How to reproduce this error
Suppose you have a bytes object in your Python code as follows:
bytes_obj = b"\xe1 b c"
Next, you want to decode the bytes object using the utf-8 encoding like this:
str_obj = bytes_obj.decode('utf-8')
Output:
Traceback (most recent call last):
  File "main.py", line 3, in <module>
    str_obj = bytes_obj.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: invalid continuation byte
You get an error because the byte \xe1 in the bytes object is the á character encoded using the latin-1 encoding.
How to fix this error
To resolve this error, you need to change the encoding used in the decode() method to latin-1 as follows:
bytes_obj = b"\xe1 b c"
str_obj = bytes_obj.decode('latin-1')
print(str_obj)  # á b c
Note that this time the decode() method runs without any error.
You can also get this error when running other methods such as the pandas read_csv() method.
You need to specify the encoding used by the method as follows:
pd.read_csv('example.csv', encoding='latin-1')
The same also works when you use the open() function to work with files:
csv_file = open('example.csv', encoding='latin-1')
# or:
with open('example.csv', encoding='latin-1') as file:
    content = file.read()
If you only want to read the files without decoding the content, you can use the open() function in rb (read binary) mode.
Here’s an example when you parse an HTML file using Beautiful Soup:
soup = BeautifulSoup(open('index.html', 'rb'), 'html.parser')
print(soup.get_text())
When you decode the bytes object, you need to use the encoding that the data was written in.
If you don’t want to decode the data when opening a file, you need to specify the open mode as rb or wb to read and write in binary mode.
I hope this tutorial helps. See you in other tutorials! 👍
The «UnicodeDecodeError: invalid continuation byte» error in Python is usually raised when the bytes being processed are not valid in the encoding used to decode them. This error can occur when reading data from a file or from a database, or when processing data from an external source. To resolve this error, it’s important to understand how the data was encoded and to make sure that it’s decoded with that same encoding before being processed in Python.
Method 1: Use the correct encoding
When you encounter the UnicodeDecodeError with the message «invalid continuation byte», it means that Python is trying to decode a byte sequence that is not valid for the specified encoding. This error can be fixed by using the correct encoding.
Here are the steps to fix this error using the correct encoding:
Step 1: Determine the Encoding
The first step is to determine the encoding of the byte sequence. You can use the chardet library to automatically detect the encoding:
import chardet
with open('file.txt', 'rb') as f:
    data = f.read()
encoding = chardet.detect(data)['encoding']  # a best guess; may be None
Step 2: Decode the Byte Sequence
Once you have determined the encoding, you can decode the byte sequence using the correct encoding:
with open('file.txt', 'r', encoding=encoding) as f:
    data = f.read()
Step 3: Handle Errors
If the byte sequence contains invalid bytes that cannot be decoded using the specified encoding, you can handle the errors using the errors parameter:
with open('file.txt', 'r', encoding=encoding, errors='replace') as f:
    data = f.read()
The errors parameter can take, among others, the following values:
- 'strict': raise a UnicodeDecodeError if the byte sequence contains invalid bytes
- 'ignore': ignore the invalid bytes and continue decoding
- 'replace': replace the invalid bytes with the Unicode replacement character U+FFFD
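The difference between these error handlers is easiest to see side by side. The sketch below applies them to a short Latin-1 byte string chosen for illustration; 'backslashreplace' is an additional standard handler not listed above.

```python
raw = b"caf\xe9 au lait"  # Latin-1 bytes; 0xE9 is not valid UTF-8 in this position

try:
    raw.decode("utf-8", errors="strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "raised UnicodeDecodeError"

ignored = raw.decode("utf-8", errors="ignore")             # the bad byte is dropped
replaced = raw.decode("utf-8", errors="replace")           # the bad byte becomes U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")   # the bad byte becomes a literal escape

print(strict_result)
print(ignored)
print(replaced)
print(escaped)
```

Only 'strict' tells you that something is wrong; the other handlers trade that signal for a result that always decodes.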
Step 4: Encode the Unicode String
If you need to encode the Unicode string back to bytes, you can use the encode() method:
data = 'Hello, world!'
encoded_data = data.encode(encoding)
Here, encoding is the encoding used to decode the byte sequence.
That’s it! By following these steps, you should be able to fix the UnicodeDecodeError with the message «invalid continuation byte» in Python by using the correct encoding.
Method 2: Check the data for invalid characters
If you are working with text data in Python, you may encounter the UnicodeDecodeError: invalid continuation byte error. This error occurs when you try to decode bytes that are not valid in the chosen encoding. In this tutorial, we will show you how to work around this error by checking the data for invalid characters.
Step 1: Read the File in Binary Mode
The first step is to read the file in binary mode using the rb mode instead of the r mode. This will ensure that the file is read as bytes and not as text.
with open('file.txt', 'rb') as file:
    data = file.read()
Step 2: Decode the Data
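The next step is to decode the data using the appropriate encoding. In this example, we will use the utf-8 encoding.
try:
    text = data.decode('utf-8')
except UnicodeDecodeError:
    # Fall back so that text is always defined; undecodable bytes become U+FFFD
    text = data.decode('utf-8', errors='replace')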
Step 3: Check for Invalid Characters
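Now that we have decoded the data, we can check for unwanted characters using the isprintable() method. This method returns True if all the characters in the string are printable, otherwise it returns False. Note that it flags control characters rather than undecodable bytes, and that whitespace such as newlines and tabs also counts as non-printable, so we skip it explicitly.
invalid_chars = []
for char in text:
    # Keep whitespace such as '\n' and '\t'; flag other non-printable characters
    if not char.isprintable() and not char.isspace():
        invalid_chars.append(char)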
Step 4: Replace Invalid Characters
Finally, we can remove the invalid characters by replacing each one with an empty string using the replace() method.
for char in invalid_chars:
    text = text.replace(char, '')
Full Example
Here is the full example:
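with open('file.txt', 'rb') as file:
    data = file.read()
try:
    text = data.decode('utf-8')
except UnicodeDecodeError:
    # Fall back so that text is always defined; undecodable bytes become U+FFFD
    text = data.decode('utf-8', errors='replace')
invalid_chars = []
for char in text:
    # Keep whitespace such as '\n' and '\t'; flag other non-printable characters
    if not char.isprintable() and not char.isspace():
        invalid_chars.append(char)
for char in invalid_chars:
    text = text.replace(char, '')
This code will read the file in binary mode, decode the data using the utf-8 encoding (substituting U+FFFD for undecodable bytes), check for non-printable characters, and remove them. Because decoding no longer fails, the UnicodeDecodeError: invalid continuation byte error is avoided, at the cost of losing the undecodable bytes.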
Method 3: Use a try-except block to handle the error
To fix the UnicodeDecodeError: 'utf-8' codec can't decode byte... error in Python, you can use a try-except block to handle the error. Here’s an example code snippet:
try:
    with open('file.txt', 'r', encoding='utf-8') as f:
        text = f.read()
except UnicodeDecodeError:
    with open('file.txt', 'r', encoding='ISO-8859-1') as f:
        text = f.read()
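In this code, we try to open the file with UTF-8 encoding. If there’s a UnicodeDecodeError, we catch it with the except block and try to open the file again with ISO-8859-1 encoding. ISO-8859-1 accepts every byte value, so the second attempt never raises, but the text is only correct if the file really uses that encoding.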
You can also wrap the file reading code in a function to make it more reusable:
def read_file(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            text = f.read()
    except UnicodeDecodeError:
        with open(filename, 'r', encoding='ISO-8859-1') as f:
            text = f.read()
    return text
This function takes a filename as an argument and returns the file’s contents. If there’s a UnicodeDecodeError, it tries to open the file again with ISO-8859-1 encoding.
In summary, using a try-except block to handle the UnicodeDecodeError in Python involves trying to open the file with UTF-8 encoding, catching the error if it occurs, and trying to open the file again with another encoding (such as ISO-8859-1). This approach allows you to handle the error gracefully and continue with your program’s execution.
Method 4: Force decode using the «ignore» option
To fix the UnicodeDecodeError with the invalid continuation byte error in Python, you can force decode the data using the «ignore» option. Here’s how you can do it in Python:
with open('filename.txt', 'rb') as f:
    data = f.read()
# errors='ignore' never raises; undecodable bytes are silently dropped
decoded_data = data.decode('utf-8', errors='ignore')
with open('new_filename.txt', 'w', encoding='utf-8') as f:
    f.write(decoded_data)
In this example, we first read the file in binary mode using rb. This is necessary because the file contains invalid bytes that can’t be decoded directly. Then, we use the decode() method to decode the data using the «ignore» option. This option tells Python to drop any invalid bytes and continue decoding the rest of the data; nothing is inserted in their place (the «replace» option is the one that substitutes the replacement character U+FFFD). Finally, we write the decoded data to a new file in text mode using w.
Note that this method results in data loss, as any invalid bytes are silently dropped. If you want to preserve all the data in the file, you may need to use a different method, such as manually fixing the invalid bytes or using a different encoding.
Usually, there is no problem working with plain Latin characters. When special characters are involved, however, we may see the “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte in position: invalid continuation byte”.
Why does the “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte in position: invalid continuation byte” appear? And how to solve it?
Encoding and decoding with 2 different character sets
The error appears when we encode with one character set and then try to decode with a different one. See the example for a better understanding.
encoding = 'LearnShäreIT'.encode('latin-1')
decoding = encoding.decode('utf-8')
print(decoding)  # UnicodeDecodeError
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 7
To solve this error, you must use the character set that was previously used for encoding when you decode the string you want, like the code sample below.
encoding = 'LearnShäreIT'.encode('utf-8')
# Using the same character set
decoding = encoding.decode('utf-8')
print(decoding)
Output:
LearnShäreIT
The charset is inconsistent when saving files and reading files
When we create and save a CSV file, we choose the UTF-16 BE charset.
But when reading the file with pandas.read_csv(), we use the default character set of read_csv(), which is utf-8. See the code below for a better understanding.
import pandas as pd

# Using encoding = 'utf-8' but charset of data.csv = 'utf-16'
data = pd.read_csv('data.csv')
print(data)
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0
We have to set encoding='utf-16' for consistency between encoding and decoding. Like this:
import pandas as pd

# Using encoding='utf-16'
data = pd.read_csv('data.csv', encoding='utf-16')
print(data)
Output:
Name Website
0 LearnShareIT learnshareit.com
1 Facebook facebook.com
2 Google google.com
3 Udemy udemy.com
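The 0xfe byte at position 0 in the error above is the first byte of a UTF-16 BE byte order mark, so a file like this can often be recognized before decoding. The sketch below assumes the file starts with a BOM, which not every file does; the helper sniff_bom and the sample bytes are illustrative.

```python
import codecs

# Longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(data: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on a leading byte order mark, or default if none is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


# Stand-in for the first bytes of a CSV saved as UTF-16 BE with a BOM.
sample = codecs.BOM_UTF16_BE + "Name,Website".encode("utf-16-be")
encoding = sniff_bom(sample)
text = sample.decode(encoding)

print(hex(sample[0]), encoding, text)
```

The returned name could then be passed as the encoding argument when reading the file; files without a BOM fall through to the default and still need another way to determine their encoding.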
Using detect() function in the chardet package
You can use chardet to detect the character encoding of a file. This library is handy when working with a large pile of text. It can also be used when working with downloaded data whose charset you don’t know.
Syntax:
chardet.detect(data)
Parameter:
- data: data in the file you want to detect charset.
The detect() function detects what charset a non-Unicode string is using. It returns a dictionary containing the automatically detected charset and confidence level.
Before using the detect() function, we need to install chardet with the following command line:
pip install chardet
Then we will import chardet at the top of the Python file. Next, we pass the data into the detect() function to detect its charset. After getting the charset, pass it to read_csv(). Like this:
import chardet
import pandas as pd

# Detect character encoding of data.csv
with open('data.csv', 'rb') as f:
    enc = chardet.detect(f.read())
print(enc['encoding'])  # UTF-16

# Use pandas to read data.csv
data = pd.read_csv('data.csv', encoding=enc['encoding'])
print(data)
Output:
UTF-16
Name Website
0 LearnShareIT learnshareit.com
1 Facebook facebook.com
2 Google google.com
3 Udemy udemy.com
Change character encoding manually
This way is very simple. Just open the file you need to read with Notepad++. On the menu bar, select Encoding -> Convert to UTF-8.
Code:
import pandas as pd

# Using pandas to read data.csv with charset = UTF-8
data = pd.read_csv('data.csv')
print(data)
Output:
Name Website
0 LearnShareIT learnshareit.com
1 Facebook facebook.com
2 Google google.com
3 Udemy udemy.com
Summary
Basically, the error “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte in position: invalid continuation byte” comes from an inconsistency between the encoding and decoding processes. As long as you make sure to use the same character set for encoding and decoding (such as UTF-8), you won’t get this error again.
Have a lucky day!
Hi, I’m Cora Lopez. I have a passion for teaching programming languages such as Python, Java, Php, Javascript … I’m creating the free python course online. I hope this helps you in your learning journey.
Name of the university: HCMUE
Major: IT
Programming Languages: HTML/CSS/Javascript, PHP/sql/laravel, Python, Java
Sometimes, we want to fix UnicodeDecodeError, invalid continuation byte with Python.
In this article, we’ll look at how to fix UnicodeDecodeError, invalid continuation byte with Python.
How to fix UnicodeDecodeError, invalid continuation byte with Python?
To fix UnicodeDecodeError, invalid continuation byte with Python, we call decode to decode the byte string with the right encoding.
For instance, we write
s = b'\xe9\x80\x80'.decode('utf-8')
to call decode with 'utf-8' on the byte string to decode it as a Unicode string.
Conclusion
To fix UnicodeDecodeError, invalid continuation byte with Python, we call decode to decode the byte string with the right encoding.