I am getting ValueError: cannot reindex from a duplicate axis when I try to add a row to a DataFrame by assigning to a new index label. I tried to reproduce this with a simple example, but could not.
My DataFrame has a string index, integer column labels, and float values. When I try to add a 'sums' row holding the sum of each column, I get the error above. A small DataFrame with the same characteristics does not reproduce the problem. What could I be missing?
I also don't really understand what ValueError: cannot reindex from a duplicate axis means. Knowing that might help me diagnose the problem, and it is probably the most answerable part of my question.
Here is my session inside an ipdb trace:
ipdb> type(affinity_matrix)
<class 'pandas.core.frame.DataFrame'>
ipdb> affinity_matrix.shape
(333, 10)
ipdb> affinity_matrix.columns
Int64Index([9315684, 9315597, 9316591, 9320520, 9321163, 9320615, 9321187, 9319487, 9319467, 9320484], dtype='int64')
ipdb> affinity_matrix.index
Index([u'001', u'002', u'003', u'004', u'005', u'008', u'009', u'010', u'011', u'014', u'015', u'016', u'018', u'020', u'021', u'022', u'024', u'025', u'026', u'027', u'028', u'029', u'030', u'032', u'033', u'034', u'035', u'036', u'039', u'040', u'041', u'042', u'043', u'044', u'045', u'047', u'047', u'048', u'050', u'053', u'054', u'055', u'056', u'057', u'058', u'059', u'060', u'061', u'062', u'063', u'065', u'067', u'068', u'069', u'070', u'071', u'072', u'073', u'074', u'075', u'076', u'077', u'078', u'080', u'082', u'083', u'084', u'085', u'086', u'089', u'090', u'091', u'092', u'093', u'094', u'095', u'096', u'097', u'098', u'100', u'101', u'103', u'104', u'105', u'106', u'107', u'108', u'109', u'110', u'111', u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'121', u'122', ...], dtype='object')
ipdb> affinity_matrix.values.dtype
dtype('float64')
ipdb> 'sums' in affinity_matrix.index
False
Here is the error:
ipdb> affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
*** ValueError: cannot reindex from a duplicate axis
I tried to reproduce this with a simple example, but failed:
In [32]: import pandas as pd
In [33]: import numpy as np
In [34]: a = np.arange(35).reshape(5,7)
In [35]: df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))
In [36]: df.values.dtype
Out[36]: dtype('int64')
In [37]: df.loc['sums'] = df.sum(axis=0)
In [38]: df
Out[38]:
      10  11  12  13  14  15   16
x      0   1   2   3   4   5    6
y      7   8   9  10  11  12   13
u     14  15  16  17  18  19   20
z     21  22  23  24  25  26   27
w     28  29  30  31  32  33   34
sums  70  75  80  85  90  95  100
Table of Contents
- Verify if your DataFrame Index contains Duplicate values
- Test which values in an index are duplicates
- Drop rows with duplicate index values
- Prevent duplicate values in a DataFrame index
- Overwrite DataFrame index with a new one
In pandas, ValueError: cannot reindex from a duplicate axis is usually raised when you assign to a specific index label, or reindex or resample a DataFrame, while its index contains duplicate values.
The message “cannot reindex from a duplicate axis” means that the axis being reindexed (normally the DataFrame index) contains duplicate labels. Operations that have to align on those labels, such as reindexing, resampling, or some concatenations, cannot decide which of the duplicated rows to use, so pandas raises a ValueError. Newer pandas versions word the same error as “cannot reindex on an axis with duplicate labels”.
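A minimal sketch of how the error arises, using a made-up three-row frame (the labels and values are illustrative, not taken from the question). The exact wording of the message depends on the pandas version, so the example only captures it rather than assuming one form.

```python
import pandas as pd

# Hypothetical data: the label '047' appears twice in the index.
df = pd.DataFrame({"score": [1.0, 2.0, 3.0]}, index=["046", "047", "047"])

try:
    # reindex has to map each requested label to exactly one existing row,
    # which is ambiguous for '047'.
    df.reindex(["046", "047", "048"])
    message = None
except ValueError as err:
    message = str(err)

print(message)
```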
Verify if your DataFrame Index contains Duplicate values
When you get this error, first check whether the DataFrame index contains duplicate values:
df.index.is_unique
index.is_unique is a property, not a method, so it is accessed without parentheses. It is True if every label in the index is unique and False otherwise.
Test which values in an index are duplicates
To see which index values are duplicated, use the index.duplicated method:
df.index.duplicated()
The method returns a boolean array with one entry per label. With the default keep='first', the first occurrence of each label is marked False and every later repeat is marked True.
idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])
idx.duplicated()
Output
array([False, False, True, False, True])
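The boolean mask can be turned into the actual repeated labels. The sketch below reuses the same index; the variable names are only illustrative.

```python
import pandas as pd

idx = pd.Index(["lama", "cow", "lama", "beetle", "lama"])

# keep=False flags every member of a duplicated group, including the first.
all_dupes = idx.duplicated(keep=False)

# Distinct labels that occur more than once.
repeated_labels = idx[all_dupes].unique().tolist()

# How many times each repeated label occurs.
counts = idx.value_counts()
repeated_counts = counts[counts > 1].to_dict()

print(repeated_labels)
print(repeated_counts)
```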
Drop rows with duplicate index values
The same index.duplicated method can be used to drop rows whose index label has already appeared, using the code below.
It keeps the first row for each index label, in top-down order, and drops every later row with the same label.
df.loc[~df.index.duplicated(), :]
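As a rough sketch of how this applies to the original question, the frame below imitates its shape with made-up values: one label ('047') is repeated, the duplicates are dropped, and the 'sums' row is then added. Whether dropping rows is acceptable depends on your data; aggregating them may be the better choice.

```python
import pandas as pd

# Hypothetical frame shaped like the one in the question: '047' occurs twice.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]},
    index=["046", "047", "047"],
)

# keep='first' (the default) retains the first row for each label;
# keep='last' would retain the last one instead.
deduped = df.loc[~df.index.duplicated(keep="first"), :].copy()

# With a unique index, appending a labelled row of column sums is safe.
deduped.loc["sums"] = deduped.sum(axis=0)
print(deduped)
```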
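Alternatively, df.drop_duplicates() removes rows whose column values are duplicated. Note that it compares column values, not index labels, so on its own it does not make the index unique; it helps when repeated rows are the source of the repeated labels.
Consider a dataset containing ramen ratings.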
>>> df = pd.DataFrame({
... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
... 'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
By default, it removes duplicate rows based on all columns.
>>> df.drop_duplicates()
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
To remove duplicates based on specific column(s), use subset.
>>> df.drop_duplicates(subset=['brand'])
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
To remove duplicates and keep the last occurrences, use keep.
>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
     brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
Prevent duplicate values in a DataFrame index
To guarantee that a DataFrame never ends up with duplicate values in its index, you can set a flag. Setting the allows_duplicate_labels flag to False (available since pandas 1.2) makes pandas refuse duplicate labels.
df.flags.allows_duplicate_labels = False
Applying this flag to a DataFrame that already has duplicate labels, or running an operation on a flagged DataFrame that would produce duplicate labels, raises DuplicateLabelError: Index has duplicates.
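The two failure modes can be sketched as follows. The frames and variable names are made up for illustration, and the example assumes pandas 1.2 or later, where the flag and pd.errors.DuplicateLabelError exist.

```python
import pandas as pd

# Case 1: the frame already has a repeated label, so the flag is refused.
dup = pd.DataFrame({"x": [1, 2]}, index=["a", "a"])
try:
    dup.flags.allows_duplicate_labels = False
    rejected = False
except pd.errors.DuplicateLabelError:
    rejected = True

# Case 2: the labels are unique, so the flag is accepted...
ok = pd.DataFrame({"x": [1, 2]}, index=["x", "X"]).set_flags(
    allows_duplicate_labels=False
)
try:
    # ...but upper-casing the labels would turn both into 'X'.
    ok.rename(str.upper)
    blocked = False
except pd.errors.DuplicateLabelError:
    blocked = True

print(rejected, blocked)
```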
Overwrite DataFrame index with a new one
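Alternatively, overwrite the current DataFrame index with a new one whose labels are unique:
df.index = new_index
or use .reset_index, which moves the old index into a column and replaces it with a default integer index:
df.reset_index(level=0, inplace=True)
Remove inplace=True if you want the method to return a new DataFrame instead of modifying df.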
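A small sketch of the non-inplace form, on a made-up frame, showing the difference between keeping the old labels as a column and discarding them:

```python
import pandas as pd

# Hypothetical frame with a repeated label 'a'.
df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "a", "b"])

# drop=False (the default) keeps the old labels as a column named "index".
kept = df.reset_index()

# drop=True discards the old labels entirely.
dropped = df.reset_index(drop=True)

print(kept)
print(dropped)
```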
Srinivas Ramakrishna is a Solution Architect and has 14+ Years of Experience in the Software Industry. He has published many articles on Medium, Hackernoon, dev.to and solved many problems in StackOverflow. He has core expertise in various technologies such as Microsoft .NET Core, Python, Node.JS, JavaScript, Cloud (Azure), RDBMS (MSSQL), React, Powershell, etc.
Cannot reindex DataFrame with duplicated axis
Let’s start by writing some simple Python code to define a pandas DataFrame. In reality, you will most probably be acquiring your data from an external file, database or API.
import pandas as pd
stamps = ['01-02-23', ['01-02-23', '01-02-24'], '01-02-24', '01-03-24', '02-03-24']
sales_team = ['North', 'South', 'West', 'East', 'South']
revenue = [109.0, 201.0, 156.0, 181.0, 117.0]
rev_df = pd.DataFrame(dict(time=stamps, team=sales_team, revenue=revenue))
print(rev_df)
We will get the following data set:
index | time | team | revenue |
---|---|---|---|
0 | 01-02-23 | North | 109.0 |
1 | [01-02-23, 01-02-24] | South | 201.0 |
2 | 01-02-24 | West | 156.0 |
3 | 01-03-24 | East | 181.0 |
4 | 02-03-24 | South | 117.0 |
As the time column contains a list, we will break down the second row to two different rows using the explode() function.
new_rev_df = rev_df.explode('time')
print(new_rev_df.head())
One feature of explode() is that it replicates indexes. We will get the following data:
index | time | team | revenue |
---|---|---|---|
0 | 01-02-23 | North | 109.0 |
1 | 01-02-23 | South | 201.0 |
1 | 01-02-24 | South | 201.0 |
2 | 01-02-24 | West | 156.0 |
3 | 01-03-24 | East | 181.0 |
Trying to reindex the exploded DataFrame then fails with a ValueError, because the row label 1 now appears twice:
idx = ['time']
new_rev_df.reindex(idx)
The error message will be:
ValueError: cannot reindex on an axis with duplicate labels
Note that reindex conforms the existing row labels to the list you pass; it does not turn the time column into the index (that is what set_index does). I have also encountered this error when invoking the Seaborn library on data containing duplicated indexes.
Fixing the error
There are a couple of ways to circumvent this error message.
Aggregate the data
We can group the data and save the result as a DataFrame. With this option no data is removed from your DataFrame; rows are combined only if they share the same time and team.
new_rev_df.groupby(['time','team']).revenue.sum().to_frame()
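To see what the aggregation produces, here is a self-contained sketch. The frame is a hand-written stand-in for the first five exploded rows shown earlier, so it runs on its own.

```python
import pandas as pd

# Stand-in for the exploded frame: the row label 1 appears twice.
new_rev_df = pd.DataFrame(
    {
        "time": ["01-02-23", "01-02-23", "01-02-24", "01-02-24", "01-03-24"],
        "team": ["North", "South", "South", "West", "East"],
        "revenue": [109.0, 201.0, 201.0, 156.0, 181.0],
    },
    index=[0, 1, 1, 2, 3],
)

# The (time, team) pairs become a MultiIndex whose entries are unique.
agg = new_rev_df.groupby(["time", "team"]).revenue.sum().to_frame()

# reset_index turns the group keys back into ordinary columns.
flat = agg.reset_index()

print(agg)
print(flat)
```

Because every (time, team) pair occurs once here, the sums equal the original revenues; pairs that repeat would be added together.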
Remove Duplicated indexes
We can use the pandas loc indexer to get rid of rows with duplicated index labels. With this option the first row for each label is kept and any later row with the same label is removed.
dup_idx = new_rev_df.index.duplicated()
new_rev_df.loc[~dup_idx]
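A further option, not covered above, is to avoid creating the duplicated labels in the first place. The sketch below assumes pandas 1.1 or later, where explode accepts ignore_index, and uses a shortened version of the revenue data.

```python
import pandas as pd

rev_df = pd.DataFrame(
    {
        "time": ["01-02-23", ["01-02-23", "01-02-24"], "01-02-24"],
        "team": ["North", "South", "West"],
        "revenue": [109.0, 201.0, 156.0],
    }
)

# Default behaviour: exploded rows inherit the label of their source row.
exploded = rev_df.explode("time")

# ignore_index=True renumbers the result 0..n-1, so no label repeats.
renumbered = rev_df.explode("time", ignore_index=True)

print(exploded.index.tolist())
print(renumbered.index.tolist())
```

Unlike dropping duplicates, this keeps both exploded rows.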
Note: In this tutorial we replicated the problem for cases in which the row index is duplicated. You might also encounter this issue when working with datasets, typically wide ones, that include duplicated column names.
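The same checks work on the column axis. A minimal sketch with a made-up frame whose column name 'a' is repeated:

```python
import pandas as pd

# Hypothetical wide frame: the column name 'a' is used twice.
wide = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

has_dupes = not wide.columns.is_unique

# Keep only the first column for each name.
unique_cols = wide.loc[:, ~wide.columns.duplicated()]

print(has_dupes)
print(unique_cols)
```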
The «Cannot Reindex from a Duplicate Axis» error is a common issue faced by developers when working with the Pandas library in Python. This error occurs when a DataFrame or Series operation requires a unique index, but the given index contains duplicate values.
In this guide, we will discuss the causes of this error, provide a step-by-step solution to resolve it, and answer some frequently asked questions related to this issue.
Table of Contents
- Understanding the Error
- Step-by-Step Solution
- FAQs
Understanding the Error
The «Cannot Reindex from a Duplicate Axis» error usually occurs when you try to perform an operation on a DataFrame or Series with a non-unique index. For example, the following code will result in the error because the index contains duplicate values:
import pandas as pd
data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}
index = ['a', 'a', 'b', 'b']
df = pd.DataFrame(data, index=index)
print(df.reindex(['a', 'b', 'c']))
This error is raised because the reindex method requires the existing index to be unique: with duplicates, it cannot tell which of the two 'a' rows the label 'a' refers to.
Step-by-Step Solution
To solve the «Cannot Reindex from a Duplicate Axis» error, follow these steps:
1. Identify the cause of the error: Check your DataFrame or Series for any duplicate values in the index.
2. Remove or modify duplicate values: There are several ways to handle duplicate index values:
a. Reset the index: You can reset the index to the default integer index using the reset_index method. This replaces the duplicate index values and adds a new column holding the old index values.
df.reset_index(inplace=True, drop=False)
b. Create a unique index: If you want to keep a meaningful index, you can modify the duplicate values to create a unique index. For example, you can append a running number to each repeated value, leaving the first occurrence unchanged. This assumes the index holds strings and that none of the generated labels already exists.
counts = df.groupby(level=0).cumcount()
df.index = [f"{label}_{n}" if n else label for label, n in zip(df.index, counts)]
c. Drop duplicate index values: If you want to remove rows with duplicate index values, you can use the duplicated method along with boolean indexing.
df = df[~df.index.duplicated(keep='first')]
3. Perform the desired operation: After handling duplicate index values, you can perform the operation that caused the error.
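The steps above can be sketched as a single function. reindex_safely is a hypothetical helper written for this guide, not a pandas API, and it uses option (c), dropping later duplicates.

```python
import pandas as pd


def reindex_safely(df, new_labels, keep="first"):
    """Hypothetical helper: drop repeated index labels, then reindex."""
    # Step 1: identify the cause.
    if not df.index.is_unique:
        # Step 2 (option c): keep one row per label.
        df = df[~df.index.duplicated(keep=keep)]
    # Step 3: perform the operation that previously failed.
    return df.reindex(new_labels)


data = {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}
df = pd.DataFrame(data, index=["a", "a", "b", "b"])

result = reindex_safely(df, ["a", "b", "c"])
print(result)
```

The label 'c' does not exist in the original frame, so its row is filled with NaN and the columns become floats.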
FAQs
1. What does the «Cannot Reindex from a Duplicate Axis» error mean?
This error occurs when an operation requires a unique index, but the given index contains duplicate values. It usually happens when calling reindex, or when an operation has to align data on the index (for example, assigning a Series to a DataFrame column), on a DataFrame or Series with non-unique index values.
2. How can I check if my DataFrame or Series has duplicate index values?
Call the duplicated method on the index and chain the any method to check whether any label is repeated (equivalently, test not df.index.is_unique):
has_duplicate_indexes = df.index.duplicated().any()
3. How can I find the duplicate index values in my DataFrame or Series?
You can use the duplicated method along with boolean indexing to find the duplicate index values:
duplicate_indexes = df[df.index.duplicated()].index
4. Can I use the drop_duplicates method to remove duplicate index values?
No, the drop_duplicates method is used to remove duplicate rows based on column values, not index values. To remove duplicate index values, refer to the solutions provided in this guide.
5. Can I use the unique method to create a unique index?
Not directly. The unique method returns the distinct values of a Series or Index, so whenever there are duplicates the result is shorter than the DataFrame and cannot be assigned back as its index. To create a unique index, refer to the solutions provided in this guide.