How to remove one or multiple columns in a pandas DataFrame in python ?

Daidalos November 01, 2019


Examples of how to remove one or multiple columns in a pandas DataFrame in python:

Remove one column

Lets create a simple dataframe

>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(100, size=(10,10))
>>> data
array([[55, 38, 97, 91, 22, 12, 87, 55, 38, 91],
       [44, 39, 64, 45, 33, 26, 28,  9, 86, 40],
       [10, 98, 55,  3, 99, 41, 25,  8, 36, 47],
       [84, 82, 55, 73, 29, 28, 85, 99,  7, 92],
       [84, 38, 45, 75,  9, 45, 89, 19, 62, 57],
       [75, 62, 86, 46, 82, 25, 83, 61, 74, 17],
       [69, 23, 50, 67,  6, 36, 28, 30,  4, 23],
       [86, 32, 97, 10, 30, 99, 42, 57, 11, 83],
       [31, 47, 89, 91, 34, 94, 34, 93, 32, 78],
       [78, 74, 95, 45, 24, 22, 89, 61, 60, 85]])
>>> df = pd.DataFrame(data=data,columns=['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'])
>>> df
   c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
0  55  38  97  91  22  12  87  55  38   91
1  44  39  64  45  33  26  28   9  86   40
2  10  98  55   3  99  41  25   8  36   47
3  84  82  55  73  29  28  85  99   7   92
4  84  38  45  75   9  45  89  19  62   57
5  75  62  86  46  82  25  83  61  74   17
6  69  23  50  67   6  36  28  30   4   23
7  86  32  97  10  30  99  42  57  11   83
8  31  47  89  91  34  94  34  93  32   78
9  78  74  95  45  24  22  89  61  60   85

Remove the column 7

>>> df.drop('c7',1,inplace=True)

returns

>>> df
   c1  c2  c3  c4  c5  c6  c8  c9  c10
0  55  38  97  91  22  12  55  38   91
1  44  39  64  45  33  26   9  86   40
2  10  98  55   3  99  41   8  36   47
3  84  82  55  73  29  28  99   7   92
4  84  38  45  75   9  45  19  62   57
5  75  62  86  46  82  25  61  74   17
6  69  23  50  67   6  36  30   4   23
7  86  32  97  10  30  99  57  11   83
8  31  47  89  91  34  94  93  32   78
9  78  74  95  45  24  22  61  60   85

Remove a list of columns

>>> df = pd.DataFrame(data=data,columns=['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'])
>>> df.drop(['c1','c3','c7'],1,inplace=True)
>>> df
   c2  c4  c5  c6  c8  c9  c10
0  38  91  22  12  55  38   91
1  39  45  33  26   9  86   40
2  98   3  99  41   8  36   47
3  82  73  29  28  99   7   92
4  38  75   9  45  19  62   57
5  62  46  82  25  61  74   17
6  23  67   6  36  30   4   23
7  32  10  30  99  57  11   83
8  47  91  34  94  93  32   78
9  74  45  24  22  61  60   85

Remove multiple consecutive columns

>>> df = pd.DataFrame(data=data,columns=['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'])
>>> df.drop(df.iloc[:,3:7],1,inplace=True)
>>> df
   c1  c2  c3  c8  c9  c10
0  55  38  97  55  38   91
1  44  39  64   9  86   40
2  10  98  55   8  36   47
3  84  82  55  99   7   92
4  84  38  45  19  62   57
5  75  62  86  61  74   17
6  69  23  50  30   4   23
7  86  32  97  57  11   83
8  31  47  89  93  32   78
9  78  74  95  61  60   85

Remove columns with misssing data (NAN ou NULL)

Lets consider the following dataset train.csv (that can be downloaded on kaggle). To read the file a solution is to use read_csv():

>>> import pandas as pd
>>> data = pd.read_csv('train.csv')

Get dataframe shape

>>> data.shape
(1460, 81)

Get a dataset preview:

>>> df.head()
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities    ...     PoolArea PoolQC Fence MiscFeature MiscVal  \
0         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
1         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
2         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
3         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
4         Lvl    AllPub    ...            0    NaN   NaN         NaN       0

  MoSold YrSold  SaleType  SaleCondition  SalePrice  
0      2   2008        WD         Normal     208500  
1      5   2007        WD         Normal     181500  
2      9   2008        WD         Normal     223500  
3      2   2006        WD        Abnorml     140000  
4     12   2008        WD         Normal     250000

Print the number of missing data per columns:

>>> column_with_nan = df.columns[df.isnull().any()]
>>> df.shape
(1460, 81)
>>> for column in column_with_nan:
...     print(column, df[column].isnull().sum())
... 
LotFrontage 259
Alley 1369
MasVnrType 8
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406

Remove columns that have more than 50% of missing data:

>>> for column in column_with_nan:
...     if df[column].isnull().sum()*100.0/df_shape[0] > 50:
...             df.drop(column,1, inplace=True)
... 
>>> df.shape
(1460, 77)

References

Links Site
Delete column from pandas DataFrame stackoverflow
How do I get a summary count of missing/NaN data by column in 'pandas'? stackoverflow
How to count nan values in a pandas DataFrame?) stackoverflow
How to count the NaN values in a column in pandas DataFrame) stackoverflow
How to find which columns contain any NaN value in Pandas dataframe (python) stackoverflow
isnull pandas doc
any pandas doc

Licence


Google Ads


Activity


Google Ads