How to randomly select (sample) rows of a dataframe using pandas in Python?

Daidalos May 24, 2020


Example of how to randomly select (sample) rows of a dataframe using pandas in Python:

1 -- Create a simple dataframe

Let's create a simple dataframe with 5 columns and 20 rows:

>>> import pandas as pd
>>> import numpy as np
>>> data = np.arange(1,101)
>>> data = data.reshape(20,5)
>>> df = pd.DataFrame(data=data,columns=['a','b','c','d','e'])
>>> df
     a   b   c   d    e
0    1   2   3   4    5
1    6   7   8   9   10
2   11  12  13  14   15
3   16  17  18  19   20
4   21  22  23  24   25
5   26  27  28  29   30
6   31  32  33  34   35
7   36  37  38  39   40
8   41  42  43  44   45
9   46  47  48  49   50
10  51  52  53  54   55
11  56  57  58  59   60
12  61  62  63  64   65
13  66  67  68  69   70
14  71  72  73  74   75
15  76  77  78  79   80
16  81  82  83  84   85
17  86  87  88  89   90
18  91  92  93  94   95
19  96  97  98  99  100

2 -- Randomly select rows using the function sample()

To sample a dataframe using pandas, a solution is to use pandas.DataFrame.sample. Example: let's randomly select 5 rows from the dataframe df defined above:

>>> df_sub_cutoff = df.sample(n=5)
>>> df_sub_cutoff
     a   b   c   d   e
11  56  57  58  59  60
0    1   2   3   4   5
18  91  92  93  94  95
15  76  77  78  79  80
9   46  47  48  49  50
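
The sampled rows keep their original index labels (11, 0, 18, 15 and 9 in the output above). If a fresh 0..n-1 index is preferred, one possible approach is to chain reset_index(drop=True). This is a minimal sketch, and the name df_sub_reset is only illustrative:

```python
import numpy as np
import pandas as pd

# Rebuild the 20-row, 5-column dataframe from section 1
df = pd.DataFrame(np.arange(1, 101).reshape(20, 5), columns=list("abcde"))

# Draw 5 random rows; the original index labels are preserved
df_sub = df.sample(n=5)

# drop=True discards the old labels instead of keeping them as a new column
df_sub_reset = df_sub.reset_index(drop=True)

print(df_sub_reset.index.tolist())  # [0, 1, 2, 3, 4]
```

The row values are unchanged; only the index labels are renumbered.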

Let's draw another sample of size n=5. Since no seed is set, a different set of rows comes back:

>>> df_sub_cutoff = df.sample(n=5)
>>> df_sub_cutoff
     a   b   c   d   e
0    1   2   3   4   5
4   21  22  23  24  25
12  61  62  63  64  65
5   26  27  28  29  30
16  81  82  83  84  85

or of size n=2:

>>> df_sub_cutoff = df.sample(n=2)
>>> df_sub_cutoff
     a   b   c   d   e
0    1   2   3   4   5
15  76  77  78  79  80
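
Instead of a fixed number of rows, sample() also accepts a fraction of the dataframe through the frac parameter (n and frac cannot be used together). A short sketch, assuming the same 20-row dataframe as above; the name df_frac is only illustrative:

```python
import numpy as np
import pandas as pd

# Rebuild the 20-row, 5-column dataframe from section 1
df = pd.DataFrame(np.arange(1, 101).reshape(20, 5), columns=list("abcde"))

# frac=0.25 asks for 25% of the rows: 0.25 * 20 = 5 rows
df_frac = df.sample(frac=0.25)

print(len(df_frac))  # 5
```

By default the rows are drawn without replacement, so no row appears twice in the sample.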

Note: to always get the same sample, a solution is to pass a fixed seed with the option "random_state" (random_state=42 for example):

>>> df_sub_cutoff = df.sample(n=5, random_state = 42)
>>> df_sub_cutoff
     a   b   c   d   e
0    1   2   3   4   5
17  86  87  88  89  90
15  76  77  78  79  80
1    6   7   8   9  10
8   41  42  43  44  45

Let's do it again using random_state = 42, to check that we get the same sample:

>>> df_sub_cutoff = df.sample(n=5, random_state = 42)
>>> df_sub_cutoff
     a   b   c   d   e
0    1   2   3   4   5
17  86  87  88  89  90
15  76  77  78  79  80
1    6   7   8   9  10
8   41  42  43  44  45
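
Rather than comparing the two outputs by eye, the check can also be done in code. A minimal sketch using DataFrame.equals; the names first and second are only illustrative:

```python
import numpy as np
import pandas as pd

# Rebuild the 20-row, 5-column dataframe from section 1
df = pd.DataFrame(np.arange(1, 101).reshape(20, 5), columns=list("abcde"))

# Two independent calls with the same seed
first = df.sample(n=5, random_state=42)
second = df.sample(n=5, random_state=42)

# equals() returns True only if index, columns and values all match
print(first.equals(second))  # True
```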

3 -- References