Unit 2 - Comparisons and Boolean Reductions

CBSE Revision Notes

Class-11 Informatics Practices (New Syllabus)
Unit 2: Data Handling (DH-1)

Comparisons and Boolean Reductions

Flexible Comparisons

Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is analogous to the binary arithmetic operations described above:

In [45]: df.gt(df2)
Out[45]: 
   one  two three
a False False False
b False False False
c False False False
d False False False

In [46]: df2.ne(df)
Out[46]: 
   one  two three
a False False  True
b False False False
c False False False
d  True False False

These operations produce a pandas object of the same type as the left-hand-side input that is of dtype bool. These boolean objects can be used in indexing operations.
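As a small sketch of how such a boolean result can be used for indexing, the example below compares two made-up mark sheets. The names test1, test2, improved and maths_up are illustrative assumptions, not part of the notes above.

```python
import pandas as pd

# Hypothetical marks for two tests; names and values are illustrative only.
test1 = pd.DataFrame({"maths": [45, 72, 88], "physics": [60, 55, 91]},
                     index=["a", "b", "c"])
test2 = pd.DataFrame({"maths": [50, 70, 88], "physics": [58, 65, 91]},
                     index=["a", "b", "c"])

# Element-wise "greater than": a DataFrame of the same shape with dtype bool.
improved = test2.gt(test1)

# A boolean Series selects rows: students whose maths mark went up.
maths_up = test2[improved["maths"]]

print(improved)
print(maths_up)
```

Only row a is selected, because that is the only row where the maths mark in test2 is greater than in test1.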

Boolean Reductions

You can apply the reductions empty, any(), all(), and bool() to summarize a boolean result.

In [47]: (df > 0).all()
Out[47]: 
one   False
two   False
three  False
dtype: bool

In [48]: (df > 0).any()
Out[48]: 
one   True
two   True
three  True
dtype: bool

You can reduce to a final boolean value.

In [49]: (df > 0).any().any()
Out[49]: True
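The reductions can be tried on a small frame whose values are known. This is a minimal sketch; the DataFrame scores and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Illustrative data: column "one" contains one negative value.
scores = pd.DataFrame({"one": [1.0, -2.0, 3.0], "two": [4.0, 5.0, 6.0]})
positive = scores > 0

all_per_column = positive.all()      # True only where every value in the column is True
any_per_column = positive.any()      # True where at least one value in the column is True
all_per_row = positive.all(axis=1)   # the same reduction, applied across each row
overall = positive.all().all()       # reduce twice to get a single value

print(all_per_column)
print(any_per_column)
print(all_per_row)
print(overall)
```

By default the reduction runs down each column; passing axis=1 reduces along each row instead.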

You can test if a pandas object is empty, via the empty property.

In [50]: df.empty
Out[50]: False

In [51]: pd.DataFrame(columns=list('ABC')).empty
Out[51]: True

To evaluate single-element pandas objects in a boolean context, use the method bool() (note that recent pandas releases deprecate this method):

In [52]: pd.Series([True]).bool()
Out[52]: True

In [53]: pd.Series([False]).bool()
Out[53]: False

In [54]: pd.DataFrame([[True]]).bool()
Out[54]: True

In [55]: pd.DataFrame([[False]]).bool()
Out[55]: False

Warning

You might be tempted to do the following:

>>> if df:
   ...

>>> df and df2

Both of these raise an error, because a pandas object holding multiple values has no single truth value.

ValueError: The truth value of an array is ambiguous.
Use a.empty, a.any() or a.all().

See the gotchas section of the pandas documentation for a more detailed discussion.
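The sketch below shows the error being raised and three explicit alternatives. The DataFrame and the variable names are assumptions chosen for illustration; which alternative is right depends on what question you actually want to ask of the data.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 0], "two": [3, 4]})

# Using the DataFrame itself as a condition is ambiguous and raises ValueError.
try:
    if df:
        pass
    raised = False
except ValueError:
    raised = True

# Say explicitly what you mean instead.
has_rows = not df.empty               # "does it contain any data?"
any_nonzero = (df != 0).any().any()   # "is at least one value non-zero?"
all_nonzero = (df != 0).all().all()   # "is every value non-zero?"

print(raised, has_rows, any_nonzero, all_nonzero)
```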

Comparing if objects are equivalent

Often there is more than one way to compute the same result. As a simple example, consider df+df and df*2. To test that these two computations produce the same result, given the tools shown above, you might imagine using (df+df == df*2).all(). But in fact, this expression does not come out True for every column:

In [56]: df+df == df*2
Out[56]: 
   one  two three
a  True True False
b  True True  True
c  True True  True
d False True  True

In [57]: (df+df == df*2).all()
Out[57]: 
one   False
two    True
three  False
dtype: bool

Notice that the boolean DataFrame df+df == df*2 contains some False values! This is because NaNs do not compare as equal:

In [58]: np.nan == np.nan
Out[58]: False

So, NDFrames (such as Series and DataFrames) have an equals() method for testing equality, with NaNs in corresponding locations treated as equal.

In [59]: (df+df).equals(df*2)
Out[59]: True
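The difference between == and equals() can be reproduced with a tiny frame that contains a single NaN. This is a sketch with made-up data; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value in column "one".
df = pd.DataFrame({"one": [1.0, np.nan], "two": [3.0, 4.0]})

# Element-wise comparison: the NaN position compares as False.
elementwise = (df + df == df * 2)
same_by_eq = elementwise.all().all()

# equals() treats NaNs in the same location as equal.
same_by_equals = (df + df).equals(df * 2)

print(elementwise)
print(same_by_eq, same_by_equals)
```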

Note that the Series or DataFrame index needs to be in the same order for equality to be True:

In [60]: df1 = pd.DataFrame({'col':['foo', 0, np.nan]})

In [61]: df2 = pd.DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0])

In [62]: df1.equals(df2)
Out[62]: False

In [63]: df1.equals(df2.sort_index())
Out[63]: True

Comparing array-like objects

You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:

In [64]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[64]: 
0   True
1  False
2  False
dtype: bool

In [65]: pd.Index(['foo', 'bar', 'baz']) == 'foo'
Out[65]: array([ True, False, False], dtype=bool)

Pandas also handles element-wise comparisons between different array-like objects of the same length:

In [66]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
Out[66]: 
0   True
1   True
2  False
dtype: bool

In [67]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
Out[67]: 
0   True
1   True
2  False
dtype: bool
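The comparisons above can be collected into one short script. The variable names are illustrative assumptions; the last line shows one possible use of the result as a mask.

```python
import numpy as np
import pandas as pd

names = pd.Series(["foo", "bar", "baz"])

# A scalar is compared with every element.
vs_scalar = names == "foo"

# Same-length array-likes are compared position by position.
vs_index = names == pd.Index(["foo", "bar", "qux"])
vs_array = names == np.array(["foo", "bar", "qux"])

# The boolean result can act as a mask on the original Series.
matches = names[vs_array]

print(vs_scalar.tolist())
print(vs_index.tolist())
print(matches.tolist())
```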

Trying to compare Index or Series objects of different lengths will raise a ValueError:

In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
ValueError: Series lengths must match to compare

In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
ValueError: Series lengths must match to compare

Note that this is different from the NumPy behavior where a comparison can be broadcast:

In [68]: np.array([1, 2, 3]) == np.array([2])
Out[68]: array([False, True, False], dtype=bool)

or, in older NumPy releases, it can return False if broadcasting cannot be done (recent releases raise a ValueError instead):

In [69]: np.array([1, 2, 3]) == np.array([1, 2])
Out[69]: False
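The contrast between pandas and NumPy can be sketched in a single script. The variable names are assumptions, and aligning with reindex() is only one possible workaround for Series of different lengths, not a method prescribed by the notes above. Because the NumPy result for shapes that cannot be broadcast depends on the installed version, the sketch handles both outcomes rather than assuming one.

```python
import numpy as np
import pandas as pd

left = pd.Series([1, 2, 3])
right = pd.Series([1, 2])

# pandas: comparing Series of different lengths with == raises ValueError.
try:
    left == right
    pandas_raised = False
except ValueError:
    pandas_raised = True

# One possible workaround: align on the index labels first.
# Labels missing from `right` become NaN, which compares as False.
aligned = left.eq(right.reindex(left.index))

# NumPy: a length-1 array is broadcast against every element.
broadcast = np.array([1, 2, 3]) == np.array([2])

# NumPy with shapes that cannot be broadcast: the outcome depends on the
# NumPy version, so both possibilities are handled here.
try:
    mismatch = np.array([1, 2, 3]) == np.array([1, 2])
except ValueError:
    mismatch = "raised ValueError"

print(pandas_raised)
print(aligned.tolist())
print(broadcast.tolist())
```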

Combining overlapping data sets

A problem occasionally arising is the combination of two similar data sets where values in one are preferred over the other. An example would be two data series representing a particular economic indicator where one is considered to be of “higher quality”. However, the lower quality series might extend further back in history or have more complete data coverage. As such, we would like to combine two DataFrame objects where missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame. The function implementing this operation is combine_first(), which we illustrate:

In [70]: df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
   ....:                     'B': [np.nan, 2., 3., np.nan, 6.]})
   ....: 

In [71]: df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
   ....:                     'B': [np.nan, np.nan, 3., 4., 6., 8.]})
   ....: 

In [72]: df1
Out[72]: 
   A  B
0 1.0 NaN
1 NaN 2.0
2 3.0 3.0
3 5.0 NaN
4 NaN 6.0

In [73]: df2
Out[73]: 
   A  B
0 5.0 NaN
1 2.0 NaN
2 4.0 3.0
3 NaN 4.0
4 3.0 6.0
5 7.0 8.0

In [74]: df1.combine_first(df2)
Out[74]: 
   A  B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0
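The same idea works on Series. The sketch below mimics the economic-indicator scenario with invented numbers: high_quality is the preferred series but has gaps, and long_history is the fallback that reaches one year further back. Both names and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Preferred series: trusted values, but with gaps and a later start.
high_quality = pd.Series([np.nan, 2.5, np.nan, 3.1],
                         index=[2001, 2002, 2003, 2004])

# Fallback series: complete coverage from an earlier year.
long_history = pd.Series([1.9, 2.0, 2.4, 2.8, 3.0],
                         index=[2000, 2001, 2002, 2003, 2004])

# Keep high_quality where it has a value; otherwise take long_history.
combined = high_quality.combine_first(long_history)

print(combined)
```

The years 2002 and 2004 keep the preferred values (2.5 and 3.1), while 2000, 2001 and 2003 are filled from the fallback series.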

General DataFrame Combine

The combine_first() method above calls the more general DataFrame.combine(). This method takes another DataFrame and a combiner function, aligns the two DataFrames, and then passes the combiner function pairs of Series (i.e., columns whose names are the same).

So, for instance, to reproduce combine_first() as above:

In [75]: combiner = lambda x, y: np.where(pd.isna(x), y, x)

In [76]: df1.combine(df2, combiner)
Out[76]: 
   A  B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0
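Any function that accepts two Series can serve as the combiner. As a sketch under the assumption that you want the larger of the two values at each position, np.fmax can be passed directly; it ignores a NaN when the other value is present. The frames first and second are made-up data.

```python
import numpy as np
import pandas as pd

# Illustrative frames with the same labels and a few missing values.
first = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})
second = pd.DataFrame({"A": [2.0, 2.0, 1.0], "B": [np.nan, 9.0, 6.0]})

# combine() calls the function once per pair of same-named columns.
# np.fmax returns the element-wise maximum and skips NaN where it can.
larger = first.combine(second, np.fmax)

print(larger)
```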