Unit 2 - Comparisons and Boolean Reductions
CBSE Revision Notes
Class-11 Informatics Practices (New Syllabus)
Unit 2: Data Handling (DH-1)
Comparisons and Boolean Reductions
Flexible Comparisons
Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge, whose behavior is analogous to the binary arithmetic operations described above:
In [45]: df.gt(df2)
Out[45]:
one two three
a False False False
b False False False
c False False False
d False False False
In [46]: df2.ne(df)
Out[46]:
one two three
a False False True
b False False False
c False False False
d True False False
These operations produce a pandas object of the same type as the left-hand-side input, with dtype bool. These boolean objects can be used in indexing operations.
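As a minimal sketch of the methods above, the example below builds two small frames and uses a boolean result for indexing. The data here is hypothetical (the notes do not show how df and df2 were created), so the values differ from the outputs printed above.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({"one": [1, 5, 3], "two": [4, 2, 6]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"one": [2, 5, 1], "two": [4, 1, 9]}, index=["a", "b", "c"])

greater = df.gt(df2)     # element-wise, same result as df > df2
different = df.ne(df2)   # element-wise, same result as df != df2

# A boolean Series can be used to select rows of the DataFrame.
rows_where_one_is_greater = df[df["one"].gt(df2["one"])]

print(greater)
print(different)
print(rows_where_one_is_greater)
```

Both results are DataFrames of dtype bool with the same labels as df, and the indexing step keeps only the rows where the condition is True.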
Boolean Reductions
You can apply the reductions empty, any(), all(), and bool() to summarize a boolean result.
In [47]: (df > 0).all()
Out[47]:
one False
two False
three False
dtype: bool
In [48]: (df > 0).any()
Out[48]:
one True
two True
three True
dtype: bool
You can reduce to a final boolean value.
In [49]: (df > 0).any().any()
Out[49]: True
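The reductions above can be sketched on a small frame whose values are easy to check by hand. The data is hypothetical and is not the df used in the session above.

```python
import pandas as pd

# Hypothetical data: "two" is all positive, "three" is all negative.
df = pd.DataFrame({"one": [1, -2, 3], "two": [4, 5, 6], "three": [-1, -2, -3]})
positive = df > 0

col_all = positive.all()            # per column: is every value positive?
col_any = positive.any()            # per column: is at least one value positive?
row_any = positive.any(axis=1)      # per row instead of per column
overall_any = bool(positive.any().any())   # reduce to a single boolean
overall_all = bool(positive.all().all())

print(col_all)
print(col_any)
print(overall_any, overall_all)
```

By default any() and all() reduce down each column; passing axis=1 reduces across each row instead.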
You can test if a pandas object is empty via the empty property.
In [50]: df.empty
Out[50]: False
In [51]: pd.DataFrame(columns=list('ABC')).empty
Out[51]: True
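One point worth checking when using empty: a frame that holds only NaN values still has rows, so it is not considered empty. A short sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

no_rows = pd.DataFrame(columns=list("ABC"))   # columns but no rows
only_nan = pd.DataFrame({"A": [np.nan]})      # one row holding a missing value

no_rows_empty = no_rows.empty                  # True: there is nothing in it
only_nan_empty = only_nan.empty                # False: the NaN row still counts
after_dropna_empty = only_nan.dropna().empty   # True once the NaN row is dropped

print(no_rows_empty, only_nan_empty, after_dropna_empty)
```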
To evaluate single-element pandas objects in a boolean context, use the method bool() (note that this method is deprecated in pandas 2.1 and later):
In [52]: pd.Series([True]).bool()
Out[52]: True
In [53]: pd.Series([False]).bool()
Out[53]: False
In [54]: pd.DataFrame([[True]]).bool()
Out[54]: True
In [55]: pd.DataFrame([[False]]).bool()
Out[55]: False
Warning
You might be tempted to do the following:
>>> if df:
...
Or
>>> df and df2
These will both raise errors, as you are trying to compare multiple values.
ValueError: The truth value of an array is ambiguous.
Use a.empty, a.any() or a.all().
See the "gotchas" section of the pandas documentation for a more detailed discussion.
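The warning above can be reproduced safely by catching the error, and then replaced with an explicit reduction. This is a minimal sketch with hypothetical data; the exact wording of the message depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2]})   # hypothetical data

message = None
try:
    if df:            # ambiguous: a DataFrame has many values, not one truth value
        pass
except ValueError as exc:
    message = str(exc)

# Say explicitly which question is being asked instead.
has_rows = not df.empty
any_positive = bool((df > 0).any().any())
all_positive = bool((df > 0).all().all())

print(message)
print(has_rows, any_positive, all_positive)
```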
Comparing if objects are equivalent
Often you may find that there is more than one way to compute the same result. As a simple example, consider df+df and df*2. To test that these two computations produce the same result, given the tools shown above, you might imagine using (df+df == df*2).all(). But in fact, this expression is False:
In [56]: df+df == df*2
Out[56]:
one two three
a True True False
b True True True
c True True True
d False True True
In [57]: (df+df == df*2).all()
Out[57]:
one False
two True
three False
dtype: bool
Notice that the boolean DataFrame df+df == df*2 contains some False values! This is because NaNs do not compare as equals:
In [58]: np.nan == np.nan
Out[58]: False
So, NDFrames (such as Series and DataFrames) have an equals() method for testing equality, with NaNs in corresponding locations treated as equal.
In [59]: (df+df).equals(df*2)
Out[59]: True
Note that the Series or DataFrame index needs to be in the same order for equality to be True:
In [60]: df1 = pd.DataFrame({'col':['foo', 0, np.nan]})
In [61]: df2 = pd.DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0])
In [62]: df1.equals(df2)
Out[62]: False
In [63]: df1.equals(df2.sort_index())
Out[63]: True
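The difference between == and equals() can be sketched on a small frame containing one NaN. The data is hypothetical. The last check relies on the documented rule that equals() also requires corresponding columns to have the same dtype.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, 3.0]})   # hypothetical data
right = left.copy()

elementwise = left == right                       # the NaN position compares False
same_elementwise = bool(elementwise.all().all())  # so this reduction is False
same_equals = left.equals(right)                  # NaNs in matching places count as equal

# Same values but different dtypes (int64 vs float64) are not "equal" for equals().
ints = pd.DataFrame({"x": [10, 20]})
floats = pd.DataFrame({"x": [10.0, 20.0]})
same_across_dtypes = ints.equals(floats)

print(same_elementwise, same_equals, same_across_dtypes)
```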
Comparing array-like objects
You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:
In [64]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[64]:
0 True
1 False
2 False
dtype: bool
In [65]: pd.Index(['foo', 'bar', 'baz']) == 'foo'
Out[65]: array([ True, False, False], dtype=bool)
Pandas also handles element-wise comparisons between different array-like objects of the same length:
In [66]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
Out[66]:
0 True
1 True
2 False
dtype: bool
In [67]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
Out[67]:
0 True
1 True
2 False
dtype: bool
Trying to compare Index or Series objects of different lengths will raise a ValueError:
In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
ValueError: Series lengths must match to compare
In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
ValueError: Series lengths must match to compare
Note that this is different from the NumPy behavior where a comparison can be broadcast:
In [68]: np.array([1, 2, 3]) == np.array([2])
Out[68]: array([False, True, False], dtype=bool)
or it can return False if broadcasting cannot be done (recent NumPy versions raise an error here instead):
In [69]: np.array([1, 2, 3]) == np.array([1, 2])
Out[69]: False
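The contrast between pandas and NumPy can be sketched as follows. The example only uses the cases whose behavior is stable across versions: NumPy broadcasting a length-1 array, and pandas refusing to compare Series of different lengths. The text of the pandas error message varies by version, so it is not checked here.

```python
import numpy as np
import pandas as pd

# NumPy broadcasts the single value 2 against every element.
broadcast = np.array([1, 2, 3]) == np.array([2])

# pandas does not broadcast a shorter Series; it raises instead.
raised = False
try:
    pd.Series(["foo", "bar", "baz"]) == pd.Series(["foo", "bar"])
except ValueError:
    raised = True

# Comparing against a scalar is always element-wise.
scalar_cmp = pd.Series(["foo", "bar", "baz"]) == "foo"

print(broadcast)
print(raised)
print(scalar_cmp)
```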
Combining overlapping data sets
A problem that occasionally arises is the combination of two similar data sets where values in one are preferred over the other. An example would be two data series representing a particular economic indicator where one is considered to be of "higher quality". However, the lower quality series might extend further back in history or have more complete data coverage. As such, we would like to combine two DataFrame objects where missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame. The function implementing this operation is combine_first(), which we illustrate:
In [70]: df1 = pd.DataFrame({'A' : [1., np.nan, 3., 5., np.nan],
....: 'B' : [np.nan, 2., 3., np.nan, 6.]})
....:
In [71]: df2 = pd.DataFrame({'A' : [5., 2., 4., np.nan, 3., 7.],
....: 'B' : [np.nan, np.nan, 3., 4., 6., 8.]})
....:
In [72]: df1
Out[72]:
A B
0 1.0 NaN
1 NaN 2.0
2 3.0 3.0
3 5.0 NaN
4 NaN 6.0
In [73]: df2
Out[73]:
A B
0 5.0 NaN
1 2.0 NaN
2 4.0 3.0
3 NaN 4.0
4 3.0 6.0
5 7.0 8.0
In [74]: df1.combine_first(df2)
Out[74]:
A B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0
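The "higher quality" scenario described above can also be sketched with two Series. The names high_quality and long_history and all the values are hypothetical, chosen only to show that labels are matched by index and that the preferred series wins wherever it has a value.

```python
import numpy as np
import pandas as pd

# Hypothetical indicator values indexed by year.
high_quality = pd.Series([np.nan, 2.5, np.nan, 4.5], index=[2001, 2002, 2003, 2004])
long_history = pd.Series([1.0, 2.0, 3.0], index=[2000, 2001, 2002])

# Prefer high_quality; fall back to long_history only where it is missing.
combined = high_quality.combine_first(long_history)

print(combined)
```

The year 2000 exists only in long_history, so it is taken from there. For 2002 both series have a value and the high_quality one is kept. The year 2003 stays NaN because neither series has a value for it.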
General DataFrame Combine
The combine_first() method above calls the more general DataFrame.combine(). This method takes another DataFrame and a combiner function, aligns the input DataFrame and then passes the combiner function pairs of Series (i.e., columns whose names are the same).
So, for instance, to reproduce combine_first() as above:
In [75]: combiner = lambda x, y: np.where(pd.isna(x), y, x)
In [76]: df1.combine(df2, combiner)
Out[76]:
A B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0
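The combiner does not have to fill missing values. As a further sketch, the example below passes np.minimum so that each result column holds the smaller of the two aligned values. The frame names and data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with the same labels.
first = pd.DataFrame({"A": [5, 0], "B": [2, 4]})
second = pd.DataFrame({"A": [1, 1], "B": [3, 3]})

# combine() hands np.minimum one pair of like-named columns at a time.
smallest = first.combine(second, np.minimum)

print(smallest)
```

Any function that accepts two Series and returns a Series (or a scalar) can be used in place of np.minimum.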