### Unit 2 - Comparisons and Boolean Reductions

CBSE Revision Notes

Class-11 Informatics Practices (New Syllabus)
Unit 2: Data Handling (DH-1)

Comparisons and Boolean Reductions

Flexible Comparisons

Series and DataFrame have the binary comparison methods `eq`, `ne`, `lt`, `gt`, `le`, and `ge`, whose behavior is analogous to the binary arithmetic operations described above:

    In : df.gt(df2)
    Out:
         one    two  three
    a  False  False  False
    b  False  False  False
    c  False  False  False
    d  False  False  False

    In : df2.ne(df)
    Out:
         one    two  three
    a  False  False   True
    b  False  False  False
    c  False  False  False
    d   True  False  False

These operations produce a pandas object of the same type as the left-hand-side input that is of dtype `bool`. These `boolean` objects can be used in indexing operations.
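As a minimal sketch of the point above, the following compares the method form with the operator form and then uses a boolean result for indexing. The frames `left` and `right` and their values are invented for illustration; they are not the `df` and `df2` of the session above.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
left = pd.DataFrame({"one": [1, 5, 3], "two": [4, 2, 6]}, index=["a", "b", "c"])
right = pd.DataFrame({"one": [2, 2, 2], "two": [3, 3, 3]}, index=["a", "b", "c"])

# The method form and the operator form give the same boolean DataFrame.
mask = left.gt(right)
same_as_operator = mask.equals(left > right)

# A boolean Series can select rows: keep rows where column "one" exceeds 2.
selected = left[left["one"].gt(2)]

print(mask)
print(selected)
```

The method form also accepts an `axis` argument (not shown here), which the operator form cannot take.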

Boolean Reductions

You can apply the reductions `empty`, `any()`, `all()`, and `bool()` to summarize a boolean result.

    In : (df > 0).all()
    Out:
    one      False
    two      False
    three    False
    dtype: bool

    In : (df > 0).any()
    Out:
    one      True
    two      True
    three    True
    dtype: bool

You can reduce to a final boolean value.

    In : (df > 0).any().any()
    Out: True
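The reductions can also run along rows rather than columns. The sketch below uses a small invented frame named `scores` (a hypothetical name, not from the notes) to show the per-column, per-row, and fully reduced results side by side.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
scores = pd.DataFrame({"one": [1, -2, 3], "two": [4, 5, 6]})
positive = scores > 0

per_column_all = positive.all()            # one value per column
per_row_any = positive.any(axis=1)         # one value per row
anything_positive = positive.any().any()   # reduced to a single value

print(per_column_all)
print(per_row_any)
print(anything_positive)
```

Column `one` contains a negative number, so `all()` is False for that column only, while every row has at least one positive entry.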

You can test if a pandas object is empty, via the `empty` property.

    In : df.empty
    Out: False

    In : pd.DataFrame(columns=list('ABC')).empty
    Out: True
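One detail worth checking for yourself: `empty` looks at the shape of the object, not at whether its values are missing. A short sketch, using an invented one-cell frame called `only_nan`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame holding a single missing value.
only_nan = pd.DataFrame({"A": [np.nan]})

print(only_nan.empty)           # a row of NaN still counts as a row
print(only_nan.dropna().empty)  # dropping that row leaves nothing behind
```

If "no usable data" is what you mean, dropping missing values before testing `empty` is one way to express it.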

To evaluate single-element pandas objects in a boolean context, use the method `bool()` (deprecated since pandas 2.1; for a Series, `item()` is the suggested replacement):

    In : pd.Series([True]).bool()
    Out: True

    In : pd.Series([False]).bool()
    Out: False

    In : pd.DataFrame([[True]]).bool()
    Out: True

    In : pd.DataFrame([[False]]).bool()
    Out: False

Warning

You might be tempted to do the following:

`>>> if df:   ...`

Or

`>>> df and df2`

These will both raise errors, as you are trying to compare multiple values.

`ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().`

See the gotchas section of the pandas documentation for a more detailed discussion.
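As a rough illustration of the warning, the sketch below triggers the error on purpose and then states the intended question explicitly. The frame and the variable names are invented for this example.

```python
import pandas as pd

frame = pd.DataFrame({"one": [1, 0], "two": [3, 4]})  # hypothetical data

message = None
try:
    if frame:  # ambiguous: which of the four values should decide?
        pass
except ValueError as exc:
    message = str(exc)

# Say what you actually mean instead.
has_rows = not frame.empty
any_truthy = bool(frame.any().any())
all_truthy = bool(frame.all().all())

print(message)
print(has_rows, any_truthy, all_truthy)
```

For element-wise logic between two boolean frames, use the operators `&` and `|` rather than `and` and `or`.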

Comparing if objects are equivalent

Often you may find that there is more than one way to compute the same result. As a simple example, consider `df+df` and `df*2`. To test that these two computations produce the same result, given the tools shown above, you might imagine using `(df+df == df*2).all()`. But in fact, this expression does not come out True for every column:

    In : df+df == df*2
    Out:
         one   two  three
    a   True  True  False
    b   True  True   True
    c   True  True   True
    d  False  True   True

    In : (df+df == df*2).all()
    Out:
    one      False
    two       True
    three    False
    dtype: bool

Notice that the boolean DataFrame `df+df == df*2` contains some False values! This is because NaNs do not compare as equals:

    In : np.nan == np.nan
    Out: False

So, NDFrames (such as Series and DataFrames) have an `equals()` method for testing equality, with NaNs in corresponding locations treated as equal.

    In : (df+df).equals(df*2)
    Out: True

Note that the Series or DataFrame index needs to be in the same order for equality to be True:

    In : df1 = pd.DataFrame({'col':['foo', 0, np.nan]})

    In : df2 = pd.DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0])

    In : df1.equals(df2)
    Out: False

    In : df1.equals(df2.sort_index())
    Out: True
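The difference between the element-wise check and `equals()` can be reproduced on a tiny frame. This is a sketch with invented data; `frame`, `naive_check`, and `robust_check` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value.
frame = pd.DataFrame({"one": [1.0, np.nan], "two": [3.0, 4.0]})

elementwise = (frame + frame == frame * 2)

# The NaN cell compares unequal to itself, so the full reduction is False.
naive_check = bool(elementwise.all().all())

# equals() treats NaNs in matching positions as equal.
robust_check = (frame + frame).equals(frame * 2)

print(elementwise)
print(naive_check, robust_check)
```

In test code, `pandas.testing.assert_frame_equal` is another option; it also treats NaNs in matching positions as equal and raises an `AssertionError` describing the first difference it finds.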

Comparing array-like objects

You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:

    In : pd.Series(['foo', 'bar', 'baz']) == 'foo'
    Out:
    0     True
    1    False
    2    False
    dtype: bool

    In : pd.Index(['foo', 'bar', 'baz']) == 'foo'
    Out: array([ True, False, False], dtype=bool)

Pandas also handles element-wise comparisons between different array-like objects of the same length:

    In : pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
    Out:
    0     True
    1     True
    2    False
    dtype: bool

    In : pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
    Out:
    0     True
    1     True
    2    False
    dtype: bool

Trying to compare `Index` or `Series` objects of different lengths will raise a ValueError:

    In : pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
    ValueError: Series lengths must match to compare

    In : pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
    ValueError: Series lengths must match to compare

Note that this is different from the NumPy behavior where a comparison can be broadcast:

    In : np.array([1, 2, 3]) == np.array([2])
    Out: array([False,  True, False], dtype=bool)

or, in older NumPy versions, it can return False if broadcasting cannot be done (recent NumPy releases raise an error instead):

    In : np.array([1, 2, 3]) == np.array([1, 2])
    Out: False
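The sketch below puts the two behaviors next to each other: the `==` operator on Series of different lengths raises, while NumPy broadcasts a length-1 array. It also shows one possible workaround, the `eq()` method, which aligns on index labels instead of raising; whether that alignment is what you want depends on your data, so treat it as an assumption. The exact wording of the error message varies between pandas versions, so only the exception type is checked.

```python
import numpy as np
import pandas as pd

three = pd.Series(["foo", "bar", "baz"])
two = pd.Series(["foo", "bar"])

# The == operator refuses to compare Series whose indexes differ.
try:
    three == two
    raised = False
except ValueError:
    raised = True

# NumPy broadcasts a length-1 array against a longer one.
broadcast = np.array([1, 2, 3]) == np.array([2])

# eq() aligns on labels first; label 2 has no partner in `two`, so it is False.
aligned = three.eq(two)

print(raised)
print(broadcast)
print(aligned)
```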

Combining overlapping data sets

A problem occasionally arising is the combination of two similar data sets where values in one are preferred over the other. An example would be two data series representing a particular economic indicator where one is considered to be of “higher quality”. However, the lower quality series might extend further back in history or have more complete data coverage. As such, we would like to combine two DataFrame objects where missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame. The function implementing this operation is `combine_first()`, which we illustrate:

    In : df1 = pd.DataFrame({'A' : [1., np.nan, 3., 5., np.nan],
      ....:                  'B' : [np.nan, 2., 3., np.nan, 6.]})
      ....:

    In : df2 = pd.DataFrame({'A' : [5., 2., 4., np.nan, 3., 7.],
      ....:                  'B' : [np.nan, np.nan, 3., 4., 6., 8.]})
      ....:

    In : df1
    Out:
         A    B
    0  1.0  NaN
    1  NaN  2.0
    2  3.0  3.0
    3  5.0  NaN
    4  NaN  6.0

    In : df2
    Out:
         A    B
    0  5.0  NaN
    1  2.0  NaN
    2  4.0  3.0
    3  NaN  4.0
    4  3.0  6.0
    5  7.0  8.0

    In : df1.combine_first(df2)
    Out:
         A    B
    0  1.0  NaN
    1  2.0  2.0
    2  3.0  3.0
    3  5.0  4.0
    4  3.0  6.0
    5  7.0  8.0
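The same method exists on Series, which makes the "higher quality versus longer history" scenario easy to sketch. The yearly values below are invented, and `preferred` and `fallback` are hypothetical names for the two sources.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly indicator values; both series are made up for illustration.
preferred = pd.Series([np.nan, 2.5, 2.7], index=[2001, 2002, 2003])
fallback = pd.Series([1.9, 2.1, 2.4, 2.6], index=[2000, 2001, 2002, 2003])

# Keep preferred values where present; fill gaps and missing years from fallback.
combined = preferred.combine_first(fallback)

print(combined)
```

The year 2000 exists only in `fallback`, and 2001 is missing in `preferred`, so both come from `fallback`; the remaining years keep the preferred values.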

General DataFrame Combine

The `combine_first()` method above calls the more general `DataFrame.combine()`. This method takes another DataFrame and a combiner function, aligns the input DataFrame and then passes the combiner function pairs of Series (i.e., columns whose names are the same).

So, for instance, to reproduce `combine_first()` as above:

    In : combiner = lambda x, y: np.where(pd.isna(x), y, x)

    In : df1.combine(df2, combiner)
    Out:
         A    B
    0  1.0  NaN
    1  2.0  2.0
    2  3.0  3.0
    3  5.0  4.0
    4  3.0  6.0
    5  7.0  8.0
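Because the combiner receives whole columns, it can also choose between them rather than fill gaps. As a sketch, the hypothetical helper `take_smaller` below keeps, for each column name, whichever column has the smaller sum; the frames `first` and `second` are invented for the example.

```python
import pandas as pd

# Hypothetical frames with the same column names.
first = pd.DataFrame({"A": [0, 0], "B": [4, 4]})
second = pd.DataFrame({"A": [1, 1], "B": [3, 3]})

def take_smaller(s1, s2):
    """Return the column (Series) whose values sum to less."""
    return s1 if s1.sum() < s2.sum() else s2

# combine() calls take_smaller once per shared column name.
result = first.combine(second, take_smaller)

print(result)
```

Column `A` comes from `first` (sum 0 versus 2) and column `B` from `second` (sum 6 versus 8).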