### Unit 2 - Matching and Broadcasting operations

CBSE Revision Notes

Class-11 Informatics Practices (New Syllabus)
Unit 2: Data Handling (DH-1)

DataFrame has the methods `add()`, `sub()`, `mul()`, `div()` and related functions `radd()`, `rsub()`, … for carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these functions, you can match on either the index or the columns via the `axis` keyword:

```python
In : import numpy as np

In : import pandas as pd

In : df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   ....:                    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   ....:                    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   ....: 

In : df
Out: 
        one       two     three
a -1.101558  1.124472       NaN
b -0.177289  2.487104 -0.634293
c  0.462215 -0.486066  1.931194
d       NaN -0.456288 -1.222918

In : row = df.iloc[1]  # second row (label 'b')

In : column = df['two']

In : df.sub(row, axis='columns')
Out: 
        one       two     three
a -0.924269 -1.362632       NaN
b  0.000000  0.000000  0.000000
c  0.639504 -2.973170  2.565487
d       NaN -2.943392 -0.588625

In : df.sub(row, axis=1)
Out: 
        one       two     three
a -0.924269 -1.362632       NaN
b  0.000000  0.000000  0.000000
c  0.639504 -2.973170  2.565487
d       NaN -2.943392 -0.588625

In : df.sub(column, axis='index')
Out: 
        one  two     three
a -2.226031  0.0       NaN
b -2.664393  0.0 -3.121397
c  0.948280  0.0  2.417260
d       NaN  0.0 -0.766631

In : df.sub(column, axis=0)
Out: 
        one  two     three
a -2.226031  0.0       NaN
b -2.664393  0.0 -3.121397
c  0.948280  0.0  2.417260
d       NaN  0.0 -0.766631
```
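Because the session above uses random numbers, its values cannot be reproduced exactly. The following is a minimal sketch of the same two matching modes on a small frame with fixed values; the data and the names `by_row` and `by_column` are illustrative assumptions, not part of the original notes.

```python
import pandas as pd

# Small fixed frame (hypothetical values) so the results are reproducible.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)

row = df.iloc[1]    # Series indexed by column labels: one=2.0, two=20.0
column = df["two"]  # Series indexed by row labels: a=10.0, b=20.0, c=30.0

# Match the Series index on the column labels: `row` is subtracted from every row.
by_row = df.sub(row, axis="columns")

# Match the Series index on the row labels: `column` is subtracted from every column.
by_column = df.sub(column, axis="index")

print(by_row)
print(by_column)
```

In `by_row` the row labelled `b` becomes all zeros, and in `by_column` the column `two` becomes all zeros, mirroring the pattern in the session above.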

Furthermore, you can align a level of a MultiIndexed DataFrame with a Series.

```python
In : dfmi = df.copy()

In : dfmi.index = pd.MultiIndex.from_tuples([(1,'a'), (1,'b'), (1,'c'), (2,'a')],
   ....:                                        names=['first','second'])
   ....: 

In : dfmi.sub(column, axis=0, level='second')
Out: 
                   one      two     three
first second                             
1     a      -2.226031  0.00000       NaN
      b      -2.664393  0.00000 -3.121397
      c       0.948280  0.00000  2.417260
2     a            NaN -1.58076 -2.347391
```
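A reproducible sketch of level alignment may make the behavior easier to check by hand. The frame, the `offsets` Series, and the column names `x` and `y` below are made-up illustrations; only `sub()` with the `axis` and `level` arguments comes from the text above.

```python
import pandas as pd

# Hypothetical two-level index: the inner level "second" repeats the label "a".
index = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (2, "a")], names=["first", "second"]
)
dfmi = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}, index=index)

# A Series keyed only by the labels of the inner level.
offsets = pd.Series([1.0, 2.0], index=["a", "b"])

# Each row is matched to the offset whose label equals its "second" level value.
result = dfmi.sub(offsets, axis=0, level="second")
print(result)
```

Both rows whose `second` label is `a` have 1.0 subtracted, regardless of their `first` label.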

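With Panel (a three-dimensional container that was deprecated in pandas 0.20 and removed in pandas 0.25), describing the matching behavior is more difficult, so its arithmetic methods instead let you specify the broadcast axis, which can be confusing next to the DataFrame convention. For example, suppose we want to demean the data over a particular axis. This can be done by taking the mean over an axis and broadcasting over the same axis. Here `wp` is an existing Panel with 2 items, 5 major-axis dates, and 4 minor-axis labels: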

```python
In : major_mean = wp.mean(axis='major')

In : major_mean
Out: 
      Item1     Item2
A -0.878036 -0.092218
B -0.060128  0.529811
C  0.099453 -0.715139
D  0.248599 -0.186535

In : wp.sub(major_mean, axis='major')
Out: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
```

And similarly for `axis="items"` and `axis="minor"`.
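Panel is not available in current pandas releases, so the Panel session cannot be run as written. As a stand-in, the same demeaning idea can be sketched on a two-dimensional DataFrame; this swaps Panel for DataFrame, and the data and variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data, chosen so the means are easy to verify by hand.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 60.0]})

# Demean each column: one mean per column, matched on the column labels.
col_mean = df.mean(axis=0)
demeaned_cols = df.sub(col_mean, axis="columns")

# Demean each row: one mean per row, matched on the row labels.
row_mean = df.mean(axis=1)
demeaned_rows = df.sub(row_mean, axis="index")

print(demeaned_cols)
print(demeaned_rows)
```

After demeaning, every column of `demeaned_cols` and every row of `demeaned_rows` averages to zero.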

Note

The `axis` argument in the DataFrame methods could be made to match the broadcasting behavior of Panel, though that would require a transition period so users can change their code…

Series and Index also support the `divmod()` builtin. This function performs floor division and the modulo operation at the same time, returning a two-tuple whose elements have the same type as the left-hand side. For example:

```python
In : s = pd.Series(np.arange(10))

In : s
Out: 
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In : div, rem = divmod(s, 3)

In : div
Out: 
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int64

In : rem
Out: 
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int64

In : idx = pd.Index(np.arange(10))

In : idx
Out: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In : div, rem = divmod(idx, 3)

In : div
Out: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In : rem
Out: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')
```

We can also do elementwise `divmod()`:

```python
In : div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In : div
Out: 
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int64

In : rem
Out: 
0    0
1    1
2    2
3    0
4    0
5    1
6    1
7    2
8    2
9    3
dtype: int64
```
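One way to convince yourself of what `divmod()` returns is to compare it with the separate `//` and `%` operators. The sketch below does that for both the scalar and the elementwise case; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

# Scalar divisor: the pair equals (s // 3, s % 3).
div, rem = divmod(s, 3)
matches_operators = div.equals(s // 3) and rem.equals(s % 3)

# Elementwise divisor: each element of s is divided by the divisor in the same position.
divisors = [2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
ediv, erem = divmod(s, divisors)

# Quotient times divisor plus remainder rebuilds the original values.
rebuilt = ediv * divisors + erem

print(matches_operators)
print(rebuilt.equals(s))
```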

The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.

NumPy operations are usually done on pairs of arrays on an element-by-element basis. In the simplest case, the two arrays must have exactly the same shape, as in the following example:

```python
>>> a = np.array([1.0, 2.0, 3.0])
>>> b = np.array([2.0, 2.0, 2.0])
>>> a * b
array([ 2.,  4.,  6.])
```

NumPy’s broadcasting rule relaxes this constraint when the arrays’ shapes meet certain constraints. The simplest broadcasting example occurs when an array and a scalar value are combined in an operation:

```python
>>> a = np.array([1.0, 2.0, 3.0])
>>> b = 2.0
>>> a * b
array([ 2.,  4.,  6.])
```

The result is equivalent to the previous example where `b` was an array. We can think of the scalar `b` being stretched during the arithmetic operation into an array with the same shape as `a`. The new elements in `b` are simply copies of the original scalar. The stretching analogy is only conceptual. NumPy is smart enough to use the original scalar value without actually making copies, so that broadcasting operations are as memory and computationally efficient as possible.

The code in the second example is more efficient than that in the first because broadcasting moves less memory around during the multiplication (`b` is a scalar rather than an array).
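The "stretching without copying" idea can be made visible with `np.broadcast_to`, which returns a read-only view of the stretched value. This is only an illustrative sketch: ordinary arithmetic such as `a * b` does not require calling `np.broadcast_to` yourself.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 2.0

# View of the scalar stretched to the shape of `a`; no data is duplicated.
stretched = np.broadcast_to(b, a.shape)

print(stretched)                  # [2. 2. 2.]
print(stretched.strides)          # (0,) : every element reads the same memory
print(stretched.flags.writeable)  # False : the view is read-only

# The scalar and the explicit array give the same product.
same = np.array_equal(a * b, a * np.array([2.0, 2.0, 2.0]))
print(same)                       # True
```

A stride of zero means that stepping along the axis does not move through memory, which is why the stretched value costs no extra storage.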

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when

1. they are equal, or
2. one of them is 1

If these conditions are not met, a `ValueError: operands could not be broadcast together` exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays.
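The compatibility rule can be written out as a few lines of plain Python. The helper `broadcast_shape` below is a hypothetical teaching function, not part of NumPy; it only mirrors the rule as stated above for two shapes.

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Compare dimensions starting from the trailing one and working forward.
    # A shape that runs out of dimensions is padded with size 1.
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for dim_a, dim_b in pairs:
        if dim_a == dim_b or dim_a == 1 or dim_b == 1:
            # When one side is 1, the other side's size is used.
            result.append(dim_b if dim_a == 1 else dim_a)
        else:
            raise ValueError(f"incompatible dimensions {dim_a} and {dim_b}")
    return tuple(reversed(result))


print(broadcast_shape((256, 256, 3), (3,)))      # (256, 256, 3)
print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)
```

Calling `broadcast_shape((3,), (4,))` raises `ValueError`, because 3 and 4 are neither equal nor 1.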

Arrays do not need to have the same number of dimensions. For example, if you have a `256x256x3` array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules shows that they are compatible:

```
Image  (3d array): 256 x 256 x 3
Scale  (1d array):             3
Result (3d array): 256 x 256 x 3
```
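The RGB scaling case can be tried directly. In this sketch the image is a placeholder array of ones and the scale factors are arbitrary values chosen for illustration.

```python
import numpy as np

# Placeholder "image": every pixel has R = G = B = 1.0.
image = np.ones((256, 256, 3))

# One hypothetical scale factor per color channel (R, G, B).
scale = np.array([0.5, 1.0, 2.0])

# The (3,) array lines up with the trailing axis of the (256, 256, 3) array.
scaled = image * scale

print(scaled.shape)   # (256, 256, 3)
print(scaled[0, 0])   # [0.5 1.  2. ]
```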

When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or “copied” to match the other.

In the following example, both the `A` and `B` arrays have axes with length one that are expanded to a larger size during the broadcast operation:

```
A      (4d array):  8 x 1 x 6 x 1
B      (3d array):      7 x 1 x 5
Result (4d array):  8 x 7 x 6 x 5
```

Here are some more examples:

```
A      (2d array):  5 x 4
B      (1d array):      1
Result (2d array):  5 x 4

A      (2d array):  5 x 4
B      (1d array):      4
Result (2d array):  5 x 4

A      (3d array):  15 x 3 x 5
B      (3d array):  15 x 1 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 1
Result (3d array):  15 x 3 x 5
```
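These shape combinations can be confirmed by adding arrays of zeros with the listed shapes and reading the shape of the result. The list `cases` simply restates the shape pairs from the examples in these notes.

```python
import numpy as np

# (shape of A, shape of B) pairs taken from the examples.
cases = [
    ((8, 1, 6, 1), (7, 1, 5)),
    ((5, 4), (1,)),
    ((5, 4), (4,)),
    ((15, 3, 5), (15, 1, 5)),
    ((15, 3, 5), (3, 5)),
    ((15, 3, 5), (3, 1)),
]

# Adding zero-filled arrays is a cheap way to see the broadcast result shape.
results = [(np.zeros(a) + np.zeros(b)).shape for a, b in cases]

for (a, b), shape in zip(cases, results):
    print(a, "+", b, "->", shape)
```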

Here are examples of shapes that do not broadcast:

```
A      (1d array):  3
B      (1d array):  4 # trailing dimensions do not match

A      (2d array):      2 x 1
B      (3d array):  8 x 4 x 3 # second from last dimensions mismatched
```
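To see the failure cases in action, the addition can be wrapped in a `try`/`except` block. The helper `can_broadcast` is a hypothetical convenience function written for this sketch, not a NumPy API.

```python
import numpy as np


def can_broadcast(shape_a, shape_b):
    """Return True if arrays of the two shapes can be combined elementwise."""
    try:
        np.zeros(shape_a) + np.zeros(shape_b)
    except ValueError:
        # NumPy raises ValueError when the shapes are incompatible.
        return False
    return True


print(can_broadcast((3,), (4,)))         # False: trailing dimensions 3 and 4 differ
print(can_broadcast((2, 1), (8, 4, 3)))  # False: second-from-last dimensions 2 and 4 differ
print(can_broadcast((5, 4), (4,)))       # True
```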

An example of broadcasting in practice:

```python
>>> x = np.arange(4)
>>> xx = x.reshape(4,1)
>>> y = np.ones(5)
>>> z = np.ones((3,4))

>>> x.shape
(4,)

>>> y.shape
(5,)

>>> x + y
Traceback (most recent call last):
  ...
ValueError: operands could not be broadcast together with shapes (4,) (5,)

>>> xx.shape
(4, 1)

>>> y.shape
(5,)

>>> (xx + y).shape
(4, 5)

>>> xx + y
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.,  4.]])

>>> x.shape
(4,)

>>> z.shape
(3, 4)

>>> (x + z).shape
(3, 4)

>>> x + z
array([[ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.]])
```

Broadcasting provides a convenient way of taking the outer product (or any other outer operation) of two arrays. The following example shows an outer addition operation of two 1-d arrays:

```python
>>> a = np.array([0.0, 10.0, 20.0, 30.0])
>>> b = np.array([1.0, 2.0, 3.0])
>>> a[:, np.newaxis] + b
array([[  1.,   2.,   3.],
       [ 11.,  12.,  13.],
       [ 21.,  22.,  23.],
       [ 31.,  32.,  33.]])
```

Here the `newaxis` index operator inserts a new axis into `a`, making it a two-dimensional `4x1` array. Combining the `4x1` array with `b`, which has shape `(3,)`, yields a `4x3` array.
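The same pattern works for other outer operations. As a sketch, the snippet below builds the outer sum from the example and an outer product, and compares the product with `np.outer` as a cross-check; the variable names are illustrative.

```python
import numpy as np

a = np.array([0.0, 10.0, 20.0, 30.0])
b = np.array([1.0, 2.0, 3.0])

# a[:, np.newaxis] has shape (4, 1); combined with shape (3,) it gives (4, 3).
outer_sum = a[:, np.newaxis] + b
outer_product = a[:, np.newaxis] * b

# np.outer computes the same outer product directly.
same_as_outer = np.array_equal(outer_product, np.outer(a, b))

print(a[:, np.newaxis].shape)  # (4, 1)
print(outer_sum.shape)         # (4, 3)
print(same_as_outer)           # True
```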