### Unit 2 - Missing data and filling values

CBSE Revision Notes

Class-11 Informatics Practices (New Syllabus)
Unit 2: Data Handling (DH-1)


Missing data is a common problem in real-life scenarios. Areas like machine learning and data mining face severe accuracy issues in their model predictions because of poor data quality caused by missing values. In these areas, treating missing values is a major point of focus for making models more accurate and valid.

When and Why Is Data Missing?

Consider an online survey for a product. Often, people do not share all the information related to them. Some share their experience but not how long they have been using the product; others share their experience and how long they have been using the product, but not their contact information. Thus, in one way or another, part of the data is always missing, and this is very common in real-world data.

Let us now see how we can handle missing values (say NA or NaN) using Pandas.

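```python
# import the pandas and NumPy libraries
import pandas as pd
import numpy as np

# 5 rows of random values, labelled a, c, e, f, h (values differ on each run)
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

# reindexing with the extra labels b, d, g creates rows filled with NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
```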

Its output is as follows −

```
         one        two      three
a   0.077988   0.476149   0.965836
b        NaN        NaN        NaN
c  -0.390208  -0.551605  -2.301950
d        NaN        NaN        NaN
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g        NaN        NaN        NaN
h   0.085100   0.532791   0.887415
```

Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.
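Because the example above uses random numbers, its output changes on every run. The sketch below shows the same reindexing idea with fixed values, so the NaN rows are easy to verify. The column values and the name `scores` are made up for illustration.

```python
import pandas as pd

# Hypothetical fixed values so the result is the same on every run.
scores = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "c", "e"],
)

# Labels 'b' and 'd' do not exist in the original index,
# so reindex() creates those rows and fills them with NaN.
reindexed = scores.reindex(["a", "b", "c", "d", "e"])
print(reindexed)
```

Rows `a`, `c` and `e` keep their original values; only the newly introduced labels hold NaN.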

Check for Missing Values

To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects −

Example 1

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# True wherever column 'one' holds a missing value
print(df['one'].isnull())
```

Its output is as follows −

```
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
```

Example 2

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# True wherever column 'one' holds a valid (non-missing) value
print(df['one'].notnull())
```

Its output is as follows −

```
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
```
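A common follow-up to isnull() is counting how many values are missing. The sketch below assumes a small, made-up survey table (the column names `experience` and `months_used` are hypothetical) and relies on the fact that True counts as 1 when summed.

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; np.nan marks an answer that was not given.
survey = pd.DataFrame({
    "experience": ["good", np.nan, "poor", "good"],
    "months_used": [12, 3, np.nan, np.nan],
})

# isnull() gives a True/False table; summing it counts the True values per column.
missing_per_column = survey.isnull().sum()

# any(axis=1) marks each row that has at least one missing value.
rows_with_missing = survey.isnull().any(axis=1)

print(missing_per_column)
print("Rows with at least one missing value:", rows_with_missing.sum())
```

Here `experience` has one missing value, `months_used` has two, and three of the four rows are incomplete.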

Calculations with Missing Data

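• When summing data, NA values are skipped, so they effectively count as zero
• If the data are all NA, older pandas versions returned NA; pandas 0.22 and later return 0 by default, and return NA only when min_count=1 is passed to sum()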

Example 1

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# NaN rows are skipped; the total depends on the random values drawn
print(df['one'].sum())
```

Its output is as follows −

`2.02357685917`

Example 2

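```python
import pandas as pd
import numpy as np

# a DataFrame with an index and columns but no data: every cell is NaN
df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])

# min_count=1 requires at least one valid value for a numeric result;
# without it, recent pandas versions return 0 for an all-NA column
print(df['one'].sum(min_count=1))
```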

Its output is as follows −

`nan`
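The rules for sums can be checked on small Series with fixed values. This is a minimal sketch, assuming pandas 0.22 or later, where an all-NA sum returns 0 by default and `min_count=1` asks for NaN instead; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 2.5])
all_missing = pd.Series([np.nan, np.nan], dtype=float)

# Default behaviour: NaN is skipped, so only 1.0 and 2.5 are added.
skip_total = values.sum()

# skipna=False lets a single NaN make the whole result NaN.
strict_total = values.sum(skipna=False)

# All-NA data: 0 by default (pandas 0.22+), NaN when min_count=1 is given.
default_empty = all_missing.sum()
strict_empty = all_missing.sum(min_count=1)

print(skip_total, strict_total, default_empty, strict_empty)
```

Passing `min_count=1` is useful when a total of 0 would be misleading because no valid value was present at all.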

Cleaning / Filling Missing Data

Pandas provides several methods for cleaning missing values. The fillna() function can "fill in" NA values with non-null data in a couple of ways, which are illustrated in the following sections.

Replace NaN with a Scalar Value

The following program shows how you can replace "NaN" with "0".

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])

# label 'b' is new, so its row is filled with NaN
df = df.reindex(['a', 'b', 'c'])
print(df)

print("NaN replaced with '0':")
print(df.fillna(0))
```

Its output is as follows −

```
         one        two     three
a  -0.576991  -0.741695  0.553172
b        NaN        NaN       NaN
c   0.744328  -1.735166  1.749580
NaN replaced with '0':
         one        two     three
a  -0.576991  -0.741695  0.553172
b   0.000000   0.000000  0.000000
c   0.744328  -1.735166  1.749580
```

Here, we are filling with value zero; instead we can also fill with any other value.
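As an illustration of filling with values other than zero, the sketch below fills each column with its own constant and, separately, with the column mean. The data and the chosen constants are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [np.nan, 4.0, 8.0]})

# A dict gives a different fill value for each column.
filled_const = df.fillna({"one": 0.0, "two": -1.0})

# df.mean() skips NaN, so each gap is filled with the mean of the valid values
# in its own column: 2.0 for 'one' and 6.0 for 'two'.
filled_mean = df.fillna(df.mean())

print(filled_const)
print(filled_mean)
```

Filling with the mean keeps the column average unchanged, which is often preferable to zero when the column is used in later calculations.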

Fill NA Forward and Backward

Using the filling concepts discussed in the ReIndexing chapter, we can fill missing values forward (from the previous valid value) or backward (from the next valid value).

| Method | Action |
|---|---|
| pad/ffill | Fill methods Forward |
| bfill/backfill | Fill methods Backward |

Example 1

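```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# forward fill: each NaN takes the previous valid value in its column
# (older pandas versions wrote this as df.fillna(method='pad'))
print(df.ffill())
```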

Its output is as follows −

```
         one        two      three
a   0.077988   0.476149   0.965836
b   0.077988   0.476149   0.965836
c  -0.390208  -0.551605  -2.301950
d  -0.390208  -0.551605  -2.301950
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g  -0.930230  -0.670473   1.146615
h   0.085100   0.532791   0.887415
```

Example 2

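```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# backward fill: each NaN takes the next valid value in its column
# (older pandas versions wrote this as df.fillna(method='backfill'))
print(df.bfill())
```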

Its output is as follows −

```
         one        two      three
a   0.077988   0.476149   0.965836
b  -0.390208  -0.551605  -2.301950
c  -0.390208  -0.551605  -2.301950
d  -2.000303  -0.788201   1.510072
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g   0.085100   0.532791   0.887415
h   0.085100   0.532791   0.887415
```
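Forward and backward filling are easier to follow on a short Series with fixed values. The sketch below also uses the `limit` argument, which is not covered above, to cap how many consecutive gaps are filled; the data is made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0], index=["a", "b", "c", "d"])

# Forward fill: gaps at 'b' and 'c' copy the earlier value 1.0.
forward = s.ffill()

# Backward fill: gaps at 'b' and 'c' copy the later value 4.0.
backward = s.bfill()

# limit=1 fills only the first gap in each run; 'c' stays NaN.
limited = s.ffill(limit=1)

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

A limit is helpful when a long run of missing values should not be filled from a single old observation.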

Drop Missing Values

If you want to simply exclude the missing values, use the dropna function along with the axis argument. By default, axis=0, i.e., along rows, which means that if any value within a row is NA, the whole row is excluded. With axis=1, any column that contains an NA is excluded instead.

Example 1

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# drop every row that contains at least one NaN (rows b, d, g)
print(df.dropna())
```

Its output is as follows −

```
         one        two      three
a   0.077988   0.476149   0.965836
c  -0.390208  -0.551605  -2.301950
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
h   0.085100   0.532791   0.887415
```

Example 2

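```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# drop every column that contains at least one NaN;
# here all three columns do, so no columns remain
print(df.dropna(axis=1))
```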

Its output is as follows −

```
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
```
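Dropping every row with any NA can discard a lot of data. As a sketch of some gentler options, the example below uses the `how`, `subset` and `thresh` arguments of dropna, which are not covered above; the table is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, np.nan, np.nan],
        "two": [2.0, 5.0, np.nan],
        "three": [3.0, 6.0, np.nan],
    },
    index=["a", "b", "c"],
)

# Default: drop a row if ANY value is missing, so only 'a' survives.
any_missing = df.dropna()

# how='all': drop a row only if EVERY value is missing, so only 'c' goes.
all_missing = df.dropna(how="all")

# subset: look for missing values in column 'two' only.
need_two = df.dropna(subset=["two"])

# thresh=3: keep rows with at least 3 non-missing values.
need_three_values = df.dropna(thresh=3)

print(list(any_missing.index), list(all_missing.index))
print(list(need_two.index), list(need_three_values.index))
```

None of these calls changes `df` itself; each returns a new DataFrame.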

Replace Missing (or) Generic Values

Often, we have to replace a generic value with some specific value. We can achieve this by applying the replace method.

Replacing NA with a scalar value using replace() is equivalent to using the fillna() function.

Example 1

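```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
                   'two': [1000, 0, 30, 40, 50, 60]})

# the dict maps each old value to its replacement: 1000 -> 10, 2000 -> 60
print(df.replace({1000: 10, 2000: 60}))
```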

Its output is as follows −

```
    one  two
0    10   10
1    20    0
2    30   30
3    40   40
4    50   50
5    60   60
```
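The statement that replacing NA with a scalar behaves like fillna() can be checked directly. This is a minimal sketch on made-up data, comparing the two results with `DataFrame.equals`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [10.0, np.nan, 2000.0], "two": [1000.0, 0.0, np.nan]})

# Replace NaN with 0 in two different ways.
via_replace = df.replace(np.nan, 0)
via_fillna = df.fillna(0)

# equals() is True when both frames have the same shape, dtypes and values.
same_result = via_replace.equals(via_fillna)
print(same_result)
```

fillna() is usually the clearer choice for missing values, while replace() is the more general tool for swapping any specific value.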
