Import NA strings in Pandas the right way

TL, DR

Pandas has great parsing tools to let you import data from multiple file formats (CSV, Excel, etc…). Sometimes in your data you have “NA” strings that are valid values. However, Pandas default will consider those strings as markers for missing values. Below some option you can use to prevent this default and import NA strings in Pandas the right way.

When NA is a valid value

Importing data in Pandas is one of the most common steps in Data Science. You have some data in tabular form, as an Excel File, or a CSV one, and you want to load it for your analysis.

Pandas provides you with simple functions link read_csv or read_excel that take care of all the heavy lifting for you. However, there are some corner cases that are handled by default in a way that could seriously mess up your data.

In fact, “NA” strings are by default supposed to indicate missing data. But in several cases “NA” is indeed a valid data point. For instance, it can indicate the Italian city of Naples or the country of Namibia. Or Sodium mineral in completely different contexts.

By default Pandas will parse those strings as missing values, so your DataFrame would not be exactly as you expect. Below a quick example that you can run to check this yourself.

from io import StringIO
import pandas as pd

data = """COUNTY,VALUE
SA,97
NA,105
BN,NaN
AV,87"""

print(pd.read_csv(StringIO(data)))

The code above will give you this result:

  COUNTY  VALUE
0     SA     97
1    NaN    105
2     BN     85
3     AV     87

You see, our NA data point has been labeled as missing int he COUNTY column. Also the VALUE for BN has also been parsed as missing, this time correctly. The wrong importing will impair any further analysis. Luckily, there are ways to fix this issue.

Change Pandas way to handle NA strings

Let’s see what settings will enable us to import NA strings in Pandas as valid values. There are two key parameters in the read_csv and read_excel functions that we can use for this purpose:

keep_default_na
na_values

The first one is a True/False flag to tell Pandas whether to use its defaults values to indicate missing data points. The second one is a list of values that need to be treated as missing data points, and gets activated if the former flag is set to False.

The default values parsed as missing data points, as of March 2022 are:

‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’

In order to handle correctly our toy example, we should then use those options as follow:

from io import StringIO
import pandas as pd

data = """COUNTY,VALUE
SA,97
NA,105
BN,NaN
AV,87"""

print(pd.read_csv(StringIO(data), keep_default_na=False, na_values=['NULL','null', 'nan','NaN']))

This will produce the following output:

  COUNTY  VALUE
0     SA   97.0
1     NA  105.0
2     BN    NaN
3     AV   87.0

As you can see, this time our NA COUNTY has been imported correctly, as well as the missing VALUE for BN COUNTY. Of course, you should implement a list of markers for missing values in line with your need.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

TL, DR

When NA is a valid value

Change Pandas way to handle NA strings

Related links