TL, DR
Pandas has great parsing tools to let you import data from multiple file formats (CSV, Excel, etc…). Sometimes in your data you have “NA” strings that are valid values. However, Pandas default will consider those strings as markers for missing values. Below some option you can use to prevent this default and import NA strings in Pandas the right way.
When NA is a valid value
Importing data in Pandas is one of the most common steps in Data Science. You have some data in tabular form, as an Excel File, or a CSV one, and you want to load it for your analysis.
Pandas provides you with simple functions link read_csv
or read_excel
that take care of all the heavy lifting for you. However, there are some corner cases that are handled by default in a way that could seriously mess up your data.
In fact, “NA” strings are by default supposed to indicate missing data. But in several cases “NA” is indeed a valid data point. For instance, it can indicate the Italian city of Naples or the country of Namibia. Or Sodium mineral in completely different contexts.
By default Pandas will parse those strings as missing values, so your DataFrame would not be exactly as you expect. Below a quick example that you can run to check this yourself.
from io import StringIO
import pandas as pd
data = """COUNTY,VALUE
SA,97
NA,105
BN,NaN
AV,87"""
print(pd.read_csv(StringIO(data)))
The code above will give you this result:
COUNTY VALUE
0 SA 97
1 NaN 105
2 BN 85
3 AV 87
You see, our NA data point has been labeled as missing int he COUNTY column. Also the VALUE for BN has also been parsed as missing, this time correctly. The wrong importing will impair any further analysis. Luckily, there are ways to fix this issue.
Change Pandas way to handle NA strings
Let’s see what settings will enable us to import NA strings in Pandas as valid values. There are two key parameters in the read_csv
and read_excel
functions that we can use for this purpose:
keep_default_na
na_values
The first one is a True/False flag to tell Pandas whether to use its defaults values to indicate missing data points. The second one is a list of values that need to be treated as missing data points, and gets activated if the former flag is set to False.
The default values parsed as missing data points, as of March 2022 are:
‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’
In order to handle correctly our toy example, we should then use those options as follow:
from io import StringIO
import pandas as pd
data = """COUNTY,VALUE
SA,97
NA,105
BN,NaN
AV,87"""
print(pd.read_csv(StringIO(data), keep_default_na=False, na_values=['NULL','null', 'nan','NaN']))
This will produce the following output:
COUNTY VALUE
0 SA 97.0
1 NA 105.0
2 BN NaN
3 AV 87.0
As you can see, this time our NA COUNTY has been imported correctly, as well as the missing VALUE for BN COUNTY. Of course, you should implement a list of markers for missing values in line with your need.
Related links
- StackOverflow answers link1 and link2
- Pandas read_csv documentation link
- Pandas read_excel documentation link
Do you like our content? Check more of our posts in our blog!