Cleaning data with Pandas

Pandas is a data analysis library for Python, widely used by data scientists. From data wrangling to data cleaning, it offers plenty of functionality to make life easier when working with data. It’s a must-know for anyone looking to get started in the field.

Pandas comes with a wide variety of features; here we’ll focus mainly on data cleaning.

First, let’s import Pandas as follows:

import pandas as pd

Then let’s use the read_csv function to read a “.csv” file. In this tutorial we’ll use the Open Exoplanet Catalogue:

oec = pd.read_csv('oec.csv')

If you’re using a Jupyter notebook and evaluate oec in a cell, you should see the data rendered as a table.

[Screenshot: the oec DataFrame displayed in a Jupyter notebook]

It’s a DataFrame made of rows and columns (3584 rows × 25 columns, to be exact).
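If you don’t have the catalogue at hand, here is a minimal sketch of the same steps on a tiny stand-in. The three rows and their values are made up for illustration (only the two column names we use later are taken from the tutorial), and io.StringIO simply lets read_csv read from a string instead of a file.

```python
import io

import pandas as pd

# Hypothetical stand-in for oec.csv: made-up rows, with some fields left empty.
csv_text = """PlanetIdentifier,PlanetaryMassJpt,SurfaceTempK
planet-a,1.5,
planet-b,,1200.0
planet-c,0.8,950.0
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # (number of rows, number of columns)
print(list(sample.columns))   # column names
print(sample.head())          # first rows; empty fields show up as NaN
```

Note how the empty fields in the text come back as NaN, which is how Pandas marks missing values.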

Typing oec.columns lists the column names.

[Screenshot: output of oec.columns]

We get to see all 25 columns. Here we’re only interested in the planetary mass and the surface temperature, which are “PlanetaryMassJpt” and “SurfaceTempK” respectively.

Let’s take a look at those values separately. We can do that by typing oec. followed by the column name, for example oec.PlanetaryMassJpt.

[Screenshot: output of oec.PlanetaryMassJpt]

[Screenshot: output of oec.SurfaceTempK]
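As a quick sketch of column access, using a small made-up DataFrame in place of oec (the values are illustrative only): the dot notation and the bracket notation return the same Series. Brackets are the safer habit when a column name contains spaces or clashes with a DataFrame method.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for oec with the two columns of interest.
sample = pd.DataFrame({
    "PlanetaryMassJpt": [1.5, np.nan, 0.8],
    "SurfaceTempK": [np.nan, 1200.0, 950.0],
})

mass_attr = sample.PlanetaryMassJpt         # attribute (dot) access
mass_brackets = sample["PlanetaryMassJpt"]  # bracket access

# Both are the same one-dimensional Series.
print(type(mass_attr).__name__)
print(mass_attr.equals(mass_brackets))
```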

As you can see, there are many “NaN” values in both of those columns. NaN (“not a number”) is how Pandas marks a missing value.

By using the dropna() method we can get rid of them.

[Screenshot: output of oec.PlanetaryMassJpt.dropna()]

[Screenshot: output of oec.SurfaceTempK.dropna()]

Note that we’ve lost some rows by dropping those undefined values: we now have 1313 rows for “PlanetaryMassJpt” and 741 rows for “SurfaceTempK”.
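Here is a small sketch of what dropna() does to a single column, on made-up values rather than the real catalogue. It returns a new Series without the missing entries, leaves the original untouched, and keeps the original row labels of the surviving values.

```python
import numpy as np
import pandas as pd

# Made-up mass column with two missing values.
mass = pd.Series([1.5, np.nan, 0.8, np.nan], name="PlanetaryMassJpt")

# isna() flags missing entries; summing the flags counts them.
print(mass.isna().sum())

# dropna() returns a new Series; `mass` itself is not modified.
mass_clean = mass.dropna()

print(len(mass), len(mass_clean))   # length before and after
print(list(mass_clean.index))       # surviving rows keep their original labels
```

Because the surviving rows keep their labels, two columns cleaned separately will generally not share the same index.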

To make things clean and simple, let’s assign oec.PlanetaryMassJpt.dropna() and oec.SurfaceTempK.dropna() to the variables PlanetaryMassJpt_clean and SurfaceTempK_clean.

[Screenshot: assigning the cleaned columns to PlanetaryMassJpt_clean and SurfaceTempK_clean]

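Now that we’re done with that, let’s create a new DataFrame from our two cleaned columns. We can do that with the DataFrame() constructor available in Pandas.

[Screenshot: the new DataFrame built with pd.DataFrame()]

Here we still see some “NaN” values. That’s because the two cleaned columns did not keep the same rows: Pandas lines them up by row index and fills the gaps with NaN. Let’s drop them!

[Screenshot: the new DataFrame after calling dropna()]

That’s better! Now we have a clean DataFrame showing both the PlanetaryMassJpt and the SurfaceTempK of our exoplanets.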
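The whole sequence can be sketched on a made-up four-row stand-in for oec. The screenshots are not reproduced here, so passing a dict of Series to pd.DataFrame() is an assumption about how the DataFrame was built; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for oec: each column is missing a value in a different row.
sample = pd.DataFrame({
    "PlanetaryMassJpt": [1.5, np.nan, 0.8, 2.0],
    "SurfaceTempK": [np.nan, 1200.0, 950.0, 300.0],
})

# Clean each column on its own; each keeps a different set of row labels.
PlanetaryMassJpt_clean = sample.PlanetaryMassJpt.dropna()
SurfaceTempK_clean = sample.SurfaceTempK.dropna()

# Building a DataFrame from the two Series aligns them on the row index,
# so rows present in only one Series get NaN in the other column.
combined = pd.DataFrame({
    "PlanetaryMassJpt": PlanetaryMassJpt_clean,
    "SurfaceTempK": SurfaceTempK_clean,
})
print(combined)

# A second dropna() keeps only the rows where both values are present.
both = combined.dropna()
print(both)
```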

We could’ve instead passed oec.PlanetaryMassJpt and oec.SurfaceTempK directly into our DataFrame and called dropna() on it, but for the sake of the tutorial we’ve done it “the other way”.

That was a little introduction to Pandas. There is a lot more interesting functionality available, and going through all of it would take a lot of time.

Pandas has great documentation, so you can always experiment and learn by yourself!

Wishing you the best!
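A minimal sketch of that shorter route, again on a made-up stand-in for oec: select the two columns and call dropna() once. The subset argument of dropna() is a related option that keeps every column but only checks the listed ones for missing values.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for oec, with one extra column that is never missing.
sample = pd.DataFrame({
    "PlanetIdentifier": ["planet-a", "planet-b", "planet-c", "planet-d"],
    "PlanetaryMassJpt": [1.5, np.nan, 0.8, 2.0],
    "SurfaceTempK": [np.nan, 1200.0, 950.0, 300.0],
})

# Select both columns, then drop every row that has a NaN in either of them.
direct = sample[["PlanetaryMassJpt", "SurfaceTempK"]].dropna()
print(direct)

# Variant: keep all columns, but only look at these two when dropping rows.
kept_all_columns = sample.dropna(subset=["PlanetaryMassJpt", "SurfaceTempK"])
print(kept_all_columns)
```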
