Pandas is a data analysis library for Python, widely used by data scientists. From data wrangling to data cleaning, it offers many features that make working with data easier. It’s a must-know for anyone looking to get started in the field.
Pandas comes with a wide variety of features; here we’ll focus mainly on data cleaning.
First, let’s import Pandas as follows:
import pandas as pd
Then let’s use the read_csv function to read a “.csv” file. In this tutorial we’ll use the Open Exoplanet Catalogue:
oec = pd.read_csv('oec.csv')
If you’re using a Jupyter notebook, you should see the contents of oec rendered as a table.
It’s a DataFrame made of rows and columns (3584 rows × 25 columns, to be exact).
We get to see all 25 columns, but here we’re only interested in each planet’s mass and its surface temperature: the “PlanetaryMassJpt” and “SurfaceTempK” columns, respectively.
Let’s take a look at those values separately. We can do that by typing oec. followed by the column name, for example oec.PlanetaryMassJpt or oec.SurfaceTempK.
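Here is a small sketch of column access. Since the catalogue file may not be at hand, it uses a tiny made-up DataFrame (the values are invented for illustration; only the two column names come from the catalogue). Bracket access is shown as well, since it also works for column names that aren’t valid Python identifiers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the catalogue: invented values, real column names.
oec = pd.DataFrame({
    "PlanetaryMassJpt": [0.5, np.nan, 1.2, 3.0, 0.02],
    "SurfaceTempK": [np.nan, 300.0, 1500.0, 700.0, np.nan],
})

# Attribute access and bracket access both return the column as a Series.
mass = oec.PlanetaryMassJpt
temp = oec["SurfaceTempK"]

print(mass)
print(temp)
```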
As you can see, there are many “NaN” (missing) values in both of those columns.
By using the dropna() method we can get rid of them.
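As a minimal sketch, here is dropna() applied to each column of a small made-up DataFrame (invented values standing in for the catalogue; the names mass_without_nan and temp_without_nan are just for this example). Note that dropna() returns a new Series and leaves the original untouched, and that the surviving rows keep their original row labels.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the catalogue: invented values, real column names.
oec = pd.DataFrame({
    "PlanetaryMassJpt": [0.5, np.nan, 1.2, 3.0, 0.02],
    "SurfaceTempK": [np.nan, 300.0, 1500.0, 700.0, np.nan],
})

# dropna() returns a new Series without the NaN entries.
mass_without_nan = oec.PlanetaryMassJpt.dropna()
temp_without_nan = oec.SurfaceTempK.dropna()

# Compare the number of rows before and after dropping.
print(len(oec), len(mass_without_nan), len(temp_without_nan))

# The original column still contains its NaN values.
print(oec.PlanetaryMassJpt.isna().sum())
```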
Note that we’ve lost some rows by dropping those undefined values: we now have 1313 rows for “PlanetaryMassJpt” and 741 rows for “SurfaceTempK”.
To keep things clean and simple, let’s assign oec.PlanetaryMassJpt.dropna() and oec.SurfaceTempK.dropna() to PlanetaryMassJpt_clean and SurfaceTempK_clean.
Now that we’re done with that, let’s create a new DataFrame from our cleaned columns. We can do so with the DataFrame() constructor available in Pandas.
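One plausible way to build that DataFrame is to pass a dictionary of the two cleaned Series to pd.DataFrame(). The sketch below assumes that approach and uses a small made-up DataFrame in place of the catalogue; new_df is just a name chosen for this example. Pandas lines the two Series up by row label, so any label present in only one of them produces a NaN in the other column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the catalogue: invented values, real column names.
oec = pd.DataFrame({
    "PlanetaryMassJpt": [0.5, np.nan, 1.2, 3.0, 0.02],
    "SurfaceTempK": [np.nan, 300.0, 1500.0, 700.0, np.nan],
})

PlanetaryMassJpt_clean = oec.PlanetaryMassJpt.dropna()
SurfaceTempK_clean = oec.SurfaceTempK.dropna()

# Build a new DataFrame from the two cleaned Series.
# Rows are matched on the index, not on position.
new_df = pd.DataFrame({
    "PlanetaryMassJpt": PlanetaryMassJpt_clean,
    "SurfaceTempK": SurfaceTempK_clean,
})

print(new_df)
# Count the NaN values that reappear per column after alignment.
print(new_df.isna().sum())
```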
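Here we still see some “NaN” values: Pandas aligns the two columns on their row labels, so a planet with only one of the two measurements still gets a row, with NaN in the other column. Let’s drop them!
That’s better! Now we have a clean DataFrame showing both the PlanetaryMassJpt and SurfaceTempK of our exoplanets.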
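For illustration, this is what dropna() does on a whole DataFrame, again with made-up values (df and clean_df are names chosen for this sketch). By default it drops every row that has a NaN in any column; passing how="all" would only drop rows where every column is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column DataFrame with gaps in both columns.
df = pd.DataFrame({
    "PlanetaryMassJpt": [0.5, np.nan, 1.2, 3.0, 0.02],
    "SurfaceTempK": [np.nan, 300.0, 1500.0, 700.0, np.nan],
})

# Default behaviour (how="any"): keep only rows with no NaN at all.
clean_df = df.dropna()

# how="all": drop a row only if every value in it is NaN.
only_empty_rows_dropped = df.dropna(how="all")

print(clean_df)
print(len(df), len(clean_df), len(only_empty_rows_dropped))
```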
We could instead have passed oec.PlanetaryMassJpt and oec.SurfaceTempK directly into our DataFrame and called dropna() on it, but for the sake of the tutorial we’ve done it “the other way”.
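A rough sketch of that shortcut, next to the longer route for comparison. It assumes a small made-up DataFrame in place of the catalogue, and the names direct and two_step are only for this example. Selecting the two columns with a list and calling dropna() once should leave the same rows as cleaning each column first.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the catalogue: invented values, real column names.
oec = pd.DataFrame({
    "PlanetaryMassJpt": [0.5, np.nan, 1.2, 3.0, 0.02],
    "SurfaceTempK": [np.nan, 300.0, 1500.0, 700.0, np.nan],
})

# Shortcut: select both columns, then drop incomplete rows in one call.
direct = oec[["PlanetaryMassJpt", "SurfaceTempK"]].dropna()

# Longer route: clean each column, rebuild a DataFrame, drop again.
two_step = pd.DataFrame({
    "PlanetaryMassJpt": oec.PlanetaryMassJpt.dropna(),
    "SurfaceTempK": oec.SurfaceTempK.dropna(),
}).dropna()

print(direct)
print(two_step)
```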
That was a short introduction to Pandas. There are many more interesting features available, and going through each of them would take a lot of time.
Pandas has great documentation, so you can always experiment and learn by yourself!
I wish you the best!