Handling Missing Values with PySpark

 


   PySpark provides several methods for handling missing values. You can use the dropna() function to remove rows with missing values from a DataFrame. This function lets you control which rows are dropped through its how, thresh, and subset parameters.
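
   The snippets below assume a SparkSession and a small DataFrame named df with name, salary, and address columns. The data here is purely hypothetical; a minimal setup sketch could look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("missing-values-demo").getOrCreate()

# Hypothetical employee data with some missing (None) entries
df = spark.createDataFrame(
    [
        ("Alice", 3000.0, "New York"),
        ("Bob",   None,   None),
        ("Carol", None,   "Chicago"),
        ("David", 5000.0, None),
        (None,    None,   None),
    ],
    ["name", "salary", "address"],
)
df.show()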


   To check whether a column contains missing values, use the isNull() and isNotNull() functions.

# To check if there is a missing value -- ISNULL
display(df.filter(df['salary'].isNull()))

# To check non null values -- ISNOTNULL
display(df.filter(df['salary'].isNotNull()))


   To drop rows containing null values, use the na.drop() function, which is interchangeable with dropna().

The how parameter takes two values:

any = drop a row if it contains ANY nulls

all = drop a row only if ALL of its values are null

df.na.drop(how="any")
df.na.drop(how="all")



   To drop rows that have fewer than thresh non-null values, use the thresh parameter as shown below. When thresh is specified, it overrides the how parameter.

df.na.drop(thresh=2)

To impute missing values with a specified value, use the na.fill() method.

   The syntax for this method is df.na.fill(value, subset), where value is the replacement and subset is the column (or list of columns) to fill.

# If no column is given, the value is filled into every column whose
# type matches the value's type.
# na.fill() and fillna() are interchangeable.
df.na.fill('No address', 'address')
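
   na.fill() also accepts a dictionary mapping column names to replacement values, which lets you fill several columns with different values in one call (again using the hypothetical df from above):

# Fill salary nulls with 0 and address nulls with 'No address' in a single pass
df_filled = df.na.fill({"salary": 0, "address": "No address"})
df_filled.show()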


   To impute NA values with a central tendency measure (mean, median, or mode), use the Imputer from pyspark.ml.feature as shown below.

from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=['salary'],
    outputCols=["{}_imputed".format(a) for a in ['salary']]
).setStrategy("mean")

# Fit the imputer and transform the DataFrame, writing the mean
# in place of the null values in a new salary_imputed column.
imputer.fit(df).transform(df)
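
   To inspect the result with the hypothetical df from above, assign the transformed DataFrame and select the original and imputed columns; the strategy can also be set to "median" or "mode":

# The imputed values appear in the new salary_imputed column
df_imputed = imputer.fit(df).transform(df)
df_imputed.select("salary", "salary_imputed").show()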


   It's important to choose the appropriate method for handling missing values based on the context and the needs of your analysis.


If you found this post useful, please don't forget to share and leave a comment below.


