In PySpark, you can use the dropDuplicates() function to remove duplicate rows from a DataFrame. You can optionally pass a subset of columns to consider when identifying duplicates; for each group of duplicates, one row is kept and the rest are dropped.
However, before dropping anything, you may want to check whether your dataset contains duplicates at all:
Count the number of distinct rows in a subset of columns and compare it with the total number of rows. If the two counts match, there are no duplicates in the selected subset. Otherwise, duplicates exist.
Here is an example of how to use dropDuplicates() to remove all duplicate rows from a DataFrame:
You can also use the dropDuplicates() function to remove duplicates based on specific columns. For example:
Note that dropDuplicates() does not let you choose between the first and the last occurrence of each duplicate row; it simply keeps one arbitrary row per duplicate group. If you need deterministic control over which occurrence survives, a common approach is to rank rows within each group with a window function and keep only the top-ranked row.
It's important to choose the appropriate method for handling duplicate values based on the context and the needs of your analysis.
If you found this post useful, please don't forget to share and leave a comment below.