PySpark | How to remove duplicates from a DataFrame?

When working with large datasets in PySpark, it’s common to encounter duplicate records that can skew your analysis or cause issues in downstream processing. Fortunately, PySpark provides built-in methods to identify and remove duplicate rows from a DataFrame, ensuring the data is clean and ready for analysis. In this article, we’ll explore two methods to remove duplicates from a PySpark DataFrame: dropDuplicates() and distinct().

PySpark offers two primary methods to eliminate duplicates:

1) dropDuplicates() – Remove duplicates based on specific columns.
2) distinct() – Remove complete duplicate rows.

Method 1: Using dropDuplicates(): This method removes duplicate rows based on specific columns. It is especially useful when you want to ensure that certain columns are unique while retaining the other columns in the DataFrame.

Syntax: DataFrame.dropDuplicates([subset])

Parameter:
subset: Optional. A list of column names to check for duplicates. If not specified, all columns are considered when identifying duplicates.

Example 1: Removing Duplicates Based on a Single Column: Let’s say you have a DataFrame with customer data, and you want to remove duplicates based on the CustomerID column.

# Import necessary modules
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("Remove Duplicates Example").getOrCreate()

# Sample DataFrame with duplicates
data = [("Alice", 1, "2024-01-01"), ("Bob", 2, "2024-01-01"),
("Alice", 1, "2024-02-01"), ("Catherine", 3, "2024-01-01")]
columns = ["Name", "CustomerID", "Date"]
df = spark.createDataFrame(data, columns)

# Remove duplicates based on CustomerID
df_no_duplicates = df.dropDuplicates(["CustomerID"])

# Show the result
df_no_duplicates.show()

Output:

+---------+----------+----------+
|     Name|CustomerID|      Date|
+---------+----------+----------+
|    Alice|         1|2024-01-01|
|      Bob|         2|2024-01-01|
|Catherine|         3|2024-01-01|
+---------+----------+----------+

In this example:

  • The DataFrame contains duplicate entries for CustomerID = 1.
  • dropDuplicates(["CustomerID"]) keeps one row per CustomerID and drops the rest. Note that Spark does not guarantee which of the duplicate rows is retained; if you need a deterministic rule, see the sketch below.
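
If it matters which duplicate survives, a window function makes the rule explicit. The following is a minimal sketch (an addition to the article’s example, reusing the same df defined above) that keeps the row with the earliest Date for each CustomerID:

# Deterministic variant: rank rows within each CustomerID by Date,
# then keep only the earliest one.
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

w = Window.partitionBy("CustomerID").orderBy(col("Date").asc())

df_earliest = (
    df.withColumn("rn", row_number().over(w))  # rn = 1 for the earliest Date
      .filter(col("rn") == 1)
      .drop("rn")
)
df_earliest.show()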

Example 2: Removing Duplicates Based on Multiple Columns: You can also remove duplicates based on multiple columns. For instance, if you want to remove duplicates where both CustomerID and Date are the same, you can specify both columns.

# Remove duplicates based on CustomerID and Date
df_no_duplicates_multiple = df.dropDuplicates(["CustomerID", "Date"])

# Show the result
df_no_duplicates_multiple.show()

Output:

+---------+----------+----------+
|     Name|CustomerID|      Date|
+---------+----------+----------+
|    Alice|         1|2024-01-01|
|      Bob|         2|2024-01-01|
|    Alice|         1|2024-02-01|
|Catherine|         3|2024-01-01|
+---------+----------+----------+

Here, only rows where both CustomerID and Date are the same are considered duplicates. Therefore, both entries for Alice are retained because they have different Date values.
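
Before dropping anything, it can help to see which column combinations actually repeat. Here is a small sketch (an addition, not part of the original walkthrough) that counts occurrences of each (CustomerID, Date) pair in df:

from pyspark.sql.functions import count

# Count how many times each (CustomerID, Date) pair appears;
# any count above 1 marks a group of duplicates.
(df.groupBy("CustomerID", "Date")
   .agg(count("*").alias("cnt"))
   .filter("cnt > 1")
   .show())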

Method 2: Using distinct(): This method removes all duplicate rows from a DataFrame. Unlike dropDuplicates(), which can focus on specific columns, distinct() considers the entire row when identifying duplicates.

Syntax: DataFrame.distinct()

Example: Removing All Duplicate Rows: Let’s consider a simple example where we want to remove all duplicate rows from the DataFrame, regardless of which columns contain duplicate values.

# Sample DataFrame with duplicate rows
data_with_duplicates = [("Alice", 1, "2024-01-01"), ("Alice", 1, "2024-01-01"),
                        ("Bob", 2, "2024-01-01"), ("Catherine", 3, "2024-01-01")]
df_with_duplicates = spark.createDataFrame(data_with_duplicates, columns)

# Remove all duplicate rows
df_distinct = df_with_duplicates.distinct()

# Show the result
df_distinct.show()

Output:

+---------+----------+----------+
|     Name|CustomerID|      Date|
+---------+----------+----------+
|    Alice|         1|2024-01-01|
|      Bob|         2|2024-01-01|
|Catherine|         3|2024-01-01|
+---------+----------+----------+

In this case, the distinct() method removes the exact duplicate row for Alice, ensuring that each row in the DataFrame is unique.
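
A quick sanity check, sketched below as an addition to the article’s example: comparing the row count before and after distinct() tells you how many exact duplicate rows were present.

# Compare total vs. distinct row counts to quantify exact duplicates.
total_rows = df_with_duplicates.count()
unique_rows = df_with_duplicates.distinct().count()
print(f"Exact duplicate rows removed: {total_rows - unique_rows}")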

Conclusion: Removing duplicates from a PySpark DataFrame is an essential task for ensuring data quality and integrity. PySpark provides two main methods to accomplish this:

  • dropDuplicates() allows you to remove duplicates based on specific columns, giving you control over which aspects of the data need to be unique.
  • distinct() eliminates duplicate rows entirely, ensuring every row is unique across all columns.
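
One closing note: calling dropDuplicates() with no subset compares entire rows, so it behaves the same as distinct(). A minimal sketch, reusing df_with_duplicates from the example above:

# With no subset argument, dropDuplicates() deduplicates on all columns,
# matching the behavior of distinct().
df_with_duplicates.dropDuplicates().show()
df_with_duplicates.distinct().show()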
