Skip to main content

Dedup Component

The Dedup Component is designed to identify and remove duplicate records in a dataset, ensuring data uniqueness and accuracy.

Scenarios

When the LLM component generates datasets through prompts, duplicates may arise. These duplicates can impact the accuracy of model evaluation and fine-tuning. The Dedup Component simplifies the process of detecting and removing such duplicates.

Example Pipeline

Consider the following pipeline for generating questions based on a document:

Pipeline Structure

Capabilities

The Dedup Component can:

  1. Deduplicate Points of Interest: Remove duplicate entries from the list of points of interest.
  2. Deduplicate Generated Questions: Ensure uniqueness in generated questions.

For example, to deduplicate points of interest, the component configuration is as follows:

Dedup Component Configuration

In this configuration, the component will find and remove duplicate entries in the list_of_interest field.

Configuration

Configuration Steps

  1. Configure the Deduplication Field: Specify the field by which the component will identify and remove duplicates. The component will automatically detect and delete duplicate data based on this field's value.

Component Input

  • Type: List of dictionaries.
  • Description: Each dictionary in the list has consistent fields.

Component Output

  • Type: Processed list of dictionaries.
  • Description: The output list is deduplicated based on the values of the configured field, ensuring no duplicate entries.