Skip to main content

Python Pandas: Concatenate Strings from Multiple Rows after GroupBy

When working with Pandas DataFrames, a common data aggregation task after grouping is to concatenate string values from multiple rows within each group into a single string (or a list of strings). For example, you might want to compile a comma-separated list of all products purchased by each customer or all tasks assigned to an employee on a given date.

This guide explains how to effectively use groupby() in conjunction with methods like transform(), apply(), and agg() to concatenate strings from multiple rows.

The Goal: String Aggregation within Groups

Given a Pandas DataFrame, we want to:

  1. Group the rows based on the values in one or more columns.
  2. For each group, take the string values from another specified column.
  3. Concatenate these string values into a single string (e.g., separated by commas, spaces, or another delimiter) or into a list of strings for each group.

Example DataFrame

import pandas as pd

data = {
'OrderID': [101, 101, 102, 103, 103, 103, 104],
'Customer': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Charlie', 'Alice'],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk', 'Bread', 'Orange'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Dairy', 'Bakery', 'Fruit']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
OrderID Customer Product Category
0 101 Alice Apple Fruit
1 101 Alice Banana Fruit
2 102 Bob Orange Fruit
3 103 Charlie Apple Fruit
4 103 Charlie Milk Dairy
5 103 Charlie Bread Bakery
6 104 Alice Orange Fruit

We might want to get a comma-separated list of all 'Product's for each 'OrderID' or 'Customer'.

Method 1: Using groupby().transform(','.join) (Adds Concatenated String to Original DataFrame Shape)

The transform() method is useful when you want to perform a group-wise calculation and then broadcast the result back to a Series or DataFrame that has the same index as your original DataFrame.

Grouping by a Single Column

Let's concatenate 'Product' names for each 'OrderID'.

import pandas as pd

df_example = pd.DataFrame({
'OrderID': [101, 101, 102, 103, 103],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk']
})

# Group by 'OrderID', then transform the 'Product' column by joining strings
# This creates a new Series with the same index as df_example
df_example['Products_In_Order'] = df_example.groupby('OrderID')['Product'].transform(lambda x: ','.join(x))

print("DataFrame with concatenated products per order (using transform):")
print(df_example)
print()

# To get unique rows after transform:
df_unique_orders = df_example[['OrderID', 'Products_In_Order']].drop_duplicates().reset_index(drop=True)
print("Unique orders with concatenated products:")
print(df_unique_orders)

Output:

DataFrame with concatenated products per order (using transform):
OrderID Product Products_In_Order
0 101 Apple Apple,Banana
1 101 Banana Apple,Banana
2 102 Orange Orange
3 103 Apple Apple,Milk
4 103 Milk Apple,Milk

Unique orders with concatenated products:
OrderID Products_In_Order
0 101 Apple,Banana
1 102 Orange
2 103 Apple,Milk

df.groupby('OrderID')['Product']: Selects the 'Product' Series for each 'OrderID' group. .transform(lambda x: ','.join(x)): For each group's 'Product' Series (x), it joins all strings with a comma. The result is aligned back to the original DataFrame's index. drop_duplicates(): Necessary if you want one row per group in the final output after using transform this way.

Grouping by Multiple Columns

Concatenate 'Product' for each ('Customer', 'Category') combination.

import pandas as pd

data = {
'OrderID': [101, 101, 102, 103, 103, 103, 104],
'Customer': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Charlie', 'Alice'],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk', 'Bread', 'Orange'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Dairy', 'Bakery', 'Fruit']
}

df = pd.DataFrame(data)
df_multi_group = df.copy()

df_multi_group['Products_Per_Customer_Category'] = df_multi_group.groupby(
['Customer', 'Category']
)['Product'].transform(lambda x: ' | '.join(x)) # Using a different separator

print("DataFrame with concatenated products per customer & category (transform):")
print(df_multi_group)
print()

df_unique_cust_cat = df_multi_group[['Customer', 'Category', 'Products_Per_Customer_Category']].drop_duplicates().reset_index(drop=True)
print(df_unique_cust_cat)

Output:

DataFrame with concatenated products per customer & category (transform):
OrderID Customer Product Category Products_Per_Customer_Category
0 101 Alice Apple Fruit Apple | Banana | Orange
1 101 Alice Banana Fruit Apple | Banana | Orange
2 102 Bob Orange Fruit Orange
3 103 Charlie Apple Fruit Apple
4 103 Charlie Milk Dairy Milk
5 103 Charlie Bread Bakery Bread
6 104 Alice Orange Fruit Apple | Banana | Orange

Customer Category Products_Per_Customer_Category
0 Alice Fruit Apple | Banana | Orange
1 Bob Fruit Orange
2 Charlie Fruit Apple
3 Charlie Dairy Milk
4 Charlie Bakery Bread

Method 2: Using groupby().apply(','.join) (Creates an Aggregated Series/DataFrame)

The apply() method on a GroupBy object can also be used. When applied to a grouped Series (like df.groupby('ColA')['ColB']), it passes each group (as a Series) to the function.

Grouping by a Single Column

import pandas as pd

df_example = pd.DataFrame({
'OrderID': [101, 101, 102, 103, 103],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk']
})

# Group by 'OrderID' and apply ','.join to the 'Product' Series of each group
aggregated_products_apply = df_example.groupby('OrderID')['Product'].apply(lambda x: ','.join(x))
# This returns a Series where the index is 'OrderID' and values are concatenated strings

print("Concatenated products per order (using apply):")
print(aggregated_products_apply)
print()

# To get a DataFrame:
aggregated_products_df_apply = aggregated_products_apply.reset_index(name='ProductList_String')
print("As a DataFrame:")
print(aggregated_products_df_apply)

Output:

Concatenated products per order (using apply):
OrderID
101 Apple,Banana
102 Orange
103 Apple,Milk
Name: Product, dtype: object

As a DataFrame:
OrderID ProductList_String
0 101 Apple,Banana
1 102 Orange
2 103 Apple,Milk

Grouping by Multiple Columns

The grouping columns become a MultiIndex in the resulting Series.

import pandas as pd

data = {
'OrderID': [101, 101, 102, 103, 103, 103, 104],
'Customer': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Charlie', 'Alice'],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk', 'Bread', 'Orange'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Dairy', 'Bakery', 'Fruit']
}

df = pd.DataFrame(data)
df_full = df.copy()

aggregated_multi_apply = df_full.groupby(['Customer', 'Category'])['Product'].apply(lambda x: ' | '.join(sorted(x.unique())))
# Using sorted(x.unique()) to get unique products in a defined order per group

print("Concatenated unique products per Customer & Category (apply):")
print(aggregated_multi_apply.reset_index(name='Product_Summary'))

Output:

Concatenated products per order (using apply):
OrderID
101 Apple,Banana
102 Orange
103 Apple,Milk
Name: Product, dtype: object

As a DataFrame:
OrderID ProductList_String
0 101 Apple,Banana
1 102 Orange
2 103 Apple,Milk

Applying list to get a list of strings per group

If you want a list of strings instead of a single concatenated string:

import pandas as pd

data = {
'OrderID': [101, 101, 102, 103, 103, 103, 104],
'Customer': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Charlie', 'Alice'],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk', 'Bread', 'Orange'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Dairy', 'Bakery', 'Fruit']
}

df = pd.DataFrame(data)
df_full = df.copy()

list_of_products_per_group = df_full.groupby('OrderID')['Product'].apply(list)
print("List of products per OrderID (apply(list)):")
print(list_of_products_per_group)

Output:

List of products per OrderID (apply(list)):
OrderID
101 [Apple, Banana]
102 [Orange]
103 [Apple, Milk, Bread]
104 [Orange]
Name: Product, dtype: object

Method 3: Using groupby().agg(','.join) or agg(list) (Similar to apply)

The agg() (or aggregate()) method is designed for applying one or more aggregation functions. You can pass ''.join (or a lambda for a specific separator) or list.

import pandas as pd

df_example = pd.DataFrame({
'OrderID': [101, 101, 102, 103, 103],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk']
})

# --- Using agg with a custom join function ---
# Aggregating products into a single string per OrderID with a custom separator (',')
aggregated_products_agg_join = df_example.groupby('OrderID')['Product'].agg(lambda x: ','.join(x))

print("Concatenated products per order (using agg):")
print(aggregated_products_agg_join.reset_index(name='Products_Agg'))

# --- Using agg with 'list' to aggregate as lists ---
aggregated_products_agg_list = df_example.groupby('OrderID')['Product'].agg(list)
print("\nList of products per order (using agg(list)):")
print(aggregated_products_agg_list.reset_index(name='Product_List_Agg'))

# --- Named Aggregation for multiple columns (Modern Syntax) ---
# Since 'Date' and 'Name' don't exist in the current DataFrame, we'll perform a simpler example
# Aggregating by 'OrderID' with the join function for 'Product'
result_named_agg = df_example.groupby(['OrderID'], as_index=False).agg(
Joined_Products=('Product', lambda x: ','.join(x))
)

print("\nNamed Aggregation (Modern Pandas Syntax):")
print(result_named_agg)

Output:

Concatenated products per order (using agg):
OrderID Products_Agg
0 101 Apple,Banana
1 102 Orange
2 103 Apple,Milk

List of products per order (using agg(list)):
OrderID Product_List_Agg
0 101 [Apple, Banana]
1 102 [Orange]
2 103 [Apple, Milk]

Named Aggregation (Modern Pandas Syntax):
OrderID Joined_Products
0 101 Apple,Banana
1 102 Orange
2 103 Apple,Milk

Choosing the Right Method

  • groupby().transform(separator.join): Use when you want the concatenated string added as a new column to your original DataFrame, aligned with its original index. You'll then likely drop_duplicates() to get one row per group.
  • groupby().apply(separator.join) or groupby().agg(separator.join): Use when you want a new Series (or DataFrame after reset_index()) where the index represents the groups and the values are the concatenated strings. These are direct aggregation methods.
    • The apply() method is often more flexible for custom functions on groups, while agg() is powerful for applying multiple aggregations or named aggregations. For simply joining strings or creating lists, both work well.
  • Use list instead of separator.join with apply or agg if you want a list of strings per group rather than a single concatenated string.

Conclusion

Concatenating strings from multiple rows within groups in a Pandas DataFrame is a common data summarization technique.

  1. df.groupby(group_cols)[str_col].transform(lambda x: separator.join(x)) is suitable if you need the concatenated string aligned with the original DataFrame's rows (often followed by drop_duplicates()).
  2. df.groupby(group_cols)[str_col].apply(lambda x: separator.join(x)) or df.groupby(group_cols)[str_col].agg(lambda x: separator.join(x)) are excellent for producing an aggregated Series or DataFrame with one concatenated string per group. Substitute list for the lambda function if you need lists of strings.

Choose the method based on whether you need to modify the original DataFrame structure via transform or create a new aggregated summary. Remember to select the specific string column you want to concatenate after the groupby() call.