Python Pandas: Concatenate Strings from Multiple Rows after GroupBy
When working with Pandas DataFrames, a common data aggregation task after grouping is to concatenate string values from multiple rows within each group into a single string (or a list of strings). For example, you might want to compile a comma-separated list of all products purchased by each customer or all tasks assigned to an employee on a given date.
This guide explains how to effectively use groupby()
in conjunction with methods like transform()
, apply()
, and agg()
to concatenate strings from multiple rows.
The Goal: String Aggregation within Groups
Given a Pandas DataFrame, we want to:
- Group the rows based on the values in one or more columns.
- For each group, take the string values from another specified column.
- Concatenate these string values into a single string (e.g., separated by commas, spaces, or another delimiter) or into a list of strings for each group.
Example DataFrame
import pandas as pd
data = {
'OrderID': [101, 101, 102, 103, 103, 103, 104],
'Customer': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Charlie', 'Alice'],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk', 'Bread', 'Orange'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Dairy', 'Bakery', 'Fruit']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
OrderID Customer Product Category
0 101 Alice Apple Fruit
1 101 Alice Banana Fruit
2 102 Bob Orange Fruit
3 103 Charlie Apple Fruit
4 103 Charlie Milk Dairy
5 103 Charlie Bread Bakery
6 104 Alice Orange Fruit
We might want to get a comma-separated list of all 'Product'
s for each 'OrderID'
or 'Customer'
.
Method 1: Using groupby().transform(','.join)
(Adds Concatenated String to Original DataFrame Shape)
The transform()
method is useful when you want to perform a group-wise calculation and then broadcast the result back to a Series or DataFrame that has the same index as your original DataFrame.
Grouping by a Single Column
Let's concatenate 'Product' names for each 'OrderID'.
import pandas as pd
df_example = pd.DataFrame({
'OrderID': [101, 101, 102, 103, 103],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk']
})
# Group by 'OrderID', then transform the 'Product' column by joining strings
# This creates a new Series with the same index as df_example
df_example['Products_In_Order'] = df_example.groupby('OrderID')['Product'].transform(lambda x: ','.join(x))
print("DataFrame with concatenated products per order (using transform):")
print(df_example)
print()
# To get unique rows after transform:
df_unique_orders = df_example[['OrderID', 'Products_In_Order']].drop_duplicates().reset_index(drop=True)
print("Unique orders with concatenated products:")
print(df_unique_orders)
Output:
DataFrame with concatenated products per order (using transform):
OrderID Product Products_In_Order
0 101 Apple Apple,Banana
1 101 Banana Apple,Banana
2 102 Orange Orange
3 103 Apple Apple,Milk
4 103 Milk Apple,Milk
Unique orders with concatenated products:
OrderID Products_In_Order
0 101 Apple,Banana
1 102 Orange
2 103 Apple,Milk
df.groupby('OrderID')['Product']
: Selects the 'Product' Series for each 'OrderID' group.
.transform(lambda x: ','.join(x))
: For each group's 'Product' Series (x
), it joins all strings with a comma. The result is aligned back to the original DataFrame's index.
drop_duplicates()
: Necessary if you want one row per group in the final output after using transform
this way.
Grouping by Multiple Columns
Concatenate 'Product' for each ('Customer', 'Category') combination.
import pandas as pd
data = {
'OrderID': [101, 101, 102, 103, 103, 103, 104],
'Customer': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Charlie', 'Alice'],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk', 'Bread', 'Orange'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Dairy', 'Bakery', 'Fruit']
}
df = pd.DataFrame(data)
df_multi_group = df.copy()
df_multi_group['Products_Per_Customer_Category'] = df_multi_group.groupby(
['Customer', 'Category']
)['Product'].transform(lambda x: ' | '.join(x)) # Using a different separator
print("DataFrame with concatenated products per customer & category (transform):")
print(df_multi_group)
print()
df_unique_cust_cat = df_multi_group[['Customer', 'Category', 'Products_Per_Customer_Category']].drop_duplicates().reset_index(drop=True)
print(df_unique_cust_cat)
Output:
DataFrame with concatenated products per customer & category (transform):
OrderID Customer Product Category Products_Per_Customer_Category
0 101 Alice Apple Fruit Apple | Banana | Orange
1 101 Alice Banana Fruit Apple | Banana | Orange
2 102 Bob Orange Fruit Orange
3 103 Charlie Apple Fruit Apple
4 103 Charlie Milk Dairy Milk
5 103 Charlie Bread Bakery Bread
6 104 Alice Orange Fruit Apple | Banana | Orange
Customer Category Products_Per_Customer_Category
0 Alice Fruit Apple | Banana | Orange
1 Bob Fruit Orange
2 Charlie Fruit Apple
3 Charlie Dairy Milk
4 Charlie Bakery Bread
Method 2: Using groupby().apply(','.join)
(Creates an Aggregated Series/DataFrame)
The apply()
method on a GroupBy object can also be used. When applied to a grouped Series (like df.groupby('ColA')['ColB']
), it passes each group (as a Series) to the function.
Grouping by a Single Column
import pandas as pd
df_example = pd.DataFrame({
'OrderID': [101, 101, 102, 103, 103],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk']
})
# Group by 'OrderID' and apply ','.join to the 'Product' Series of each group
aggregated_products_apply = df_example.groupby('OrderID')['Product'].apply(lambda x: ','.join(x))
# This returns a Series where the index is 'OrderID' and values are concatenated strings
print("Concatenated products per order (using apply):")
print(aggregated_products_apply)
print()
# To get a DataFrame:
aggregated_products_df_apply = aggregated_products_apply.reset_index(name='ProductList_String')
print("As a DataFrame:")
print(aggregated_products_df_apply)
Output:
Concatenated products per order (using apply):
OrderID
101 Apple,Banana
102 Orange
103 Apple,Milk
Name: Product, dtype: object
As a DataFrame:
OrderID ProductList_String
0 101 Apple,Banana
1 102 Orange
2 103 Apple,Milk
Grouping by Multiple Columns
The grouping columns become a MultiIndex in the resulting Series.
import pandas as pd
data = {
'OrderID': [101, 101, 102, 103, 103, 103, 104],
'Customer': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Charlie', 'Alice'],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk', 'Bread', 'Orange'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Dairy', 'Bakery', 'Fruit']
}
df = pd.DataFrame(data)
df_full = df.copy()
aggregated_multi_apply = df_full.groupby(['Customer', 'Category'])['Product'].apply(lambda x: ' | '.join(sorted(x.unique())))
# Using sorted(x.unique()) to get unique products in a defined order per group
print("Concatenated unique products per Customer & Category (apply):")
print(aggregated_multi_apply.reset_index(name='Product_Summary'))
Output:
Concatenated products per order (using apply):
OrderID
101 Apple,Banana
102 Orange
103 Apple,Milk
Name: Product, dtype: object
As a DataFrame:
OrderID ProductList_String
0 101 Apple,Banana
1 102 Orange
2 103 Apple,Milk
Applying list
to get a list of strings per group
If you want a list of strings instead of a single concatenated string:
import pandas as pd
data = {
'OrderID': [101, 101, 102, 103, 103, 103, 104],
'Customer': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Charlie', 'Alice'],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk', 'Bread', 'Orange'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Dairy', 'Bakery', 'Fruit']
}
df = pd.DataFrame(data)
df_full = df.copy()
list_of_products_per_group = df_full.groupby('OrderID')['Product'].apply(list)
print("List of products per OrderID (apply(list)):")
print(list_of_products_per_group)
Output:
List of products per OrderID (apply(list)):
OrderID
101 [Apple, Banana]
102 [Orange]
103 [Apple, Milk, Bread]
104 [Orange]
Name: Product, dtype: object
Method 3: Using groupby().agg(','.join)
or agg(list)
(Similar to apply
)
The agg()
(or aggregate()
) method is designed for applying one or more aggregation functions. You can pass ''.join
(or a lambda for a specific separator) or list
.
import pandas as pd
df_example = pd.DataFrame({
'OrderID': [101, 101, 102, 103, 103],
'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Milk']
})
# --- Using agg with a custom join function ---
# Aggregating products into a single string per OrderID with a custom separator (',')
aggregated_products_agg_join = df_example.groupby('OrderID')['Product'].agg(lambda x: ','.join(x))
print("Concatenated products per order (using agg):")
print(aggregated_products_agg_join.reset_index(name='Products_Agg'))
# --- Using agg with 'list' to aggregate as lists ---
aggregated_products_agg_list = df_example.groupby('OrderID')['Product'].agg(list)
print("\nList of products per order (using agg(list)):")
print(aggregated_products_agg_list.reset_index(name='Product_List_Agg'))
# --- Named Aggregation for multiple columns (Modern Syntax) ---
# Since 'Date' and 'Name' don't exist in the current DataFrame, we'll perform a simpler example
# Aggregating by 'OrderID' with the join function for 'Product'
result_named_agg = df_example.groupby(['OrderID'], as_index=False).agg(
Joined_Products=('Product', lambda x: ','.join(x))
)
print("\nNamed Aggregation (Modern Pandas Syntax):")
print(result_named_agg)
Output:
Concatenated products per order (using agg):
OrderID Products_Agg
0 101 Apple,Banana
1 102 Orange
2 103 Apple,Milk
List of products per order (using agg(list)):
OrderID Product_List_Agg
0 101 [Apple, Banana]
1 102 [Orange]
2 103 [Apple, Milk]
Named Aggregation (Modern Pandas Syntax):
OrderID Joined_Products
0 101 Apple,Banana
1 102 Orange
2 103 Apple,Milk
Choosing the Right Method
groupby().transform(separator.join)
: Use when you want the concatenated string added as a new column to your original DataFrame, aligned with its original index. You'll then likelydrop_duplicates()
to get one row per group.groupby().apply(separator.join)
orgroupby().agg(separator.join)
: Use when you want a new Series (or DataFrame afterreset_index()
) where the index represents the groups and the values are the concatenated strings. These are direct aggregation methods.- The
apply()
method is often more flexible for custom functions on groups, whileagg()
is powerful for applying multiple aggregations or named aggregations. For simply joining strings or creating lists, both work well.
- The
- Use
list
instead ofseparator.join
withapply
oragg
if you want a list of strings per group rather than a single concatenated string.
Conclusion
Concatenating strings from multiple rows within groups in a Pandas DataFrame is a common data summarization technique.
df.groupby(group_cols)[str_col].transform(lambda x: separator.join(x))
is suitable if you need the concatenated string aligned with the original DataFrame's rows (often followed bydrop_duplicates()
).df.groupby(group_cols)[str_col].apply(lambda x: separator.join(x))
ordf.groupby(group_cols)[str_col].agg(lambda x: separator.join(x))
are excellent for producing an aggregated Series or DataFrame with one concatenated string per group. Substitutelist
for the lambda function if you need lists of strings.
Choose the method based on whether you need to modify the original DataFrame structure via transform
or create a new aggregated summary. Remember to select the specific string column you want to concatenate after the groupby()
call.