Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Be one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now

Reply
liamcal2301
Regular Visitor

Remove Duplicate Rows based on column and keep row data

I am using a dataflow and want to achieve the following in power query m. 

I have a table with columns as shown below. The entity_number column repeats in some rows. If this occurs, I want to keep the row where the parent_entity_number is not null (has data). While doing this, I also want to keep all the rest of the columns for that row. 

I tried grouping but have only found how to keep the 2 rows I am working with.

liamcal2301_0-1734637965829.png

 

7 REPLIES 7
v-heq-msft
Community Support
Community Support

Hi @liamcal2301 ,
Thanks to all of them for their efforts.

In order to better demonstrate the implementation of the code, I added a duplicate line of data to the original data to check whether it was successful or not

Sample data

entity_number entity_name parent_entity_number parent_entity_name email phone
100 school1 200 school1D1 school100@mailinator.com 000-123-1231
100 school1   null school102@mailinator.com 000-123-1232
100 school1   null school102@mailinator.com 000-123-1232
105 school2   null school105@mailinator.com 000-123-1232
105 school2 300 school1D2 school103@mailinator.com 000-123-1232
105 school2   null sd111@mailinator.com 000-123-1232
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMjQwUNJRKk7OyM/PMQSyjJD5LoYItoGBQ25iZk5mXmJJfpFecn4uUMrAwEDX0MgYhA2VYnUwTcsrzclBUDCTjPCaZER1k0zhOoxwmmRKhknGKGFlhGSaMeXuSjE0NCRkSiwA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [entity_number = _t, entity_name = _t, parent_entity_number = _t, parent_entity_name = _t, email = _t, phone = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"entity_number", Int64.Type}, {"entity_name", type text}, {"parent_entity_number", Int64.Type}, {"parent_entity_name", type text}, {"email", type text}, {"phone", type text}}),
    FilteredRows = Table.SelectRows(#"Changed Type", each [parent_entity_number] <> null),
    NullParentRows = Table.SelectRows(#"Changed Type", each [parent_entity_number] = null),
    GroupedRows = Table.Group(NullParentRows, {"entity_number", "entity_name", "email", "phone"}, {{"AllData", each _, type table [entity_number=Int64.Type, entity_name=Text.Type, parent_entity_number=Int64.Type, parent_entity_name=Text.Type, email=Text.Type, phone=Text.Type]}}),
    DistinctRows = Table.TransformColumns(GroupedRows, {"AllData", each Table.Distinct(_)}),
    ExpandedRows = Table.ExpandTableColumn(DistinctRows, "AllData", { "parent_entity_number", "parent_entity_name"}),
    FinalTable = Table.Combine({FilteredRows, ExpandedRows}),
    #"Sorted Rows" = Table.Sort(FinalTable,{{"entity_number", Order.Ascending}})
in
    #"Sorted Rows"

Final output

vheqmsft_0-1734917572643.png

 

Best regards,
Albert He


If this post helps, then please consider Accept it as the solution to help the other members find it more quickly

 

PwerQueryKees
Impactful Individual
Impactful Individual

Starting with this:

PwerQueryKees_0-1734793648426.png

Using this code:

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"entity_number", Int64.Type}, {"entity_name", type text}, {"parent_entity_number", Int64.Type}, {"parent_entity_name", type text}, {"email", type text}, {"phone", type text}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Is Final Row", each Table.RowCount(Table.SelectRows(#"Changed Type", (row) => row[entity_number] = [entity_number] and [parent_entity_number] <> null)) > 0),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Is Final Row] = true))
in
    #"Filtered Rows"

Produced this:

PwerQueryKees_1-1734793764497.png

 

NOTE

When there are multiple entity_number with a populated parent_entity_number, you will get them both.

In your sample data there is no way to decide which one to keep.
I could imagine that:

  • the highest entity_number should be kept
    This is slightly more complicated of course, but doable.
  • that the final entity_number after a chain of parents should be kept.
    Like 100 is the parenet of 200 is the parent of 150. You only want to keep 150.
    This is even more complicated and will be very slow on large volumes. But is is doable.

Did I answer your question? Then please mark my post as the solution and make it easier to find for others having a similar problem.
If I helped you, please click on the Thumbs Up to give Kudos.

 

Kees Stolker

A big fan of Power Query and Excel

ZhangKun
Resolver III
Resolver III

I think what you mean is: if there are multiple identical entity_numbers, keep the row with "parent_entity_number<>null"; if there is only one entity_number, keep the row directly.

If this is the case, you can do it as follows:

1. Group the entity_number column

2. Determine whether the table in the grouping column has only one row. If there is only one row, keep it, otherwise select only the row with "parent_entity_number<>null"

3. Delete other rows outside the grouping row

4. Expand the grouping column

The code for the grouping part is similar to the following:

 

Table.Group(
        Your_Table, 
        "entity_number", 
        {
            "Group", 
            each if Table.RowCount(_) = 1 then _ else Table.SelectRows(_, each [parententity_number] <> null)
        }
    )

 

Omid_Motamedise
Memorable Member
Memorable Member

Sort the table based on parent_entity_number in ASSENDING, BY DOING THIS THE FORMULA AS Table.Sort(...) is created, add Table.Buffer to it and rewrite it as Table.Buffer(Table.Sort(.....)) then in the next step, right click on entity_number and pick remove duplicate.

Sorry if i left out this part to add some context.

If a row has entity_number 123 but no parent_entity_number, and it is not duplicated so only 1 row with no parent, then i still would like to keep that row as well.

jgeddes
Super User
Super User

If you filter the parent_entity_number column to not include null values does that give you your desired result?
The code would be 
Table.SelectRows(tableName, each [parent_entity_number] <> null)




Did I answer your question? Mark my post as a solution!

Proud to be a Super User!





No, I would still want to keep the row even if it does not include a parent_entity_number.

Helpful resources

Announcements
Las Vegas 2025

Join us at the Microsoft Fabric Community Conference

March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!

Dec Fabric Community Survey

We want your feedback!

Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions.

ArunFabCon

Microsoft Fabric Community Conference 2025

Arun Ulag shares exciting details about the Microsoft Fabric Conference 2025, which will be held in Las Vegas, NV.

December 2024

A Year in Review - December 2024

Find out what content was popular in the Fabric community during 2024.