Find rows with multiple duplicate fields with Active Record, Rails & Postgres

Question

What is the best way to find records with duplicate values across multiple columns using Postgres and ActiveRecord?

I found this solution here:

User.find(:all, :group => [:first, :email], :having => "count(*) > 1" )

But it doesn't seem to work with Postgres. I'm getting this error:

PG::GroupingError: ERROR: column "parts.id" must appear in the GROUP BY clause or be used in an aggregate function

In regular SQL, I'd use a self-join, something like select a.id, b.id, name, email FROM user a INNER JOIN user b USING (name, email) WHERE a.id > b.id. No idea how to express that in ActiveRecord-speak.
– Craig Ringer, Commented Feb 10, 2014 at 4:48
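
To illustrate what the self-join in the comment above would return, here is a rough plain-Ruby sketch over an in-memory array. The `rows` data and the `duplicate_pairs` helper are hypothetical and only mimic the join; they are not ActiveRecord code. Note that a self-join yields pairs, so a group of three identical rows produces three pairs.

```ruby
# Hypothetical rows standing in for the table in the comment above.
rows = [
  { id: 1, name: "Joe", email: "joe@example.com" },
  { id: 2, name: "Joe", email: "joe@example.com" },
  { id: 3, name: "Joe", email: "joe@example.com" },
  { id: 4, name: "Jim", email: "jim@example.com" }
]

# Mimics: a INNER JOIN b USING (name, email) WHERE a.id > b.id
def duplicate_pairs(rows)
  rows.product(rows)
      .select { |a, b| a[:id] > b[:id] && a.values_at(:name, :email) == b.values_at(:name, :email) }
      .map { |a, b| [a[:id], b[:id]] }
end

pairs = duplicate_pairs(rows)
puts pairs.inspect
```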

Jeremie Ges · Accepted Answer · 2018-12-09 11:38:24Z

297

Tested & Working Version

User.select(:first,:email).group(:first,:email).having("count(*) > 1")

Also, this is a little unrelated but handy. If you want to see how many times each combination was found, put .size at the end:

User.select(:first,:email).group(:first,:email).having("count(*) > 1").size

and you'll get a result set back that looks like this:

{[nil, nil]=>512,
 ["Joe", "[email protected]"]=>23,
 ["Jim", "[email protected]"]=>36,
 ["John", "[email protected]"]=>21}

Thought that was pretty cool and hadn't seen it before.
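
For readers wondering what that grouped count computes, the sketch below mimics it in plain Ruby on an in-memory array. The `rows` data and the `duplicate_counts` helper are made up for illustration and are not part of ActiveRecord; the real query runs inside the database.

```ruby
# Hypothetical in-memory rows standing in for the users table.
rows = [
  { id: 1, first: "Joe", email: "joe@example.com" },
  { id: 2, first: "Joe", email: "joe@example.com" },
  { id: 3, first: "Jim", email: "jim@example.com" },
  { id: 4, first: "Joe", email: "other@example.com" }
]

# Mimics GROUP BY first, email HAVING count(*) > 1 with a count per group.
# Returns a hash keyed by the array of grouped column values.
def duplicate_counts(rows, columns)
  rows
    .group_by { |row| row.values_at(*columns) }
    .transform_values(&:size)
    .select { |_key, count| count > 1 }
end

counts = duplicate_counts(rows, [:first, :email])
puts counts.inspect
```

The keys are arrays because two columns are grouped, which matches the shape of the result set shown above.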

Credit to Taryn; this is just a tweaked version of her answer.

edited Dec 9, 2018 at 11:38

Jeremie Ges


answered Feb 14, 2014 at 2:08

newUserNameHere


10 Comments

Rafael Oliveira Over a year ago

I had to pass an explicit array to select(), as in: User.select([:first,:email]).group(:first,:email).having("count(*) > 1").count in order for it to work.

Magne Over a year ago

Adding .count gives PG::UndefinedFunction: ERROR: function count

Serhii Nadolynskyi Over a year ago

You can try User.select([:first,:email]).group(:first,:email).having("count(*) > 1").map.count

Ashbury Over a year ago

I'm using the same method but also trying to get the User.id. Adding it to the select and group returns an empty array. How can I return the whole User model, or at least include the :id?

Jade Hamel Over a year ago

Use .size instead of .count

Taryn East · Accepted Answer · 2014-02-16 11:32:16Z

44

That error occurs because PostgreSQL requires every column in the SELECT list to either appear in the GROUP BY clause or be wrapped in an aggregate function.

Try:

User.select(:first,:email).group(:first,:email).having("count(*) > 1").all

(note: not tested, you may need to tweak it)

EDITED to remove id column

edited Feb 16, 2014 at 11:32

answered Feb 10, 2014 at 4:42

Taryn East


2 Comments

Craig Ringer Over a year ago

That's not going to work; the id column is not part of the group, so you cannot refer to it unless you aggregate it (e.g. array_agg(id) or json_agg(id)).

PaddyDwyer Over a year ago

Just to add onto the comment: the above would become User.select("array_agg(id) as ids").select(:first,:email).group(:first,:email).having("count(*) > 1").

Ben Aubin · Accepted Answer · 2017-12-04 01:08:45Z

20

If you need the full models, try the following (based on @newUserNameHere's answer).

User.where(email: User.select(:email).group(:email).having("count(*) > 1").select(:email))

This will return the rows where the email address of the row is not unique.

I'm not aware of a way to do this over multiple attributes.
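
One possible way to cover several attributes is a row-value `IN` comparison in SQL, which PostgreSQL supports. The sketch below only builds the SQL string; `duplicate_rows_sql` is a hypothetical helper written for this illustration, not an ActiveRecord API, and it has not been run against a real schema.

```ruby
# Hypothetical helper: builds SQL that returns full rows whose combination of
# `columns` occurs more than once. Names are interpolated directly, so pass
# only trusted, hard-coded identifiers.
def duplicate_rows_sql(table, columns)
  column_list = columns.map { |c| %("#{c}") }.join(", ")
  <<~SQL.strip
    SELECT * FROM "#{table}"
    WHERE (#{column_list}) IN (
      SELECT #{column_list} FROM "#{table}"
      GROUP BY #{column_list}
      HAVING COUNT(*) > 1
    )
  SQL
end

sql = duplicate_rows_sql("users", [:first, :email])
puts sql
```

In a Rails app the string could be passed to `User.find_by_sql(sql)` to get model instances. Rows with NULL in any of the compared columns are never matched by this kind of comparison, so a group like `[nil, nil]` would not be returned.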

answered Dec 4, 2017 at 1:08

Ben Aubin


3 Comments

chet corey Over a year ago

``` User.where(email: User.select(:email).group(:email).having("count(*) > 1")) ```

chet corey Over a year ago

Thank you, that works great :) Also, it seems like the last .select(:email) is redundant. I think this is a little cleaner, but I could be wrong: User.where(email: User.select(:email).group(:email).having("count(*) > 1"))

Yshmarov Over a year ago

perfect! this solution finds ActiveRecord instances. just what I was looking for

nikolayp · Accepted Answer · 2018-10-24 12:09:26Z

8

Get all duplicates with a single query if you use PostgreSQL:

def duplicated_users
  duplicated_ids = User
    .group(:first, :email)
    .having("COUNT(*) > 1")
    .select('unnest((array_agg("id"))[2:])')

  User.where(id: duplicated_ids)
end

irb> duplicated_users
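
A note on the `[2:]` slice: PostgreSQL arrays are 1-indexed, so the slice drops the first id of each group and the method returns only the extra copies, not every member of a duplicate group. The plain-Ruby sketch below mimics that step; `id_groups` and `surplus_ids` are hypothetical names used for illustration.

```ruby
# Hypothetical id groups, as array_agg("id") might return them per (first, email) group.
id_groups = [[1, 2], [35, 59, 60]]

# Equivalent of unnest(ids[2:]): skip the first id in each group, flatten the rest.
def surplus_ids(id_groups)
  id_groups.flat_map { |ids| ids.drop(1) }
end

extras = surplus_ids(id_groups)
puts extras.inspect
```

Without an ORDER BY inside `array_agg`, the order of ids within a group is not guaranteed, so which row is treated as the one to keep may vary. Something like `array_agg("id" ORDER BY "id")` should make it deterministic.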

answered Oct 24, 2018 at 12:09

nikolayp


thisismydesign · Accepted Answer · 2022-06-21 15:46:28Z

3

I struggled to get proper User models returned via the accepted answer. Here's how:

User
  .group(:first, :email)
  .having("COUNT(*) > 1")
  .select('array_agg("id") as ids')
  .map(&:ids)
  .map { |group| group.map { |id| User.find(id) } }

This will return proper models you can interact with as:

[
  [User#1, User#2],
  [User#35, User#59],
]
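
The snippet above runs one `User.find` per id. A possible variation is to load all the records in one query and rebuild the groups from a lookup table. The sketch below shows only the regrouping step in plain Ruby, with hashes standing in for models; `regroup` is a hypothetical helper. In Rails, the records might come from something like `User.where(id: id_groups.flatten)`.

```ruby
# Hypothetical stand-ins: id groups from array_agg, and records loaded in one query.
id_groups = [[1, 2], [35, 59]]
records = [1, 2, 35, 59].map { |id| { id: id, first: "user#{id}" } }

# Build the id lookup once, then rebuild each group from it.
def regroup(id_groups, records)
  by_id = records.to_h { |record| [record[:id], record] }
  id_groups.map { |ids| ids.map { |id| by_id.fetch(id) } }
end

grouped = regroup(id_groups, records)
puts grouped.map { |group| group.map { |r| r[:id] } }.inspect
```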

answered Jun 21, 2022 at 15:46

thisismydesign


Dorian · Accepted Answer · 2021-10-02 05:02:58Z

0

Works well in raw SQL:

# select array_agg(id) from attendances group by event_id, user_id having count(*) > 1;
   array_agg   
---------------
 {3712,3711}
 {8762,8763}
 {7421,7420}
 {13478,13477}
 {15494,15493}

answered Oct 2, 2021 at 5:02

Dorian


J_McCaffrey · Accepted Answer · 2022-04-14 21:05:22Z

Building on @itsnikolay's answer above, but as a method that you can pass any ActiveRecord scope to:

# Pass in a scope and a list of columns to group by.
# Call map(&:dupe_ids) on the result to see your list.
def duplicate_row_ids(ar_scope, attrs)
  ar_scope
    .group(attrs)
    .having("COUNT(*) > 1")
    .select('array_agg("id") as dupe_ids')
end

# Initial scope to narrow where you want to look for dupes.
ar_scope = ProductReviews.where(product_id: "194e676b-741e-4143-a0ce-10cf268290bb", status: "Rejected")
# Pass the scope and the list of columns to group by.
results = duplicate_row_ids(ar_scope, [:nickname, :overall_rating, :source, :product_id, :headline, :status])
# Get your list.
id_pairs = results.map(&:dupe_ids)
# Each entry is an array of ids;
# go through the groups and take action.

Community · Accepted Answer · 2017-05-23 10:31:31Z

-1

Based on the answer above by @newUserNameHere, I believe the right way to show the count for each is:

res = User.select('first, email, count(1)').group(:first,:email).having('count(1) > 1')

res.each {|r| puts r.attributes } ; nil

edited May 23, 2017 at 10:31

CommunityBot


answered Mar 22, 2016 at 13:24

Nuno Costa
