EDIT: I have stripped the file down to the parts that cause the problem:

import pandas as pd

raw_data = {"link":
           ['https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLJ.html#cda8700ef5',
            'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLH.html#cda8700ef5',
            'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLj.html#cda8700ef5',
            'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLh.html#cda8700ef5',
            'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWU.html#9dca9667c3',
            'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWu.html#9dca9667c3',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQM.html#af24036d28',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQJ.html#af24036d28',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQm.html#af24036d28',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQj.html#af24036d28',
            'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea',
            'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea',
            'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152',
            'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152']}

df = pd.DataFrame(raw_data, columns = ["link"])

#duplicate check #1

a = print(df.iloc[12][0])
b = print(df.iloc[13][0])

if a == b:
    print("equal")

#duplicate check #2

df.duplicated()

For the first test I get the following output, which implies there is a duplicate:

https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
equal

For the second test, it seems there are no duplicates:

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
dtype: bool

Original post:

I am trying to identify duplicate values in the "Link" column of the attached file:

data file

import pandas as pd

data = pd.read_csv(r"...\consolidated.csv", sep=",")

df = pd.DataFrame(data)

del df['Unnamed: 0']

duplicate_rows = df[df.duplicated(["Link"], keep="first")]

pd.DataFrame(duplicate_rows)

#a = print(df.iloc[42657][15])
#b = print(df.iloc[42676][15])

#if a == b:
#    print("equal")

I used the code above, but the result I keep getting is that there are no duplicates. I checked the file in Excel, and there should be seven duplicate instances. I even selected specific cells for a quick check (the part commented out with #), and those values were reported as equal. Yet duplicated() does not capture them.

I have been scratching my head for a good hour and still have no idea what I'm missing. Help appreciated!

  • Include relevant test data (ideally as executable code that creates the dataframe) in the question, not as a link. Check that it shows your problem. Commented Dec 9, 2019 at 18:11
  • I have included the specific bits that cause me trouble in the edit, hope it clarifies the question. Commented Dec 10, 2019 at 12:59

2 Answers

1

I had the same problem, and converting the column of the dataframe to "str" helped.

For example:

df['link'] = df['link'].astype(str)
duplicate_rows = df[df.duplicated(["link"], keep="first")]
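A minimal sketch of when this conversion makes a difference, assuming the column holds the same value once as a number and once as text. The data below is invented for illustration and is not taken from the question's file. Note that the conversion does not make the comparison case-insensitive, so strings that differ only in letter case still count as different.

```python
import pandas as pd

# Hypothetical column: the same ID appears once as an int and once as a string.
df = pd.DataFrame({"link": [101, "101", "abc", "abc"]})

# Before the conversion, 101 (int) and "101" (str) are different objects,
# so only the repeated "abc" is flagged.
before = df.duplicated(["link"], keep="first")

# After converting to str, both IDs become the string "101".
df["link"] = df["link"].astype(str)
after = df.duplicated(["link"], keep="first")

duplicate_rows = df[after]
print(before.tolist())
print(after.tolist())
```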

0

First, you don't need df = pd.DataFrame(data), because data = pd.read_csv(r"...\consolidated.csv", sep=",") already returns a DataFrame.
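A quick way to confirm this, sketched with an in-memory stand-in for the CSV (the contents are made up, since the real file is not shown). The index_col=0 argument is an optional extra, on the assumption that 'Unnamed: 0' is just a saved index; it avoids having to delete that column afterwards.

```python
import io

import pandas as pd

# Stand-in for the file on disk; the first, unnamed column is a saved index.
csv_text = ",Link\n0,https://example.com/a\n1,https://example.com/b\n"

# read_csv returns a DataFrame directly. index_col=0 uses the unnamed first
# column as the index, so no 'Unnamed: 0' column is created.
data = pd.read_csv(io.StringIO(csv_text), sep=",", index_col=0)

is_frame = isinstance(data, pd.DataFrame)
print(is_frame, list(data.columns))
```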

As for removing the duplicates, see the drop_duplicates method in the pandas documentation.
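A rough sketch of drop_duplicates on a few of the links from the question. Two things stand out in that sample. The paired links differ only in the letter case of the ID (for example ...ID43q4X versus ...ID43q4x), so an exact comparison treats them as distinct, while a case-insensitive tool may well report them as duplicates. Also, print() returns None, so the manual check a == b compares None with None and always prints "equal". The link_key column name below is made up for illustration, and whether links that differ only in case really point to the same offer is an assumption you would need to verify.

```python
import pandas as pd

# Two pairs taken from the question; each pair differs only in letter case.
links = [
    "https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea",
    "https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea",
    "https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152",
    "https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152",
]
df = pd.DataFrame({"link": links})

# print() returns None, so this reproduces the misleading "equal" result.
a = print(df["link"].iloc[2])
b = print(df["link"].iloc[3])
none_check = a == b  # None == None

# Comparing the values themselves shows they are not equal.
values_equal = df["link"].iloc[2] == df["link"].iloc[3]

# Exact matching: nothing is dropped.
exact = df.drop_duplicates(subset=["link"], keep="first")

# Case-insensitive matching via a temporary lowercased key (hypothetical name).
df["link_key"] = df["link"].str.lower()
deduped = df.drop_duplicates(subset=["link_key"], keep="first").drop(columns="link_key")

print(none_check, values_equal, len(exact), len(deduped))
```

If only the rows that would be dropped are of interest, df.duplicated(["link_key"], keep="first") on the same key gives the matching boolean mask.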

Hope this helps.
