Marking duplicates in a CSV file

I'm stumped with a problem illustrated in the sample below:


I want to detect duplicates in the "PHONE" column and mark each subsequent duplicate by setting its "REF" column to the "ID" of the first item and its "DISCARD" column to "Yes".


So, how do I go about it? I tried this code, but my logic wasn't right, of course.

import csv
myfile = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")
myfile1 = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")

dest = csv.writer(open("C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

reader = csv.reader(myfile)
verum = list(reader)
verum.sort(key=lambda x: x[2])
for i, row in enumerate(verum):
    if row[2] == verum[i][2]:
        verum[i][3] = row[0]

print verum

Your direction and help would be much appreciated.


The only thing you have to keep in memory while this is running is a map of phone numbers to their IDs.

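import csv

map = {}
with open(r'c:\temp\input.csv', 'r') as fin:
    reader = csv.reader(fin)
    with open(r'c:\temp\output.csv', 'w') as fout:
        writer = csv.writer(fout)
        # copy the header row through unchanged;
        # omit this if the file has no header row
        writer.writerow(next(reader))
        for row in reader:
            (id, name, phone, ref, discard) = row
            if phone in map:
                # duplicate: point REF at the first ID seen for this phone
                ref = map[phone]
                discard = "YES"
            else:
                # first occurrence: remember its ID
                map[phone] = id
            writer.writerow((id, name, phone, ref, discard))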

Sounds like homework. Since this is a CSV file (and thus changing the record size in place is next to impossible), you are best off loading the whole file into memory and manipulating it there before writing it out to a new file. Create a list of strings holding the original lines of the file. Then create a map from phone number (the key) to ID (the value). Before each insert, look up the number: if it already exists, update the line containing the duplicate phone number. If it isn't already in the map, insert the (phone, id) pair.
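A minimal sketch of that in-memory approach, written for Python 3. It assumes every row is [ID, NAME, PHONE, REF, DISCARD] with no header row; the function name `mark_duplicates`, the dict name `first_id_by_phone`, and the sample rows are made up for illustration, and an `io.StringIO` stands in for the real input file.

```python
import csv
import io

def mark_duplicates(rows):
    """Return new rows with REF/DISCARD filled in for repeated phone numbers.

    Assumption: each row is [ID, NAME, PHONE, REF, DISCARD], no header row.
    """
    first_id_by_phone = {}  # phone number -> ID of the first row that had it
    marked = []
    for row_id, name, phone, ref, discard in rows:
        if phone in first_id_by_phone:
            # Duplicate: point REF at the first occurrence and flag the row.
            ref, discard = first_id_by_phone[phone], "Yes"
        else:
            # First time this phone number is seen: remember its ID.
            first_id_by_phone[phone] = row_id
        marked.append([row_id, name, phone, ref, discard])
    return marked

# Hypothetical sample data standing in for the input file.
sample = "1,JOHN,12345,,\n2,PETER,6232,,\n3,JON,12345,,\n"
rows = list(csv.reader(io.StringIO(sample)))  # whole "file" loaded into memory
result = mark_duplicates(rows)
for row in result:
    print(row)
```

Only the third row changes here, because it is the only one repeating an earlier phone number.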

I know one thing. I know you don't have to read the entire file into memory to accomplish this.

import csv
myfile = r"C:\Users\Eduardo\Documents\TEST2.csv"

dest = csv.writer(open(r"C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

phonedict = {}

for row in csv.reader(open(myfile, "r")):
    # setdefault sets the value to the second argument if it hasn't been set, and then
    # returns what the value in the dictionary is.
    firstid = phonedict.setdefault(row[2], row[0])
    row[3] = firstid
    # compare by value, not identity: equal strings need not be the same object
    if firstid != row[0]:
        row[4] = "Yes"
    dest.writerow(row)

from operator import itemgetter
from itertools import groupby

import csv
verum = csv.reader(open('data.csv', 'rb'))

def grouper(verum):
    # groupby only merges adjacent rows, so sort by phone number first;
    # sorted() is stable, so rows keep their file order within each group
    for key, grp in groupby(sorted(verum, key=itemgetter(2)), itemgetter(2)):
        # key = phone number, grp = records with that number
        first = next(grp)
        # first item gets its id written into the 4th column
        yield [first[0], first[1], first[2], first[0], '']  # or list(itemgetter(0,1,2,0,4)(first))
        for x in grp:
            # all others get the first item's id as ref
            yield [x[0], x[1], x[2], first[0], "Yes"]

# IDs are strings here, so this final sort is lexicographic
for line in sorted(grouper(verum), key=itemgetter(0)):
    print line


['1', 'JOHN', '12345', '1', '']
['2', 'PETER', '6232', '2', '']
['3', 'JON', '12345', '1', 'Yes']
['4', 'PETERSON', '6232', '2', 'Yes']
['5', 'ALEX', '7854', '5', '']
['6', 'JON', '12345', '1', 'Yes']

Writing the data back is left to the reader ;-)
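For anyone who wants that last step spelled out, here is one possible sketch in Python 3. It is not the answerer's code: `write_marked` is a hypothetical helper, the three sample rows are invented, and an `io.StringIO` stands in for the output file. It also assumes the IDs are numeric, so that the final sort can restore ID order.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

def grouper(rows):
    # groupby only merges adjacent rows, so sort by phone number first.
    # sorted() is stable, so the earliest row of each group stays first.
    by_phone = sorted(rows, key=itemgetter(2))
    for _, grp in groupby(by_phone, key=itemgetter(2)):
        first = next(grp)
        # The first row of a group references its own ID and is kept.
        yield [first[0], first[1], first[2], first[0], ""]
        for dup in grp:
            # Later rows reference the first row's ID and are discarded.
            yield [dup[0], dup[1], dup[2], first[0], "Yes"]

def write_marked(rows, out_file):
    # Hypothetical helper: restore ID order, then write all rows as CSV.
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerows(sorted(grouper(rows), key=lambda r: int(r[0])))

# Invented sample rows: [ID, NAME, PHONE, REF, DISCARD]
rows = [
    ["1", "JOHN", "12345", "", ""],
    ["2", "PETER", "6232", "", ""],
    ["3", "JON", "12345", "", ""],
]
buffer = io.StringIO()  # stands in for a real output file
write_marked(rows, buffer)
output_text = buffer.getvalue()
print(output_text)
```

With a real file, the same helper could be called with a handle opened via `open(path, "w", newline="")`, which is how the `csv` module documentation recommends opening files for writing in Python 3.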

