How to remove unique, then duplicate dictionaries in a list?

Given the following list that contains some duplicate and some unique dictionaries, what is the best method to remove unique dictionaries first, then reduce the duplicate dictionaries to single instances? I have to say I only recently started getting into Python, but it's making this project so much easier. I'm just a bit stumped on this kind of problem.

So my list looks like this:

[{'file': u'/file.txt',
  'line': u'line 666',
  'rule': u'A DUPLICATE RULE'},

 {'file': u'/file.txt',
  'line': u'line 666',
  'rule': u'A DUPLICATE RULE'},

 {'file': u'/uniquefile.txt',
  'line': u'line 999',
  'rule': u'A UNIQUE RULE'}]

What I'm going for is that, in the end, the list should look like:

[{'file': u'/file.txt',
  'line': u'line 666',
  'rule': u'A DUPLICATE RULE'}]
Best answer

One idea is to sort the data. Assume inputdata is your list from above:

from itertools import groupby
from operator import itemgetter

inputdata.sort(key=itemgetter(*inputdata[0])) # ensures order
print [k for k, g in groupby(inputdata) if len(list(g)) > 1]

The sort matters because groupby only groups adjacent equal items. This prints:

[{'line': u'line 666', 'file': u'/file.txt', 'rule': u'A DUPLICATE RULE'}]
Other answers

I always prefer to work with objects instead of dicts, if the fields are the same for every item.

So, I define a class:

class rule(object):
    def __init__(self, file, line, rule):
        self.file = file
        self.line = line
        self.rule = rule

    #Not a "magic" method, just a helper for all the methods below :)
    def _tuple_(self):
        return (self.file, self.line, self.rule)

    def __eq__(self, other):
        return cmp(self, other) == 0

    def __cmp__(self, other):
        return cmp(self._tuple_(), other._tuple_())

    def __hash__(self):
        return hash(self._tuple_())

    def __repr__(self):
        return repr(self._tuple_())

Now, create a list of these objects, and sort it. ruledict_list can be the example data in your question.

rules = [rule(**r) for r in ruledict_list]
rules.sort()

Loop through the (sorted) list, removing unique objects as we go. Finally, create a set, to remove duplicates. The loop will also remove one of each duplicate object, but that doesn't really matter.

pos = 0
while(pos < len(rules)):
    while pos < len(rules)-1 and rules[pos] == rules[pos+1]:
        print "Skipping rule %s" % rules[pos]
        pos+=1
    rules.pop(pos)
rule_set = set(rules)
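
For reference, a quick sketch of the end result, assuming ruledict_list held the example data from the question (the tuple shown comes from the __repr__ defined above):

print rule_set
# With the example data this should print something like:
# set([(u'/file.txt', u'line 666', u'A DUPLICATE RULE')])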

I'd make another dictionary, using the existing dictionaries as keys and the count of occurrences as values. (Python doesn't allow dictionaries to be used as dictionary keys out of the box, but there are a couple of ways of doing that mentioned in this answer.) Then it's just a matter of iterating over it and selecting the keys where the value is greater than 1.

Of course, using dictionaries as keys relies on their contents not changing over time - at least over the time that you need to use the resulting dictionary. (This is why Python doesn't support it natively.)
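
As a rough sketch of that idea (one possible way, not necessarily the best), a tuple of the dict's sorted items can serve as the hashable stand-in, assuming the values themselves are hashable and inputdata is the list from the question:

counts = {}
for d in inputdata:
    key = tuple(sorted(d.items()))   # hashable stand-in for the dict
    counts[key] = counts.get(key, 0) + 1

# keep only the dictionaries that occurred more than once
print [dict(key) for key, count in counts.items() if count > 1]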

Another way is to build a counter for each dict, keyed on a frozenset of its items:

from operator import itemgetter
from collections import defaultdict

counter = defaultdict(int)
for d in inputdata:
    counter[frozenset(d.iteritems())] += 1

result = [dict(item) for item, count in counter.iteritems() if count > 1]
print result
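# With the example data, result holds just the single duplicated dictionary, e.g.
# [{'line': u'line 666', 'file': u'/file.txt', 'rule': u'A DUPLICATE RULE'}]
# (dictionary key order may vary)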

I think that is the best answer so far, because it is very simple to understand and runs in linear time.

This answer is based on Steven Huwig's answer. It's similar to his, but I use sorted() on the list so that groupby() works correctly.

Also, since he said "There's probably a more optimal way to check this than len(list(a[1]))", I decided to use a different way to check for non-unique items. Instead of forcing the whole list, I try to call the .next() method on the iterator, twice. If it works twice, there are at least two items in the iterator and we are done with it; if we get a StopIteration exception on the first or second call to .next(), there were zero or one items in the iterator. (Actually, since we got this iterator from itertools.groupby, we know it will have at least one item in it.)

Also, instead of using explicit tuple indexing like a[0] and a[1], I used tuple unpacking, since that's what the cool kids seem to be doing these days.

Finally, instead of using a generator expression to compute the list, and using list() to force it to expand out into a list, I simply used a list comprehension.

data = [
    {
        'file': u'/file.txt',
        'line': u'line 666',
        'rule': u'A DUPLICATE RULE'
    },

    {
        'file': u'/uniquefile.txt',
        'line': u'line 999',
        'rule': u'A UNIQUE RULE'
    },

    {
        'file': u'/file.txt',
        'line': u'line 666',
        'rule': u'A DUPLICATE RULE'
    },
]

from itertools import groupby

def notunique(itr):
    try:
        itr.next()
        itr.next()
        return True
    except StopIteration:
        return False

def unique_list(lst):
    return [key for key, itr in groupby(sorted(lst)) if notunique(itr)]

print(unique_list(data))
>>> import itertools
>>> list(a[0] for a in itertools.groupby(sorted(data)) if len(list(a[1])) > 1)
[{'file': u'/file.txt', 'line': u'line 666', 'rule': u'A DUPLICATE RULE'}]

There's probably a more optimal way to check this than len(list(a[1])).

Edit: I added a call to sorted.
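
One possible alternative (just a sketch, not part of the original answer) would be to pull at most two items from each group with itertools.islice instead of building the whole list:

from itertools import groupby, islice

def has_duplicates(group):
    # islice stops after two items, so the rest of the group is never materialised
    return len(list(islice(group, 2))) > 1

print [key for key, group in groupby(sorted(data)) if has_duplicates(group)]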

Another option is to create your own data structure instead of using a dict. If you do this, then you can override __cmp__, __eq__ and __hash__. This will give you the ability to then use the set data type in all its glory.

Here's one possible implementation, though I make no promises about the quality of the hash routine I've provided:

class Thing(object):
    def __init__(self, file, line, rule):
        self.file = file
        self.line = line
        self.rule = rule

    def __cmp__(self, other):
        result = cmp(self.file, other.file)
        if result == 0:
            result = cmp(self.line, other.line)
        if result == 0:
            result = cmp(self.rule, other.rule)
        return result

    def __eq__(self, other):
        return cmp(self, other) == 0

    def __hash__(self):
        return hash(self.file) * hash(self.line) * hash(self.rule)

    def __str__(self):
        return ', '.join([self.file, self.line, self.rule])

things = [ Thing(u'/file.txt', u'line 666', u'A DUPLICATE RULE'),
  Thing(u'/file.txt', u'line 666', u'A DUPLICATE RULE'),
  Thing(u'/uniquefile.txt', u'line 999', u'A UNIQUE RULE')]

duplicate_things = set()
unique_things = set()
for t in things:
    if t in unique_things:
        duplicate_things.add(t)
    else:
        unique_things.add(t)

If you need to get back to a list, just construct one from the resulting set:

unique_things = list(unique_things)
duplicate_things = list(duplicate_things)

It's a bit more code to create your own class like this, but it may give you other options down the road if your program grows in complexity.

Edit

OK, my hands are faster than my eyes tonight, but I think this edit solves the problem pointed out by @nosklo.




