I have a list of 50K strings (city names) and I need the smallest list of character trigrams (preferably n-grams) such that every string is hit by at least one trigram. Take for example this list:
['amsterdam', 'rotterdam', 'haarlem', 'utrecht', 'groningen']
the list of identifying trigrams has length 4 and could be (alternatives are possible):
['ter', 'haa', 'utr', 'gro']
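To make the requirement concrete, here is a small sketch (the helper ngrams and the names names/candidate are just illustrative, not part of my solution) that checks that a candidate trigram list covers every name, i.e. each name contains at least one of the chosen trigrams:

def ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

names = ['amsterdam', 'rotterdam', 'haarlem', 'utrecht', 'groningen']
candidate = ['ter', 'haa', 'utr', 'gro']
# every name must share at least one trigram with the candidate list
print(all(ngrams(name) & set(candidate) for name in names))  # True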
I thought my solution below gave the right answer, but trying it on some other lists produced wrong results.
from collections import Counter

def identifying_grams(list, n=3):
    def f7(seq):
        # order-preserving de-duplication
        seen = set()
        seen_add = seen.add
        return [x for x in seq if not (x in seen or seen_add(x))]

    def ngrams(text, n=3):
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    hits = []
    trigrams = []
    for item in list:
        # trigrams += ngrams(item)
        trigrams += f7(ngrams(item))  # count each trigram at most once per name

    # greedy: take trigrams in order of frequency and remove the names they cover
    # NOTE: this mutates the list that is passed in
    counts = Counter(trigrams).most_common()
    for trigram, count in counts:
        items = []
        for item in list:
            if trigram in item:
                hits.append(trigram)
                items.append(item)
        for i in items:
            list.remove(i)
    return f7(hits)
list1 = ['amsterdam','rotterdam','haarlem','utrecht','groningen']
print(identifying_grams(list1))
# Good, we get: ['ter', 'haa', 'utr', 'gro']
list2 = ['amsterdam','schiedam']
print(identifying_grams(list2))
# Good, we get: ['dam']
list3 = ['amsterdam','schiedam','terwolde','wolstad']
print(identifying_grams(list3))
# Ouch, we get: ['ter', 'dam', 'wol']
# this should be ['dam', 'wol'], since 2 trigrams are enough to identify every name in this list...
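To back up that claim, a brute-force check (the helper smallest_cover_bruteforce is hypothetical and only feasible for tiny lists, since it tries all trigram combinations of increasing size) confirms that a 2-trigram cover exists for list3:

from itertools import combinations

def smallest_cover_bruteforce(names, n=3):
    # set of n-grams per name
    gram_sets = [{name[i:i + n] for i in range(len(name) - n + 1)} for name in names]
    all_grams = sorted(set().union(*gram_sets))
    # try covers of size 1, 2, ... and return the first one that hits every name
    for size in range(1, len(all_grams) + 1):
        for combo in combinations(all_grams, size):
            if all(gs & set(combo) for gs in gram_sets):
                return list(combo)

print(smallest_cover_bruteforce(['amsterdam', 'schiedam', 'terwolde', 'wolstad']))
# ['dam', 'wol'] -- two trigrams suffice, so the greedy answer of three is not minimal

This is exponential and only meant to confirm the expected answer on small examples, not a scalable solution.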
I have received two answers so far, but both have flaws. The one from Rupesh works well for lists smaller than 10 items; my lists have over 50K items. The one from mujjiga does come up with a solution, albeit not the optimal one.
A bounty for the Python ninja who comes up with a perfect solution that scales. Bonus kudos if it performs well and gives the same solution every time it runs!
From: Find smallest list of unique n-grams in a list of strings