Advent of Code Day 4 - High-Entropy Passphrases¶

AKA. Let's talk about Sets, baby.¶

Today's (2017/12/04) AdventOfCode puzzle highlights the power of Python sets to perform quite powerful data operations.

Sets are a data structure that seem to get less love than they probably deserve, given that they provide a powerful way to manipulate unique data. I will briefly discuss the way that sets operate, and how we can use some of their properties to rapidly solve todays problem.

Part One:¶

A new system policy has been put in place that requires all accounts to use a passphrase instead of simply a password. A passphrase consists of a series of words (lowercase letters) separated by spaces.

To ensure security, a valid passphrase must contain no duplicate words.

For example:

aa bb cc dd ee is valid.
aa bb cc dd aa is not valid - the word aa appears more than once.
aa bb cc dd aaa is valid - aa and aaa count as different words.

The system's full passphrase list is available as your puzzle input. How many passphrases are valid?

So here what we need to do is look at each "word" in an input line and check that it doesn't appear twice. If it does appear twice, the "passphrase" is invalid. So how can sets help us?

A set is fundamentally a mathematical object which contains a set of unique items. In mathematics this will typically be numbers but when extended to computer science it can be any object which you can test for uniqueness in some way. This post in not intended to go into great detail on the theory though, so check out the links below if you are interested in that.

The python documentation says

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference. (For other containers see the built-in dict, list, and tuple classes, and the collections module.)

Like other collections, sets support x in set, len(set), and for x in set. Being an unordered collection, sets do not record element position or order of insertion. Accordingly, sets do not support indexing, slicing, or other sequence-like behavior.

Okay, that's a lot of information in a small space. Lets pull out the parts that are relevant to this problem.

The most important part of a set is the word "distinct", this means that a set cannot contain the same element more than once. We can see this in operation by taking a list with repeated elements and creating a set from that list:

count = ["one", "hahaha", "two", "hahaha", "three", "hahaha"]
countset = set(count)
print(countset)

{'hahaha', 'three', 'one', 'two'}

In making a set from the list, the duplicates have been removed, but also the order appears to have been lost. For our purposes in solving this puzzle that is okay though as we only care about the presence of words, not their order.

As a side note the order is not so much lost as changed by python to allow the set to be searched more efficiently, this is done by arranging the elements in a hash table to allow easy searching for specific elements. The operations that can be performed on sets such as intersection, union, and difference make use of this because they need to rapidly look up many elements in both sets in order to check for their presence or absence. Although the elements are in fact sorted by their hash, that hash bears no relation to what humans consider to be a sensible sort order, and so the elements appear randomised. If order important to what you are doing, you will need a different data structure, which may be better suited to your purposes at the cost of being less efficient to perform set operations.

So now we have an easy way of getting all the unique words in a passphrase we can compare the length of the set of unique words to the length of the complete list. If they match, the passphrase has no repeated words and is valid. This can be implemented as:

def day_4_1(passphrases):
    """Take the passphrases as a list of strings and return a count of the valid ones.
    """
    # keep a count
    valid_passphrases = 0
    #iterate over the list of passphrases
    for passphrase in passphrases:
        wordlist = passphrase.split() #split the string at every whitespace (the default)
        wordset = set(wordlist)
        if len(wordlist) == len(wordset):
            valid_passphrases += 1
    
    return valid_passphrases

This is exactly what is implemented (albeit more tersely) in my github repository of (nearly) oneliner solutions to AOC2017.

Part two is going to look very similar to this, with an added twist.

Part Two:¶

For added security, yet another system policy has been put in place. Now, a valid passphrase must contain no two words that are anagrams of each other - that is, a passphrase is invalid if any word's letters can be rearranged to form any other word in the passphrase.

For example:

abcde fghij is a valid passphrase.
abcde xyz ecdab is not valid - the letters from the third word can be rearranged to form the first word.
a ab abc abd abf abj is a valid passphrase, because all letters need to be used when forming another word.
iiii oiii ooii oooi oooo is valid.
oiii ioii iioi iiio is not valid - any of these words can be rearranged to form any other word.

Under this new system policy, how many passphrases are valid?

Now we need to look at the individual words more closely, making sure that the same set of letters is not repeated in any order. In this case we still need to build our set of words, but first we need to make sure we have a good way of comparing them. There are a few ways we could do this, but the most straightforward way is simply to sort each word before we compare them. Then regardless of the original order, if we have two anagrams we can be sure that after sorting they will be identical strings of letters. This makes our second solution just one line longer than the first:

def day_4_2(passphrases):
    """Take the passphrases as a list of strings and return a count of the valid ones.
    """
    # keep a count
    valid_passphrases = 0
    #iterate over the list of passphrases
    for passphrase in passphrases:
        wordlist = passphrase.split() #split the string at every whitespace (the default)
        wordlist = ["".join(sorted(word)) for word in wordlist] # sort each word and turn back into a string
        wordset = set(wordlist)
        if len(wordlist) == len(wordset):
            valid_passphrases += 1
    
    return valid_passphrases

This is only scratching the surface of the capabilities of python sets, they also have the capailities to do operations such as union, to merge two sets and discard duplicates; intersection, to find elements common to two sets; difference, to find elements of set A that are not in set B, and so on.

Further reading on python sets and general information on sets:

Python Docs: https://docs.python.org/3.5/library/stdtypes.html#set-types-set-frozenset
Wikibooks Tutorial: https://en.wikibooks.org/wiki/Python_Programming/Sets
Mathematical Set: https://en.wikipedia.org/wiki/Set_(mathematics)
Computer Science Set: https://en.wikipedia.org/wiki/Set_(abstract_data_type)