AdventOfCode Day 4 - High-Entropy Passphrases

Advent of Code Day 4 - High-Entropy Passphrases

AKA. Let's talk about Sets, baby.

Today's (2017/12/04) AdventOfCode puzzle highlights the power of Python sets to perform quite powerful data operations.

Sets are a data structure that seem to get less love than they probably deserve, given that they provide a powerful way to manipulate unique data. I will briefly discuss the way that sets operate, and how we can use some of their properties to rapidly solve todays problem.


Part One:

A new system policy has been put in place that requires all accounts to use a passphrase instead of simply a password. A passphrase consists of a series of words (lowercase letters) separated by spaces.

To ensure security, a valid passphrase must contain no duplicate words.

For example:

aa bb cc dd ee is valid.
aa bb cc dd aa is not valid - the word aa appears more than once.
aa bb cc dd aaa is valid - aa and aaa count as different words.

The system's full passphrase list is available as your puzzle input. How many passphrases are valid?


So here what we need to do is look at each "word" in an input line and check that it doesn't appear twice. If it does appear twice, the "passphrase" is invalid. So how can sets help us?

A set is fundamentally a mathematical object which contains a set of unique items. In mathematics this will typically be numbers but when extended to computer science it can be any object which you can test for uniqueness in some way. This post in not intended to go into great detail on the theory though, so check out the links below if you are interested in that.

The python documentation says

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference. (For other containers see the built-in dict, list, and tuple classes, and the collections module.)

Like other collections, sets support x in set, len(set), and for x in set. Being an unordered collection, sets do not record element position or order of insertion. Accordingly, sets do not support indexing, slicing, or other sequence-like behavior.

Okay, that's a lot of information in a small space. Lets pull out the parts that are relevant to this problem.

The most important part of a set is the word "distinct", this means that a set cannot contain the same element more than once. We can see this in operation by taking a list with repeated elements and creating a set from that list:

In [1]:
count = ["one", "hahaha", "two", "hahaha", "three", "hahaha"]
countset = set(count)
print(countset)
{'hahaha', 'three', 'one', 'two'}

In making a set from the list, the duplicates have been removed, but also the order appears to have been lost. For our purposes in solving this puzzle that is okay though as we only care about the presence of words, not their order.

As a side note the order is not so much lost as changed by python to allow the set to be searched more efficiently, this is done by arranging the elements in a hash table to allow easy searching for specific elements. The operations that can be performed on sets such as intersection, union, and difference make use of this because they need to rapidly look up many elements in both sets in order to check for their presence or absence. Although the elements are in fact sorted by their hash, that hash bears no relation to what humans consider to be a sensible sort order, and so the elements appear randomised. If order important to what you are doing, you will need a different data structure, which may be better suited to your purposes at the cost of being less efficient to perform set operations.

So now we have an easy way of getting all the unique words in a passphrase we can compare the length of the set of unique words to the length of the complete list. If they match, the passphrase has no repeated words and is valid. This can be implemented as:

In [ ]:
def day_4_1(passphrases):
    """Take the passphrases as a list of strings and return a count of the valid ones.
    """
    # keep a count
    valid_passphrases = 0
    #iterate over the list of passphrases
    for passphrase in passphrases:
        wordlist = passphrase.split() #split the string at every whitespace (the default)
        wordset = set(wordlist)
        if len(wordlist) == len(wordset):
            valid_passphrases += 1
    
    return valid_passphrases

This is exactly what is implemented (albeit more tersely) in my github repository of (nearly) oneliner solutions to AOC2017.

Part two is going to look very similar to this, with an added twist.


Part Two:

For added security, yet another system policy has been put in place. Now, a valid passphrase must contain no two words that are anagrams of each other - that is, a passphrase is invalid if any word's letters can be rearranged to form any other word in the passphrase.

For example:

abcde fghij is a valid passphrase.
abcde xyz ecdab is not valid - the letters from the third word can be rearranged to form the first word.
a ab abc abd abf abj is a valid passphrase, because all letters need to be used when forming another word.
iiii oiii ooii oooi oooo is valid.
oiii ioii iioi iiio is not valid - any of these words can be rearranged to form any other word.

Under this new system policy, how many passphrases are valid?


Now we need to look at the individual words more closely, making sure that the same set of letters is not repeated in any order. In this case we still need to build our set of words, but first we need to make sure we have a good way of comparing them. There are a few ways we could do this, but the most straightforward way is simply to sort each word before we compare them. Then regardless of the original order, if we have two anagrams we can be sure that after sorting they will be identical strings of letters. This makes our second solution just one line longer than the first:

In [2]:
def day_4_2(passphrases):
    """Take the passphrases as a list of strings and return a count of the valid ones.
    """
    # keep a count
    valid_passphrases = 0
    #iterate over the list of passphrases
    for passphrase in passphrases:
        wordlist = passphrase.split() #split the string at every whitespace (the default)
        wordlist = ["".join(sorted(word)) for word in wordlist] # sort each word and turn back into a string
        wordset = set(wordlist)
        if len(wordlist) == len(wordset):
            valid_passphrases += 1
    
    return valid_passphrases

This is only scratching the surface of the capabilities of python sets, they also have the capailities to do operations such as union, to merge two sets and discard duplicates; intersection, to find elements common to two sets; difference, to find elements of set A that are not in set B, and so on.

Further reading on python sets and general information on sets:

AdventOfCode Day 3 - Spiral Memory

Advent of Code Day 3 - Spiral Memory

AKA. "I'm afraid we need to use Math!"1

Today's (2017/12/03) advent of code puzzle is an interesting one, the first half which can be elegantly solved by applying a little mathematics instead of resorting to brute force.

The second half has no closed form analytic solution though, so we will have to write an algorithm instead.

The first part of the puzzle we have been given is:


Part 1:

You come across an experimental new kind of memory stored on an infinite two-dimensional grid.

Each square on the grid is allocated in a spiral pattern starting at a location marked 1 and then counting up while spiraling outward. For example, the first few squares are allocated like this:

17  16  15  14  13
18   5   4   3  12
19   6   1   2  11
20   7   8   9  10
21  22  23---> ...

While this is very space-efficient (no squares are skipped), requested data must be carried back to square 1 (the location of the only access port for this memory system) by programs that can only move up, down, left, or right. They always take the shortest path: the Manhattan Distance between the location of the data and square 1.

For example:

Data from square 1 is carried 0 steps, since it's at the access port.
Data from square 12 is carried 3 steps, such as: down, left, left.
Data from square 23 is carried only 2 steps: up twice.
Data from square 1024 must be carried 31 steps.

How many steps are required to carry the data from the square identified in your puzzle input all the way to the access port?


This is an interesting little problem which you might have guessed is going to involve square numbers, specifically the squares of all the odd numbers.

We can split it into two steps, first, given a number, we need to know where it is in the grid. Then we need to find how far it is from that position back to the position of element 1.

Starting with the position of a given number, lets look at the size of the grid is related to the total number of elements in it. The sizes of the grid increase by two each time, adding a row top and bottom, and left and right, starting from $1\times1$ to $3\times3$ to $5\times5$ and so on. Meanwhile the total number of elements is the number of rows multiplied by the number of columns. So we can see that the total number of elements in grids of increasing size is the sequence of squares of the odd numbers.

So can we use this to find an expression for which ring of the spiral a number is in? Starting with the $3\times3$ grid, lets take the square root of all the numbers:

In [1]:
import numpy as np
memory_grid = np.asarray([[ 17.,  16.,  15.,  14.,  13.],
                [ 18.,   5.,   4.,   3.,  12.],
                [ 19.,   6.,   1.,   2.,  11.],
                [ 20.,   7.,   8.,   9.,  10.],
                [ 21.,  22.,  23.,  24.,  25.]])

print(np.sqrt(memory_grid))
[[ 4.12310563  4.          3.87298335  3.74165739  3.60555128]
 [ 4.24264069  2.23606798  2.          1.73205081  3.46410162]
 [ 4.35889894  2.44948974  1.          1.41421356  3.31662479]
 [ 4.47213595  2.64575131  2.82842712  3.          3.16227766]
 [ 4.58257569  4.69041576  4.79583152  4.89897949  5.        ]]

And compare this to what we want for the $3\times3$ grid:

2 2 2 2 2
2 1 1 1 2
2 1 0 1 2
2 1 1 1 2
2 2 2 2 2

which we number from zero because it is how we do things in Python ;)

Looking at the bottom right of the grid above we see that the maximum square root in each ring is actually giving us the number of elements in the row and column of its ring. This is easy to convert into the number of the ring, because as we have already seen we add two at each step, starting from one, so to get from the number of elements to the number of the ring we need to just subtract one and divide by two:

$$ n_\mathrm{ring} = \frac{n_\mathrm{row,col} - 1}{2} $$

First though, to be able to do it for any number we need to fix the rest of the entries in our square rooted grid above. Looking again at the results we can see that in each ring we need to round up to the next odd number. This can be done by rounding up to the next integer, dividing by two, rounding down, then multiplying by two and adding one.

$$ n_\mathrm{next\ odd} = \mathrm{floor}(\mathrm{ceil}(x)/2) * 2 + 1 $$

Combining this with the operation to convert a row size into a ring number we see that we will just undo the adding of one and multiplication by two and so to get the ring number of any location $x$ in the spiral grid we have

$$ n_\mathrm{ring}(x) = \mathrm{floor}(\mathrm{ceil}(\sqrt{x})/2). $$

The division and rounding down can be done in one operation thanks to python's integer division operator '//':

In [21]:
ring_no = np.ceil(np.sqrt(memory_grid))//2
print(ring_no)
[[ 2.  2.  2.  2.  2.]
 [ 2.  1.  1.  1.  2.]
 [ 2.  1.  0.  1.  2.]
 [ 2.  1.  1.  1.  2.]
 [ 2.  2.  2.  2.  2.]]

Now we have the ring number, we need to work out where in the ring the number occurs. Looking back to the example we are given we see that each ring is indeed starting from the second from bottom cell on the right hand side. This is slightly annoying but not a disaster.

We could go through the process of finding a coordinate relative to the centre of the grid, but to get the number we are looking for that is not necessary. Because movement is limited to the four cardinal directions, the distance from any point to the centre of the grid is given by the Taxicab Metric, which is the sum of the distance moved in the up-down and left-right directions.

Looking at the grid again we can see that for the numbers in the middle of a row or column, they are already lined up with the centre in one of the two directions. Therefore they need only move in the other direction, and the distance to the centre is equal to the number of the ring that they are on.

For any other element we then only need to work out how far it is from the centre of one of the sides of its ring, and add this to the number of the ring to get the distance to the centre. (Note that this is the shortest path, but not the only shortest path, because of the particular way that the taxicab geometry works. This is a fascinating subject in its own right and the wikipedia page is definitely work at least a skim.)

So to find the distance a given element is from the centre of an edge of its ring we start by numbering around the ring starting from a corner. This can be expressed by subtracting the size of the grid inside the ring which is $(2n-1)^2$, and then taking the modulus of the position number with the total number of elements in the ring $4*(2n-1) + 4$.

Unfortunately this will cause a division by zero if we try to compute the offset of the centre cell, but since this is a special case anyway, we will not worry about that and carry on.

In [22]:
offsets = (memory_grid - (2*ring_no - 1)**2) % (4*(2*ring_no-1)+4)
print(offsets)
[[  8.   7.   6.   5.   4.]
 [  9.   4.   3.   2.   3.]
 [ 10.   5.  nan   1.   2.]
 [ 11.   6.   7.   0.   1.]
 [ 12.  13.  14.  15.   0.]]
/usr/lib64/python3.5/site-packages/ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in remainder
  """Entry point for launching an IPython kernel.

At this point we don't actually care what side the element is on, just its distance from the nearest centre of a side. Currently we have calculated the distance from the bottom right corner, we can instead calculate the distance from the nearest corner by taking the modulus of this result with the distance between corners, which is twice the ring number $2n$.

In [23]:
side_offsets = offsets%(2*ring_no)
print(side_offsets)
[[  0.   3.   2.   1.   0.]
 [  1.   0.   1.   0.   3.]
 [  2.   1.  nan   1.   2.]
 [  3.   0.   1.   0.   1.]
 [  0.   1.   2.   3.   0.]]

We see that the furthest distance an element can be from the centre of an edge of its ring is the number of moves from the middle to the corner, which is given by half the side length minus 1. Conveniently, this is just the ring number. Subtracting this, and taking the absolute value we finally get our expression for the distance from the centre of a side:

In [26]:
side_distance = np.abs(side_offsets - ring_no)
print(side_distance)
[[  2.   1.   0.   1.   2.]
 [  1.   1.   0.   1.   1.]
 [  0.   0.  nan   0.   0.]
 [  1.   1.   0.   1.   1.]
 [  2.   1.   0.   1.   2.]]

Finally we add this to the ring number to get the taxicab metric distance of each point to the centre:

In [27]:
print(side_distance + ring_no)
[[  4.   3.   2.   3.   4.]
 [  3.   2.   1.   2.   3.]
 [  2.   1.  nan   1.   2.]
 [  3.   2.   1.   2.   3.]
 [  4.   3.   2.   3.   4.]]

Putting this all together we can finally write a function to take grid positions and output the distance to the grid centre:

In [28]:
def spiral_grid_distance(x):
    ring_no = np.ceil(np.sqrt(x))//2
    ring_offset = (x - (2*ring_no - 1)**2) % (4*(2*ring_no-1)+4)
    side_distance = np.abs(ring_offset%(2*ring_no) - ring_no)
    return(ring_no + side_distance)

And calculate our answer for the puzzle, in my case the input is 325489

In [57]:
%timeit spiral_grid_distance(325489)
The slowest run took 7.05 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.88 µs per loop

Checking the timing we have no problems with being done within the half millisecond we have to complete this puzzle. Although this is no exactly a surprise given we are doing just a few operations on a single number.

Part 2:

Part two, as I mentioned above, requires an algorithmic rather than purely mathemtical approach. Here is the problem we are given:


As a stress test on the system, the programs here clear the grid and then store the value 1 in square 1. Then, in the same allocation order as shown above, they store the sum of the values in all adjacent squares, including diagonals.

So, the first few squares' values are chosen as follows:

Square 1 starts with the value 1.
Square 2 has only one adjacent filled square (with value 1), so it also stores 1.
Square 3 has both of the above squares as neighbors and stores the sum of their values, 2.
Square 4 has all three of the aforementioned squares as neighbors and stores the sum of their values, 4.
Square 5 only has the first and fourth squares as neighbors, so it gets the value 5.

Once a square is written, its value does not change. Therefore, the first few squares would receive the following values:

147 142 133 122 59 304 5 4 2 57 330 10 1 1 54 351 11 23 25 26 362 747 806---> ...

What is the first value written that is larger than your puzzle input?


This time there is not going to be a closed form solution we can work out. That is to say we might be able to work out an expression in terms of series expression using sums and products but we would still most likely have to write a program to compute it. So instead I am going to use a "wall following algorithm"

Specifically, I will use a "left-wall following algorithm" since this is a left hand spiral. Essentially this is summed up as "Turn left if you can, otherwise go straight until you can turn left"

In order to implement this we simply keep track of where we have been, and if we can turn left into a cell we have not yet visited, we do that.

There are a few restrictions on what kind of memory structure we might want to use for this. Numpy arrays are an option but we cannot grow them arbitrarily so we would have to keep stopping to resize the array, and that gets annoying. So instead, lets use a pure python dictionary, which, along with using tuples for the keys will be an easy, if not computationally optimal way to implement the solution.

Using a co-ordinate system with the starting cell at (0,0), our left wall follower looks like this:

In [1]:
def left_wall_follower(limit):
    cells = {(0,0):1}
    # initial position
    x = 1
    y = 0
    # initial directions
    dx = 1
    dy = 0 
    while True:
        # update current cell
        csum = 0
        # Try to sum over all directions
        for sdx in range(-1,2):
            for sdy in range(-1,2):
                try:
                    csum += cells[(x+sdx,y+sdy)]
                except:
                    pass
        cells[(x,y)] = csum
        # check if we have hit the limit
        if csum > limit:
            break
            
        # if not, find next potential direction
        dxn = -dy
        dyn = dx
        # check if we can turn
        try:
            lhs = cells[(x+dxn,y+dyn)]
        except:
            # cell not set means we can turn
            dx = dxn
            dy = dyn

        # move to next cell and carry on
        x = x + dx
        y = y + dy
    
    return(csum)

Note that this is not an optimal approach, as the algorithm performs brute force checks as to whether cells are empty and whether it is possible to turn. A more efficient implementation would instead keep a track of where it was and so could "know" rather than check what cells it could read and whether it could turn.

This is good enough for me though, since it is elegant(ish), readable, and for the input I am given by adventofcode:

In [54]:
%timeit left_wall_follower(325489)
1000 loops, best of 3: 235 µs per loop

We are done in less than the half millisecond I have available to solve it. :)

1[https://en.wikipedia.org/wiki/The_Prisoner_of_Benda]