What is an Array?

Before we look at indices, let’s take a step back and look at the NumPy array. Just what is it that makes NumPy arrays so ubiquitous and NumPy one of the most successful numerical tools ever? The answer is quite a few things, which come together to make NumPy a fast and easy to use library for array computations. But one feature in particular stands out: multidimensional indexing.

Let’s consider pure Python for a second. Suppose we have a list of values, for example, a list of your bowling scores.

>>> scores = [70, 65, 71, 80, 73]

From what we learned before, we can now index this list with integers or slices to get some subsets of it.

>>> scores[0] # The first score
70
>>> scores[-3:] # The last three scores
[71, 80, 73]

You can imagine all sorts of different things you’d want to do with your scores that might involve selecting individual scores or ranges of scores. For example, with the above examples, we could easily compute the average score of our last three games, and see how it compares to our first game. So hopefully you are convinced that at least the types of indices we have learned so far are useful.

Now suppose your bowling buddy Bob learns that you are keeping track of scores and wants you to add his scores as well. He bowls with you, so his scores correspond to the same games as yours. You could make a new list, bob_scores, but this means storing a new variable. You’ve got a feeling you are going to end up keeping track of a lot of people’s scores. So instead, you change your scores list into a list of lists. The first inner list is your scores, and the second will be Bob’s:

>>> scores = [[70, 65, 71, 80, 73], [100, 93, 111, 104, 113]]

Now you can easily get your scores:

>>> scores[0]
[70, 65, 71, 80, 73]

and Bob’s scores:

>>> scores[1]
[100, 93, 111, 104, 113]

But now there’s a problem (aside from the obvious problem that Bob is a better bowler than you). If you want to see what everyone’s scores are for the first game, you have to do something like this:

>>> [p[0] for p in scores]
[70, 100]

That’s a mess. Clearly, you should have inverted the way you constructed your list of lists, so that each inner list corresponds to a game, and each element of that list corresponds to the person (for now, just you and Bob):

>>> scores = [[70, 100], [65, 93], [71, 111], [80, 104], [73, 113]]

Now you can much more easily get the scores for the first game

>>> scores[0]
[70, 100]

Except now if you want to look at just your scores for all games (that was your original purpose after all, before Bob got involved), it’s the same problem again. To extract that you have to do

>>> [game[0] for game in scores]
[70, 65, 71, 80, 73]

which is the same mess as above. What are you to do?

The NumPy array provides an elegant solution to this problem. Our idea of storing the scores as a list of lists was a good one, but unfortunately, it pushed the limits of what the Python list type was designed to do. Python lists can store anything, be it numbers, strings, or even other lists. If we want to tell Python to index a list that is inside of another list, we have to do it manually, because the elements of the outer list might not even be lists. For example, l = [1, [2, 3]] is a perfectly valid Python list, but the expression [i[0] for i in l] is invalid, because not every element of l is a list.

NumPy arrays function like a list of lists, but are restricted so that these kinds of operations always “make sense”. More specifically, if you have a “list of lists”, each element of the “outer list” must be a list. [1, [2, 3]] is not a valid NumPy array. Furthermore, each inner list must have the same length, or more precisely, the lists at each level of nesting must have the same length.

Lists of lists can be nested more than just two levels deep. For example, you might want to take your scores and create a new outer list, splitting them by season. Then you would have a list of lists of lists, and your indexing operations would look like [[game[0] for game in season] for season in scores].

In NumPy, these different levels of nesting are called axes or dimensions. The number of axes—the level of nesting—is called the rank[1] or dimensionality of the array. Together, the lengths of these lists at each level are called the shape of the array (remember that the lists at each level have to have the same number of elements).

A NumPy array of our scores (using the last representation) looks like this

>>> import numpy as np
>>> scores = np.array([[70, 100], [65, 93], [71, 111], [80, 104], [73, 113]])

Except for the np.array() call, it looks exactly the same as the list of lists. But the difference is indexing. If we want the first game, as before, we use scores[0]:

>>> scores[0]
array([ 70, 100])

But if we want to find only our scores, instead of using a list comprehension, we can simply use

>>> scores[:, 0]
array([70, 65, 71, 80, 73])

This index contains two components: the slice : and the integer index 0. The slice : says to take everything from the first axis (which represents games), and the integer index 0 says to take the first element of the second axis (which represents people).

The shape of our array is a tuple with the number of games (the outer axis) and the number of people (the inner axis).

>>> scores.shape
(5, 2)

This is the power of multidimensional indexing in NumPy arrays. If we have a list of lists of numbers, or a list of lists of lists of numbers, or a list of lists or lists of lists…, we can index things at any “nesting level” equally easily. There is a small reasonable restriction, namely that each “level” of lists (axis) must have the same number of elements. This restriction is reasonable because in the real world, data tends to be tabular, like bowling scores, meaning each axis will naturally have the same number of elements. Even if this weren’t the case, for instance, if Bob were out sick for a game, we could easily use a sentinel value like -1 or nan for a missing value to maintain uniform lengths.

The indexing semantics are only a small part of what makes NumPy arrays so powerful. They also have many other advantages that are unrelated to indexing. They operate on contiguous memory using native machine data types, which makes them very fast. They can be manipulated using array expressions with broadcasting semantics; for example, you can easily add a handicap to the scores array with something like scores + np.array([124, 95]), which would itself be a nightmare using the list of lists representation. This, along with the powerful ecosystem of libraries like scipy, matplotlib, and thousands of others, are what have made NumPy such a popular and essential tool.

Footnotes