Simple data structures: lists, trees, graphs

Storing compound data, or multiple data, is a common task in computer use. Data can be stored hierarchically, linearly, or in many other configurations. In this section we discuss a few of the most popular data structures.

Linear structures

There are a few popular linear structures; we'll discuss them very briefly. Note that these are all ordered structures.

Arrays

An array is a preallocated block of memory, divided into segments. For example, an ASCII string is an array of bytes. Physically, arrays are stored as a memory address of the first entry, the size (in bytes) of an item, and the number of items. To access the n-th item, you add the item size n times to the address.

If you don't know how many items you want in an array, you could use a dynamic array. These are arrays which occasionally resize to expand or contract as the number of items used changes. Resizing is an expensive operation: a new array is allocated, and the old array is copied into the new one. In order to reduce the number of resizing operations, many dynamic arrays only change size exponentially.

Linked lists

Linked lists are data structures where each item is allocated individually, and is stored with a next reference (usually the memory address) of the next item. If next is null (a special value to indicate no address, typically the zero address), the end of the list has been reached.

---------------       ---------------       ---------------
| x[0] | next |--->---| x[1] | next |--->---| x[2] | NULL | 
---------------       ---------------       ---------------

Changing the size of a linked list is very cheap: to add, you find the two items you want to insert between, and update the "next" pointers (in the correct order!):

---------------       ---------------       ---------------
| x[0] | next |--->-- | x[1] | next |--->---| x[2] | NULL | 
---------------       ---------------       ---------------
                      /
           --------------
           |  y  | next |
	   --------------


---------------       ---------------       ---------------
| x[0] | next |       | x[1] | next |--->---| x[2] | NULL | 
---------------       ---------------       ---------------
            \         /
           --------------
           |  y  | next |
	   --------------

To delete, you update the "next" pointer of the previous item to match the "next" pointer of the deleted item.

Exercise: use diagrams like the above to explain how to delete an item from a linked list.

Linked lists can be upgraded in various ways; for example, a doubly-linked-list has next and previous pointers, so it can be traversed in either direction.

The operation of adding an item to a linked list is called "insertion".

Stacks, queues

Stacks are a data structure in which items are visualised as being in a "stack": the only item accessible is the last item added; the only way to access earlier items is by removing the items added afterwards. The operation of adding to top of a stack is called "pushing", while removing from the top is called "popping". (The last item in the stack is called the "top".)

Queues are the opposite: the only item accessible is the first item added; to access items added later, the first item must be removed. Adding to the head of the queue is called "enqueuing", and removing "dequeuing". (The first item of the queue is called the "head".)

Trees

Trees are the most obvious way to construct a hierarchy: a tree starts with an item (typically called the "root") and pointers to a number of other subordinate items. The subordinates are called "children"; each item is allowed to have children. In this way we can refer to parents, descendents, ancestors.

Here we have the root node R, with children A and B. R is the parent of A and B. The tree has leaves X, Y, Z, and W; they have R as a grandparent, and A, B, R as ancestors.

           R
	  / \
	 A   B--
        / \   \ \
       X   Y   Z W

There are many different types of trees: a tree is binary if every node has at most two children. If the items can be ordered (for example, they are numbers), then a binary search tree is a binary tree such that the value of a node is greater than every descendent to the left, and less than every descendent to the right. A binary search tree is balanced if for every node, the sizes of the left and right subtrees differ by at most one.

Exercise: assemble the numbers 1-10 into binary search trees which are (a) maximally unbalanced to the left, (b) balanced, (c) one step from balanced.

Graphs

A graph is a non-hierarchical data structure admitting many-to-many relationships. The fundamental data of a graph is an unordered collection of items ("nodes" or "vertices"), and for each node a list of connections to other nodes ("edges").

Examples:

          ---          C 
         /   \       /   \
        A --- B     A --- B  E         A - B
         \   /       \
          ---          D

A directed graph is a collection of vertices together with a list of ordered connections to other graphs:

          -<-          C 
         ↙   ↖       ↗   ↘
        A ->- B     A ->- B  E         A -> B
         ↖   ↙       ↘
          -<-          D

A path in a graph or directed graph is a list of edges to follow. (If the graph is directed, the path must respect the edge directions.) The first node in the path is called the source or the start or the tail; the last is called the sink or the end or the head.

A graph is called connected if every node has a path to every other node. (Whether the graph is directed or not is irrelevant for this.)

A cycle is a path which starts and end are the same. A graph which has no cycles is called acyclic.

Exercise: assemble a directed acyclic graph with the numbers 1-12 by strict divisibility: an edge from A to B if B/A is prime. There are no directed cycles, but some nodes do have multiple paths to them. (These form cycles if you ignore the direction.) Which ones? Explain how to decide if a number will have multiple paths to it.

Every directed acyclic graph has at least one maximum spanning tree: a directed tree (i.e., arrows always point away from the root) inside the graph which touches every node.

Exercise: Identify several maximal spanning trees in the divisibility graph from the previous exercise.

Exercise: model acquaintance using a graph (vertices are people, an edge between A and B means A knows B). Model it with a directed graph. How are these different?