Statistics toolkit

In this practical we will write a collection of separate tools for performing statistics. There are two parts:

  1. learning to use a per-record processing tool (we will use awk); this is used for generating intermediate columns, where each new column depends only on values in the same row
  2. writing tools to perform statistics on the columns

awk tutorial

awk is a record- and field-based programming language. The typical awk command takes in a table of data, with "records" (rows) separated by newlines "\n" and "fields" (the entries in a row) separated by whitespace; in this practical our tables use tabs "\t" as the field separator.

seq 1 5 | awk '{print $1 * 2}'

The awk program here is the first argument, {print $1 * 2}. The braces {...} instruct awk to run the enclosed code on every record. The variable $1 refers to the first field in the record. (Here each record has only one field.)
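
For reference, this doubles each input number, so it prints:

2
4
6
8
10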

paste <(seq 1 5) <(seq 11 15) | awk '{$3 = $1 + $2; print}'

This adds a third column, the sum, to a two-column table.
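
The output will be:

1 11 12
2 12 14
3 13 16
4 14 18
5 15 20

Note that because we assigned to a field, awk rebuilt each record using its output field separator, which is a single space by default (set the OFS variable to "\t" if you want tab-separated output).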

Exercise: What if you wanted just the sum column, and didn't need the original table? Write an awk command that takes a two column table and outputs just the sum column.

You can use variables in awk to store state between records; here is an example that computes the difference between successive records (the built-in variable NR tells you how many records have been read so far, including the current one):

$ (echo 1; echo 2; echo 10) | awk '{if(NR > 1) print($1 - prev); prev = $1}'
1
8

Statistics tools

Exercise: Write a python program stats-sum which reads a newline-separated list of floating-point numbers from standard input. When it reaches the end of standard input, it prints the sum, and exits.

Here is a skeleton for your program; save it as stats-sum (this is the file name), and mark it executable using chmod +x stats-sum.

#!/usr/bin/python3

import sys

for line in sys.stdin:
  # do things
  pass

Here is how to test it:

$ seq 1 5 | ./stats-sum
15
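
For comparison, here is a minimal sketch of one way to complete the skeleton (there are plenty of other reasonable ways, for example collecting the values into a list and calling sum()):

#!/usr/bin/python3

import sys

total = 0.0
for line in sys.stdin:
  # each record is one floating-point number on its own line
  total += float(line)

# the "g" format drops the trailing ".0", so `seq 1 5` gives 15 rather than 15.0
print("{:g}".format(total))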

Exercise: Write similar "aggregator" programs stats-mean, stats-median, stats-variance, stats-stddev (standard deviation), and stats-mad (median absolute deviation). Feel free to use the standard library, but do not use any third-party python packages.
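
If you do use the standard library, the statistics module already covers several of these (the MAD you will have to build yourself, for example from statistics.median). Note that the sample output later in this practical (9.166667 for the variance of 1..10) is the sample variance, which is what statistics.variance computes (statistics.pvariance would give the population variance):

import statistics

statistics.mean([1, 2, 3])      # 2
statistics.median([1, 2, 3])    # 2
statistics.variance([1, 2, 3])  # sample variance (n - 1 denominator): 1
statistics.stdev([1, 2, 3])     # sample standard deviation: 1.0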

We will call these programs from a glue program using subprocess.Popen; the following example should tell you enough to complete the practical:

import subprocess

# open a subprocess with two-way communication
# if simply `"./stats-sum"` doesn't work, you can try
# passing the array `["python3", "./stats-sum"]`
my_subprocess = subprocess.Popen("./stats-sum",
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)

for x in [1,2,3]:
  # need to encode the string, because we communicate binary
  # it needs a linebreak, because that is how we separate records
  my_subprocess.stdin.write("{}\n".format(x).encode('utf-8'))

# closing stdin informs the subprocess that it has seen the end of its input
my_subprocess.stdin.close()

# read the binary result, and decode
# end='' avoids adding a second newline (the output already ends with one)
print(my_subprocess.stdout.read().decode('utf-8'), end='')
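
If your stats-sum from the previous exercise works, running this script should print 6 (the sum of 1, 2 and 3), formatted however your stats-sum chooses to format its output.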

Exercise: Write a program stats; this program reads standard input and takes arguments. Each argument names an aggregation: "mean", "median", "variance", and so on. The standard input is a numeric table with tab as the column separator and newline as the record separator. The nth column is fed to the nth aggregator program via subprocess.Popen; the results are printed as a single record.

Example of how we will run stats:

$ paste <(seq 1 10) <(seq 1 10) <(seq 1 10) | ./stats mean median variance
5.5	5.5	9.166667
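
If you get stuck on the overall shape of stats, here is a rough sketch of the glue logic, as a starting point only; it assumes the aggregator programs are named ./stats-mean, ./stats-median, and so on, and live in the current directory, and it does none of the error handling or design changes the following exercises ask about:

#!/usr/bin/python3

import subprocess
import sys

# one aggregator subprocess per argument, in argument order;
# e.g. `./stats mean median` launches ./stats-mean and ./stats-median
procs = [subprocess.Popen("./stats-" + name,
                          stdin=subprocess.PIPE, stdout=subprocess.PIPE)
         for name in sys.argv[1:]]

for line in sys.stdin:
  fields = line.rstrip("\n").split("\t")
  # feed the nth column to the nth aggregator
  for proc, field in zip(procs, fields):
    proc.stdin.write((field + "\n").encode("utf-8"))

# closing stdin tells each aggregator its column is finished
for proc in procs:
  proc.stdin.close()

# each aggregator prints one result; join them into a single record
results = [proc.stdout.read().decode("utf-8").strip() for proc in procs]
print("\t".join(results))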

Exercise: What are the benefits of this multiple-communicating-programs architecture? What are the drawbacks? Explain.

Exercise: What happens if your columns are different lengths? Are empty cells treated as zero? If so, change the design by altering the stats program to skip empty cells.

Exercise: Explain how you might change the design to permit more than one aggregation of a single column. How would you communicate this to stats with arguments? What logic needs to be changed in stats? Do you need to change the aggregator programs at all?

Exercise: Explain how you might change the design to permit two-column aggregators, for example, integration. How would you communicate this to stats with arguments? What logic needs to be changed in stats?

Exercise: Choose one of these two design changes, and implement it. If you choose the two-column aggregators: after sorting the data by the first column, the integral is the sum, over every record after the first, of the expression

	($1 - prev1) * $2
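
where prev1 is the first field of the previous record, as in the earlier awk difference example. For instance, for the sorted points (0, 1), (1, 2), (2, 4), the integral is (1 - 0) * 2 + (2 - 1) * 4 = 6; the first record only initializes prev1.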