Statistics toolkit

In this practical we will write a collection of separate tools for performing statistics. There are two parts:

  1. learning to use a per-record processing tool (we will use awk); this is used for generating intermediate columns, where each new column depends only on values in the same row
  2. writing tools to perform statistics on the columns

awk tutorial

awk is a record- and field-based programming language. The typical awk command takes in a table of data, with "records" (rows) separated by newlines "\n" and "fields" (the entries in a row) separated by whitespace; in this practical our tables use tabs "\t" as the field separator.

seq 1 5 | awk '{print $1 * 2}'

The awk program here is the first argument, {print $1 * 2}. The braces {...} instruct awk to run the enclosed code on every record. The variable $1 refers to the first field in the record. (Here each record has only one field.)
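
For reference, this doubles each input number, so it prints:

2
4
6
8
10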

paste <(seq 1 5) <(seq 11 15) | awk '{$3 = $1 + $2; print}'

This adds a third column, the sum, to a two-column table.
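
The output will be:

1 11 12
2 12 14
3 13 16
4 14 18
5 15 20

Note that because we assigned to a field, awk rebuilt each record using its output field separator, which is a single space by default (set the OFS variable to "\t" if you want tab-separated output).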

Exercise: What if you wanted just the sum column, and didn't need the original table? Write an awk command that takes a two column table and outputs just the sum column.

You can use variables in awk to store state between records; here is an example that computes the difference between successive records (the built-in variable NR tells you how many records have been read so far, including the current one):

$ (echo 1; echo 2; echo 10) | awk '{if(NR > 1) print($1 - prev); prev = $1}'
1
8

Statistics tools

Exercise: Write a python program stats-sum which reads a newline-separated list of floating-point numbers from standard input. When it reaches the end of standard input, it prints the sum, and exits.

Here is a skeleton for your program; save it as stats-sum (this is the file name), and mark it executable using chmod +x stats-sum.

#!/usr/bin/python3

import sys

for line in sys.stdin:
  # do things
  pass

Here is how to test it:

$ seq 1 5 | ./stats-sum
15
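
For comparison, here is a minimal sketch of one way to complete the skeleton (there are plenty of other reasonable ways, for example collecting the values into a list and calling sum()):

#!/usr/bin/python3

import sys

total = 0.0
for line in sys.stdin:
  # each record is one floating-point number on its own line
  total += float(line)

# the "g" format drops the trailing ".0", so `seq 1 5` gives 15 rather than 15.0
print("{:g}".format(total))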

Exercise: Write similar "aggregator" programs stats-mean, stats-median, stats-variance, stats-stddev (standard deviation), and stats-mad (median absolute deviation). Feel free to use the standard library, but do not use any third-party python packages.
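
If you do use the standard library, the statistics module already covers several of these (the MAD you will have to build yourself, for example from statistics.median). Note that the sample output later in this practical (9.166667 for the variance of 1..10) is the sample variance, which is what statistics.variance computes (statistics.pvariance would give the population variance):

import statistics

statistics.mean([1, 2, 3])      # 2
statistics.median([1, 2, 3])    # 2
statistics.variance([1, 2, 3])  # sample variance (n - 1 denominator): 1
statistics.stdev([1, 2, 3])     # sample standard deviation: 1.0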

We will call these programs from a glue program using subprocess.Popen; the following example should tell you enough to complete the practical:

import subprocess

# open a subprocess with two-way communication
# if simply `"./stats-sum"` doesn't work, you can try
# passing the array `["python3", "./stats-sum"]`
my_subprocess = subprocess.Popen("./stats-sum",
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)

for x in [1,2,3]:
  # need to encode the string, because we communicate binary
  # it needs a linebreak, because that is how we separate records
  my_subprocess.stdin.write("{}\n".format(x).encode('utf-8'))

# closing stdin informs the subprocess that it has seen the end of its input
my_subprocess.stdin.close()

# read the binary result, and decode
# end='' avoids adding a second newline (the output already ends with one)
print(my_subprocess.stdout.read().decode('utf-8'), end='')
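
If your stats-sum from the previous exercise works, running this script should print 6 (the sum of 1, 2 and 3), formatted however your stats-sum chooses to format its output.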

Exercise: Write a program stats; this program reads standard input and takes arguments. Each argument names an aggregation: "mean", "median", "variance", and so on. The standard input is a numeric table with tab as the column separator and newline as the record separator. The nth column is fed to the nth aggregator program via subprocess.Popen; the results are printed as a single record.

Example of how we will run stats:

$ paste <(seq 1 10) <(seq 1 10) <(seq 1 10) | ./stats mean median variance
5.5	5.5	9.166667
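
If you get stuck on the overall shape of stats, here is a rough sketch of the glue logic, as a starting point only; it assumes the aggregator programs are named ./stats-mean, ./stats-median, and so on, and live in the current directory, and it does none of the error handling or design changes the following exercises ask about:

#!/usr/bin/python3

import subprocess
import sys

# one aggregator subprocess per argument, in argument order;
# e.g. `./stats mean median` launches ./stats-mean and ./stats-median
procs = [subprocess.Popen("./stats-" + name,
                          stdin=subprocess.PIPE, stdout=subprocess.PIPE)
         for name in sys.argv[1:]]

for line in sys.stdin:
  fields = line.rstrip("\n").split("\t")
  # feed the nth column to the nth aggregator
  for proc, field in zip(procs, fields):
    proc.stdin.write((field + "\n").encode("utf-8"))

# closing stdin tells each aggregator its column is finished
for proc in procs:
  proc.stdin.close()

# each aggregator prints one result; join them into a single record
results = [proc.stdout.read().decode("utf-8").strip() for proc in procs]
print("\t".join(results))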

Exercise: What are the benefits of this multiple-communicating-programs architecture? What are the drawbacks? Explain.

Exercise: What happens if your columns are different lengths? Are empty cells treated as zero? If so, change the design by altering the stats program to skip empty cells.

Exercise: Explain how you might change the design to permit more than one aggregation of a single column. How would you communicate this to stats with arguments? What logic needs to be changed in stats? Do you need to change the aggregator programs at all?

Exercise: Explain how you might change the design to permit two-column aggregators, for example, integration. How would you communicate this to stats with arguments? What logic needs to be changed in stats?

Exercise: Choose one of these two design changes, and implement it. If you choose the two-column aggregators: after sorting the data by the first column, the integral is the sum, over every record after the first, of the expression

	($1 - prev1) * $2
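
where prev1 is the first field of the previous record, as in the earlier awk difference example. For instance, for the sorted points (0, 1), (1, 2), (2, 4), the integral is (1 - 0) * 2 + (2 - 1) * 4 = 6; the first record only initializes prev1.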