In this practical we will write a collection of separate tools for performing statistics. There are two parts:
awk: this is used for generating intermediate columns, where the new column depends only on values in the same row
awk tutorial
awk is a record- and field-based programming language. The typical awk command takes in a table of data, with "records" (rows) separated by newlines "\n", and fields (entries in a row) separated by tabs "\t".
seq 1 5 | awk '{print $1 * 2}'
The awk program here is the first argument, {print $1 * 2}. The braces {...} instruct awk to run the enclosed code on every record. The variable $1 refers to the first entry in the record. (Here we have only one entry.)
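For readers more at home in Python, the per-record model can be sketched as below. This only approximates what awk does with {print $1 * 2}: the function name double_first_field is hypothetical, and the sketch assumes every record has a numeric first field.

```python
import io

def double_first_field(stream):
    """Yield twice the first field of each record, like awk '{print $1 * 2}'."""
    for record in stream:            # one record per line
        fields = record.split()      # awk's default: split on runs of whitespace
        yield float(fields[0]) * 2   # fields[0] plays the role of $1

# Stand-in for the output of `seq 1 5`
sample = io.StringIO("1\n2\n3\n4\n5\n")
for value in double_first_field(sample):
    print(f"{value:g}")
```

The loop body corresponds to the code inside the braces: it runs once per record, and nothing is carried over from one record to the next.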
paste <(seq 1 5) <(seq 11 15) | awk '{$3 = $1 + $2; print}'
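This adds a third column, the sum, to a two-column table. (Assigning to a field makes awk rebuild the record using its output field separator, a single space by default; pass -v OFS='\t' if the output must stay tab-separated.)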
Exercise: What if you wanted just the sum column, and didn't need the original table? Write an awk command that takes a two-column table and outputs just the sum column.
You can use some variables in awk to store state; here is an example that computes the difference between successive records (the built-in variable NR tells you how many records you have processed):
$ (echo 1; echo 2; echo 10) | awk '{if(NR > 1) print($1 - prev); prev = $1}'
1
8
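The same stateful pattern can be sketched in Python for comparison. The function name successive_differences is hypothetical; the counter from enumerate stands in for NR, and prev is carried from one record to the next just as in the awk program.

```python
def successive_differences(values):
    """Yield each value minus the one before it, skipping the first record."""
    prev = None                                       # state kept between records
    for count, value in enumerate(values, start=1):   # count plays the role of NR
        if count > 1:
            yield value - prev
        prev = value

print(list(successive_differences([1, 2, 10])))
```

As in the awk version, the first record produces no output because it has no predecessor.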
Exercise: Write a Python program stats-sum which reads a newline-separated list of floating-point numbers from standard input. When it reaches the end of standard input, it prints the sum, and exits.
Here is a skeleton for your program; save it as stats-sum (this is the file name), and mark it executable using chmod +x stats-sum.
#!/usr/bin/python3
import sys

for line in sys.stdin:
    # do things
    pass  # placeholder so the skeleton runs; replace with your code
Here is how to test it:
$ seq 1 5 | ./stats-sum
15
Exercise: Write similar "aggregator" programs stats-mean, stats-median, stats-variance, stats-stddev (standard deviation), and stats-mad (median absolute deviation). Feel free to use the standard library, but do not use any third-party Python packages.
We will call these programs from a glue program using subprocess.Popen; the following example should tell you enough to complete the practical:
import subprocess

# open a subprocess with two-way communication
# if simply `"./stats-sum"` doesn't work, you can try
# passing the array `["python3", "./stats-sum"]`
my_subprocess = subprocess.Popen("./stats-sum",
                                 stdin=subprocess.PIPE, stdout=subprocess.PIPE)
for x in [1, 2, 3]:
    # need to encode the string, because we communicate binary
    # it needs a linebreak, because that is how we separate records
    my_subprocess.stdin.write("{}\n".format(x).encode('utf-8'))
# inform the subprocess that this is the end of input
my_subprocess.stdin.close()
# read the binary result, and decode
# don't double the newlines
print(my_subprocess.stdout.read().decode('utf-8'), end='')
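As an alternative to writing, closing, and reading by hand, the standard library's Popen.communicate sends all the input and collects all the output in one call. The sketch below is one possible way to wrap that, not the method the practical requires. The helper run_aggregator and the record-counting child program are hypothetical stand-ins, used so the example runs without any of your stats-* programs.

```python
import subprocess
import sys

# Hypothetical stand-in for an aggregator: counts the records on its standard input.
COUNT_RECORDS = "import sys; print(sum(1 for _ in sys.stdin))"

def run_aggregator(command, values):
    """Send one value per line to `command` and return its decoded output."""
    payload = "".join("{}\n".format(v) for v in values).encode("utf-8")
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    # communicate() writes the payload, closes stdin, and reads stdout to the end
    out, _ = proc.communicate(payload)
    return out.decode("utf-8")

print(run_aggregator([sys.executable, "-c", COUNT_RECORDS], [1, 2, 3]), end="")
```

With communicate the whole column has to be held in memory before it is sent. The write-then-close approach above streams values one at a time, which may matter for large inputs.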
Exercise: Write a program stats; this program reads standard input and takes arguments. The arguments it takes are aggregations "mean", "median", "variance", etc. The standard input is a numeric table with tab as the column separator and newline as the record separator. The nth column is fed to the nth aggregator program via popen; the results are printed as a single record.
Example of how we will run stats:
$ paste <(seq 1 10) <(seq 1 10) <(seq 1 10) | ./stats mean median variance
5.5 5.5 9.166667
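The variance shown in this example is consistent with the sample variance (dividing by n - 1) rather than the population variance (dividing by n). That is an inference from the printed number, not a stated requirement. A quick check with the standard library's statistics module, using the same values as seq 1 10:

```python
import statistics

column = [float(x) for x in range(1, 11)]   # same values as `seq 1 10`

mean = statistics.mean(column)
median = statistics.median(column)
sample_variance = statistics.variance(column)       # divides by n - 1
population_variance = statistics.pvariance(column)  # divides by n

print("{:g}\t{:g}\t{:.6f}".format(mean, median, sample_variance))
print("population variance would be {:g}".format(population_variance))
```

The sum of squared deviations from the mean is 82.5, so dividing by 9 gives 9.166667 and dividing by 10 gives 8.25.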
Exercise: What are the benefits of this multiple-communicating-programs architecture? What are the drawbacks? Explain.
Exercise: What happens if your columns are different lengths? Are empty cells treated as zero? If so, change the design by altering the stats program to skip empty cells.
Exercise: Explain how you might change the design to permit more than one aggregation of a single column. How would you communicate this to stats with arguments? What logic needs to be changed in stats? Do you need to change the aggregator programs at all?
Exercise: Explain how you might change the design to permit two-column aggregators, for example, integration. How would you communicate this to stats with arguments? What logic needs to be changed in stats?
Exercise: Choose one of these two design changes, and implement it. If you choose the two-column aggregators: the formula for the integral is the sum, over all records, of the following expression (after the data is sorted by the first column), where prev1 is the first-column value of the previous record:
($1 - prev1) * $2
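Written as a formula, the sum above can be read as follows. The notation is assumed here and is not part of the practical: the sorted rows are denoted (x_i, y_i) for i = 1, ..., n, and the first row is taken to contribute nothing because it has no predecessor.

```latex
\int y \, dx \;\approx\; \sum_{i=2}^{n} (x_i - x_{i-1}) \, y_i
```

Each term is the area of a rectangle whose width is the gap between consecutive first-column values and whose height is the second-column value at the right-hand end of that gap.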