A Productive Desktop Environment for Scientists and Engineers - Part I

From assela Pathirana


Introduction

Over the years of using computers for my work, I have settled on certain practices that make my tasks a bit easier, whether I am analyzing data, rigging up a crude model, editing computer programs, writing a paper, creating a presentation or trying to keep my things organized. (Here, 'things' are strictly limited to 'things' stored on computer storage media. Anyone who has been to my office knows the situation with the other 'things'.) In these pages I attempt to go through my 'tools of the trade' with the hope that someone in a similar situation will find them useful.


The little geeky program: Cygwin

Excel or excel: Data processing without getting your hands dirty


Note: This section makes use of basic UNIX utilities like awk, sed and the bash shell. You may want to go through the following resources briefly before continuing with this section.

  1. Bash guide
  2. GNU awk (gawk) guide
  3. Sed and sed FAQ.

Spreadsheets are useful programs for performing small analyses and testing ideas that involve a bit of computation. The problem comes when one wants to handle a dataset that has about one million rows. (One never has to handle files THAT big? How long is an hourly rainfall series covering 100 years? 876,600 values. That's pretty close to one million.) Many spreadsheets have strict limitations on the number of rows and columns they can handle. (Microsoft Excel can have 65,536 rows and 256 columns in one worksheet.)

Of course it is possible to break your data into parts of, say, 60,000 rows and process them in different worksheets. But, look: we are using a computer, and this darn thing is supposed to save our time!

There are two extremely useful Unix utilities for processing long text files: awk and sed.

In the following brief introductions we use a real-world dataset to demonstrate some of the possibilities of these programs. See the section on how to download and prepare the data (below) before proceeding.
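sed gets less attention than awk in what follows, so here is a minimal sketch of the kind of quick stream edits it is good at. The output file name is arbitrary, and the \t escape assumes GNU sed (the version shipped with Cygwin and most Linux distributions):

# Convert the whitespace-separated records to comma-separated values:
sed 's/[ \t][ \t]*/,/g' Sample-rainfall > Sample-rainfall.csv

# Print only the records of September 1993:
sed -n '/^1993-09/p' Sample-rainfall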

In comes awk

We shall calculate the sum of the rainfall field, all three hundred thousand odd values.

cat Sample-rainfall | awk '{sum=sum+$3}END{print sum}'

Now, let's look at this line a bit closely.

  1. The cat command just lists the file. (Try doing a cat Sample-rainfall and see what happens.)
  2. | is the pipe operator, which passes the output of the command on its left as input to the command on its right.
  3. Then comes awk. It is usual to surround the awk program (if you can call those few characters that!) with single quotes. The first braces ({}) within the quotes enclose the main part of the program, which is executed once for each input line (record), i.e. once for the first line 1993-09-30 13:20 0.0, once for the second, 1993-09-30 13:30 0.0, and so on. The $ notation is special in awk: it refers to a chunk (field) of the current record. For example, $1 is the first field, $2 the second and so on. The default field separator in awk is whitespace (i.e. spaces or tabs). Therefore, $3 in our program above is the third field, namely the rainfall value. (The first field is the date and the second is the time in our sample file.)
    • So what happens here is that the value of the third field (i.e. the rainfall in this case) at each record is added to the previous value of the variable sum (which has an initial value of 0).
    • The {} block after the keyword END is executed only after all the input lines have been processed. It simply prints the final value of sum.
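The same program can be spread over several lines with comments, which makes its structure easier to see. This is just a restatement of the one-liner above; note that awk happily reads a file named on its command line, so the cat is not strictly necessary:

awk '
  {                   # the main block runs once for every record
    sum = sum + $3    # add the third field (rainfall) to sum
  }
  END {               # the END block runs after the last record
    print sum         # print the grand total
  }' Sample-rainfall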

A few more examples

  • Total rainfall of year 1995:
     cat Sample-rainfall | awk '{if( $1 ~ /1995/ )sum=sum+$3}END{print sum}' 
  • The average rainfall of the months of September:
     cat Sample-rainfall | awk '{if( $1 ~ /-09-/ ){sum=sum+$3;ct=ct+1}}END{print sum/ct*30*24*6}'
    (sum/ct is the average rainfall per record; multiplying by 30*24*6, the number of ten-minute records in a 30-day month, scales this to a monthly total.)
  • Maximum recorded rainfall value
     cat Sample-rainfall | awk 'BEGIN{max=-9999;maxrec=""}{if($3>max){max=$3;maxrec=$0}}END{print maxrec}'
    gives
    1997-10-25 14:18        20.5
    as answer.
  • The amount of day-time rainfall, night-time rainfall and their ratio:
     cat Sample-rainfall | awk -F'[ \t:]' \
       '{time=$2*60+$3; if(time>360 && time<1080) {day=day+$4} else {night=night+$4}}
        END{print day, night, day/night}'
    results in something like

1624 1377.9 1.17861

meaning the day rainfall amount was 1.18 times the night rainfall amount. Now, here we perform some additional tricks. First, the

-F'[ \t:]'

option to awk asks the program to treat space (' '), tab (\t) and colon (:) as field separators. As a result, there are now four fields in each record, instead of the three in the previous examples (the time field is broken at the colon into two). Then, during processing, we compute the number of minutes since midnight with

time=$2*60+$3

and use it to decide whether the particular record falls in the day time (between 360 and 1080 minutes since midnight, i.e. 06:00 to 18:00) or the night time.
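A quick way to convince yourself how -F'[ \t:]' splits a record is to feed awk a single sample line and print the fields. (With a regular-expression field separator, runs of blanks are not collapsed into one, so the record is written here with single tabs, as in the data file.)

printf '1993-09-30\t13:20\t0.0\n' | awk -F'[ \t:]' \
  '{print "fields:", NF; print $1, $2, $3, $4}'

This should print fields: 4 followed by 1993-09-30 13 20 0.0.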


Note: The following example uses time functions (such as mktime) that are unique to GNU awk (gawk). Today the awk version present in Cygwin and most Linux distributions is gawk. See this article for more information on these functions.

  • Finding the missing data in the time-series (now this is a slightly more complicated example):

cat Sample-rainfall | awk -F'[ \t:-]' \
  '{ct++; time=mktime($1" "$2" "$3" "$4" "$5" 00");
    if(NR>1 && time-oldtime>660)
      {print $0, oldfield, time-oldtime; sum=sum+(time-oldtime-600)};
    oldtime=time; oldfield=$0}
   END{print sum/3600/24}' \
  | sort -k7 -n > tmp

    • Here's how it works:
      1. We split records at space, tab, colon (:) and hyphen (-), so that each record results in six fields. For example, the record

1993-09-30 13:20        0.0

is split into the six fields

$1=1993  $2=09  $3=30  $4=13  $5=20  $6=0.0

      2. The mktime function creates a number representing the number of seconds since 1970-01-01 00:00:00 UTC. It takes a single string of the form "YYYY MM DD HH MM SS"; a typical call is

mktime("1993 09 30 13 20 00")

(there is a quick check of this function after this list).

      3. At each row, we save the current time as oldtime, to be used in the next row.
      4. At each row, we compare the current time with the old time, and if the difference is larger than 660 seconds (11 minutes), we print the current line, the previous line and the difference of time in seconds.
      5. We add the number of missing seconds to the variable sum. (One regular interval of 600 seconds is subtracted first, since that much time passes between records even without a gap.)
      6. After the end of the rows, we print sum, converted from seconds to days.
      7. Finally, instead of just printing out the results, we
        1. filter the results through sort, so that the largest gaps come towards the end, and
        2. redirect the output to a file, so that we can examine the result leisurely.
    • You may find that this time-series has 135 days' worth of data missing, and that the largest gap is 3283440 seconds (38 days), between 1997-06-24 11:05 and 1997-08-01 11:09.
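As a quick check of mktime (referenced in step 2 above), you can compute the largest gap directly in a BEGIN block. The two timestamps are the endpoints of the gap found above; note that this needs gawk, and that mktime interprets the timestamps in your local time zone:

awk 'BEGIN{
  t1 = mktime("1997 06 24 11 05 00")   # start of the largest gap
  t2 = mktime("1997 08 01 11 09 00")   # end of the largest gap
  print t2 - t1, "seconds, or", (t2-t1)/3600/24, "days"
}'

This should print 3283440 seconds, or about 38 days.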

This is a good point to mention that writing your scripts to files is a good practice that saves time.

Writing your scripts to files

Short scripts involving a few words (e.g. the first example in the section on awk) can be written on-the-fly at the command prompt, like:

$ cat Sample-rainfall | awk '{sum=sum+$3}END{print sum}'
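Longer programs are easier to keep in a file and run with awk's -f option. Here is a minimal sketch of that practice (the file name sum.awk is arbitrary):

# Save the program to a file, say sum.awk:
cat > sum.awk <<'EOF'
# sum the third (rainfall) field over all records
{ sum = sum + $3 }
END { print sum }
EOF

# Run it against the data file:
awk -f sum.awk Sample-rainfall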

You need a good web browser

Simple! Just download the Firefox browser and be happy ever after :-) (at least for the foreseeable future!). I am not just trying to be different from the 'masses' here. Simply trust me on this one: install it, load a web page and press 'Ctrl+T'. Tabbed browsing is one of those simple improvements that endear a tool to its user. I have been using this feature (first in Mozilla and then in Mozilla Firefox) since 2001, and I cannot imagine browsing the internet without it.

Spell checker for your browser

When was the last time you opened up one of those large text areas on a web page? (A good example is a webmail service like Gmail or Yahoo! Mail.) Before hitting the 'Go' button, it is better to correct the misspelled words. This can be done by copying and pasting the text into your word processor (e.g. Microsoft Word), checking the spelling, and pasting it back. But that is a lot of work! It is better to have a built-in spell checker in the browser, always at your service.

Spellbound is an extension for Mozilla Firefox that does just that. Follow the link below to learn how it works and how to install it.



Download and prepare test data

We use a rainfall dataset covering 1993-09-30 13:20 to 1999-12-08 10:39, downloaded from the National Technical University of Athens, Greece. It has some three hundred thousand values, recorded at roughly ten-minute intervals.


Note: The wget command can download anything on the web without using a browser, right from the command line. If you don't have the wget command, install the package using Cygwin setup.

  • Download the data.
  • Expand the compressed file. A text file named 'Sample-rainfall' is the result.

Here are the commands needed to do that:

wget http://assela.pathirana.net/images/f/fe/Sample-rainfall.bz2
bunzip2 Sample-rainfall.bz2 

A sample of the content of the expanded file can be examined easily with

head -n5 Sample-rainfall #this command gives the first five lines of the file

or

tail -n10 Sample-rainfall #the last ten lines

The former command should output something like

1993-09-30 13:20        0.0
1993-09-30 13:30        0.0
1993-09-30 13:51        0.0
1993-09-30 14:01        0.0
1993-09-30 14:11        0.0
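As a quick sanity check, you can also count the number of records (the exact figure depends on the file you download):

wc -l Sample-rainfall #the number of lines (records) in the file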