A Productive Desktop Environment for Scientists and Engineers - Part I

From assela Pathirana

(THIS IS STILL NOT IN A FORM USEFUL TO ANYBODY)

Introduction

Over my years of computer use for work, I have settled on certain practices that make my tasks a bit easier, whether I am analyzing data, rigging up a crude model, editing computer programs, writing a paper, creating a presentation or trying to keep my things organized. (Here, 'things' are strictly limited to 'things' stored on computer storage media. Anyone who has been to my office knows the situation with the other 'things'.) In these pages I attempt to go through my 'tools of the trade', in the hope that someone in a similar situation will find them useful.


The little geeky program: Cygwin

Excel or excel: Data processing without getting your hands dirty


This section makes use of basic UNIX utilities like awk, sed and bash shell. You may want to go through the following resources briefly, before continuing with this section.

  1. Bash guide
  2. GNU awk (gawk) guide
  3. Sed and sed FAQ.

Spreadsheets are useful programs for performing small analyses and testing ideas that involve a bit of computation. The problem comes when one wants to handle a dataset that has about one million rows. (One never has to handle files THAT big? How long is an hourly rainfall series covering 100 years? 876,600 values. That's pretty close to one million.) Many spreadsheets have strict limits on the number of rows and columns they can handle. (Microsoft Excel can have 65,536 rows and 256 columns in one worksheet.)

Of course it is possible to break your data into parts, say of 60,000 rows each, and process them in different worksheets. But, look: we are using a computer, and this darn thing is supposed to save our time!

There are two extremely useful Unix utilities for processing long text files: awk and sed.

In the following brief introductions we use a real-world dataset to demonstrate some of the possibilities of those programs. See this section on how to download and prepare data, before proceeding.

In comes awk

We shall calculate the sum of the rainfall field: the whole three hundred thousand-odd values.

cat Sample-rainfall |awk '{sum=sum+$3}END{print sum}'

Now, let's look at this line a bit closely.

  1. The cat command simply prints the file's contents. (Try doing a cat Sample-rainfall and see what happens.)
  2. | is the pipe operator, which passes the output of the command on its left as input to the command on its right.
  3. Then comes awk. It is usual to surround the awk program (if you can call those few characters that!) with single quotes. The first braces ({}) within the quotes indicate the main part of the program, which is executed once for each input line (record). (i.e. once for the first line 1993-09-30 13:20 0.0, once for the second: 1993-09-30 13:30 0.0, and so on.) The $ notation is special in awk: it refers to a chunk (field) of the current record. For example, $1 is the first field, $2 the second, and so on. The default field separator in awk is whitespace (i.e. spaces or tabs). Therefore, $3 in our program above is the third field, namely the rainfall value. (The first field is the date and the second is the time in our sample file.)
    • So what happens here is that the value of the third field (i.e. rainfall in this case) of each record is added to the previous value of the variable sum (which has an initial value of 0).
    • The {} block after the keyword END is executed only after all the input lines have been processed. It simply prints the final value of sum.
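A note on style: awk can read a file directly, so the cat is not strictly needed, and sum=sum+$3 can be shortened to sum+=$3. Here is a self-contained sketch of the same idea, using a few made-up lines in place of the real Sample-rainfall file:

```shell
# Create a tiny stand-in for Sample-rainfall (made-up values)
printf '1993-09-30 13:20\t0.5\n1993-09-30 13:30\t1.0\n1993-09-30 13:51\t0.0\n' > tiny-rainfall

# awk reads the file directly; += is shorthand for sum = sum + $3
awk '{sum += $3} END {print sum}' tiny-rainfall   # prints 1.5
```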

A few more examples

  • Total rainfall of year 1995:
     cat Sample-rainfall | awk '{if( $1 ~ /1995/ )sum=sum+$3}END{print sum}' 
  • The average monthly rainfall for September (the per-record mean, scaled up by 6 records per hour × 24 hours × 30 days, since the data are roughly 10-minute values):
     cat Sample-rainfall | awk '{if( $1 ~ /-09-/ ){sum=sum+$3;ct=ct+1}}END{print sum/ct*30*24*6}'
  • Maximum recorded rainfall value:
     cat Sample-rainfall | awk '$3>max{max=$3}END{print max}' 

You need a good web browser

Simple! Just download the Firefox browser and be happy ever after :-) (at least for the foreseeable future!). I am not just trying to be different from the 'masses' here. Simply trust me on this one: install it, load a web page and press 'Ctrl+T'. Tabbed browsing is one of those simple improvements that endear a tool to its user. I have been using this feature (first in Mozilla and then in wikipedia:Mozilla Firefox) since 2001 and I cannot imagine browsing the internet without it.

Spell checker for your browser

When was the last time you opened up one of those large textareas on a web page? (A good example is using a webmail service like Gmail or [[wikipedia:Yahoo! Mail|]].) Before pressing the 'Go' button, it is better to correct those misspelled words. This can be done by copying the text into your word processor (e.g. Microsoft Word), checking the spelling and pasting it back. But that is a lot of work! It is better to have a built-in spell checker in the browser, always at your service.

Spellbound is an extension for Mozilla Firefox that does just that. Follow the link below to learn how it works and how to install it.



Download and prepare test data

We use a rainfall dataset covering 1993-09-30 13:20 to 1999-12-08 10:39, downloaded from the National Technical University of Athens, Greece. It has some three hundred thousand records.


The wget command can download anything on the web without using a browser, straight from the command line. If you don't have the wget command, install the package using Cygwin setup.

  • Download the data.
  • Expand the compressed file. A text file named 'Sample-rainfall' is the result.

Here are the commands needed to do that

wget http://assela.pathirana.net/images/f/fe/Sample-rainfall.bz2
bunzip2 Sample-rainfall.bz2 

A sample of the content of the expanded file can be checked easily by

head -n5 Sample-rainfall #this command gives the first five lines of the file

or

tail -n10 Sample-rainfall #the last ten lines
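head and tail also combine through a pipe to extract an arbitrary slice of lines, which is handy for peeking at the middle of a long file. A self-contained sketch on a made-up five-line file:

```shell
printf 'a\nb\nc\nd\ne\n' > demo.txt

# Keep the first four lines, then the last two of those: lines 3 and 4
head -n4 demo.txt | tail -n2   # prints c then d
```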

The former command should output something like

1993-09-30 13:20        0.0
1993-09-30 13:30        0.0
1993-09-30 13:51        0.0
1993-09-30 14:01        0.0
1993-09-30 14:11        0.0