A Productive Desktop Environment for Scientists and Engineers - Part I

From assela Pathirana



Completed Chapters

  • Part I : Processing simple datasets, shell scripting, awk and sed, spatial data.
  • Part II: Map and other geographical/technical drawings.
  • Part III: Editors, web browsers.
  • Part IV: Create a Desktop database for storing everything!

Introduction

Over years of using computers for my work, I have settled on certain practices that make my tasks a bit easier, whether I am analyzing data, rigging up a crude model, editing computer programs, writing a paper, creating a presentation or trying to keep my things organized. (Here, 'things' are strictly limited to 'things' that are stored on computer storage media. Anyone who has been to my office knows the situation with other 'things'.) In these pages I go through my 'tools of the trade' in the hope that someone in a similar situation will find them useful.


The little geeky program: Cygwin

This section does not cover any interesting tools that can directly be used to solve your computing problems. However, the free program Cygwin is necessary for us to proceed with the 'meaty' stuff.

What is Cygwin

It is expected that you are a little bit familiar with Unix or Linux systems. If not, please read this tutorial first.

For Windows users: Cygwin is the easiest way to test the waters of UNIX while in the safety of your Windows environment. You can learn a bit of UNIX easily or, better still, use it to do certain useful things more easily than with normal Windows software. See this article for a couple of example situations.

(For Linux or Unix users: Don't bother.)

Installation

Installing Cygwin is rather easy, mainly due to the excellent installation program. You need a working internet connection. First visit the Cygwin website and download the setup.exe program. This is a tiny program under 300 kB. The Cygwin help pages have extensive information on how to use this little program to install Cygwin. Only some important points are given below. Start by double-clicking on the setup.exe program.

  1. Select 'Download and Install from the Internet'
  2. Go mostly with the defaults. Particularly
    • Default text file type should be 'UNIX'.
  3. Select packages:
    • 'Base' and some others are selected automatically; don't deselect them. Additionally, select 'Devel'. Don't bother with the others at this stage, for it is rather easy to add packages later as you need them.

When file/folder names have spaces, we should use the backslash (\) to 'escape' the spaces in UNIX.
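
As a small illustration of the note above (the directory name is made up for this sketch, and a scratch directory is used so nothing real is touched), a backslash before each space and quotes around the whole name both lead to the same place:

```shell
# Hypothetical directory name containing spaces, created in a scratch area.
workdir=$(mktemp -d)
mkdir "$workdir/My Data Files"

# 1) Escape each space with a backslash ...
cd "$workdir"/My\ Data\ Files && here1=$(pwd)

# 2) ... or put the whole name in quotes.
cd "$workdir/My Data Files" && here2=$(pwd)

echo "$here1"
echo "$here2"

# Clean up the scratch directory.
cd / && rm -r "$workdir"
```

Both echo lines should show the same path, ending in "My Data Files".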

After the installation is finished do the following:

  1. Copy the setup.exe file to your installation directory (usually C:\Cygwin). We can find it there easily, if we need to download additional packages later.
  2. Although it is possible to use Cygwin within the Windows 'cmd.exe' (the 'dos prompt'), it is much easier and more fun to use one of the X-Windows based terminals. My choice is xterm. However, with modern high resolution displays, xterm's default font size can be hard to read. To overcome this, one can start xterm with the following command:
    xterm.exe -fn '*18*'
    It sets the font size to 'one of those fonts of 18point size'.
  3. It is possible to 'map' the whole windows file system to be accessed within Cygwin's Unix shell.
    • The following commands will mount the drive c: within Cygwin.

mkdir /c
mount c: /c

    • Then it is possible to navigate the Windows file system on c: within the Cygwin shell. For example:
      cd /c/Documents\ and\ Settings/yourstruely/Desktop/
      will change directory to your desktop. (Replace yourstruely with your user name, of course.)
  4. Now set some Windows environment variables to make our lives with Cygwin easier. (Control Panel->System->Advanced->Environment variables)
    • Append the following to the Path variable
      C:\cygwin\usr\X11R6\bin;C:\cygwin\bin
    • Add a new variable called DISPLAY and set the value as follows
      localhost:0.0
  5. There is a file named startxwin.bat in the folder C:\cygwin\usr\X11R6\bin\. Copy it to C:\cygwin (the Cygwin root). Then rename the new file to xterm.bat. Edit the line
    %RUN% xterm -e   /usr/bin/bash -l
    to read
    %RUN% xterm -fn '*18*' -e   /usr/bin/bash -l
  6. Create a Windows shortcut targeting C:\cygwin\xterm.bat
  7. Now double-clicking on the shortcut should open a Cygwin window with large fonts.
    X-terminal with large fonts, on windows desktop.
Note
In later sections we shall use the Cygwin shell opened using this method.

Does the DOS shell bother you?

Each time you open Cygwin, it will create a DOS shell that is practically useless. But you may notice that if you close it (by clicking on the cross at the top right corner) your Cygwin/X window will also be gone!

If this DOS shell bothers you, there is a way around it. Download this file and expand it to get the file called xterm.vbs. Copy this to your Cygwin root (normally this is C:\cygwin). Then create a shortcut to the file on your desktop. To start Cygwin/X, click on this shortcut and you will not be bothered by a DOS shell!

A good text editor

There is more information on text editing and editors here.

Try not to use Windows Notepad to edit data or script files you use in Cygwin. Instead use NEdit (the Nirvana editor) that comes with Cygwin. If it is not there, simply install it following the section on adding programs below.

Nedit can be called from the cygwin bash shell:
$ nedit &

Now what?

There are hundreds of very useful things that one can do with a Unix shell. Now that it is there, you may want to read this article, which lists some uses of Cygwin for useful activities like simple data processing.


Adding programs

Cygwin has a wealth of packages that can be useful in different situations. Installing the whole darn thing is not the way to go. It can certainly be done, but it is unnecessary and downloading and installing them all can be an immense waste of time. Instead, it is possible to invoke setup.exe later (remember: we copied it to the c:\cygwin folder) and install any additional packages as they become necessary.

Excel or excel: Data processing without getting your hands dirty

This section makes use of basic UNIX utilities like awk, sed and bash shell. You may want to go through the following resources briefly, before continuing with this section.

  1. Bash guide
  2. GNU awk (gawk) guide
  3. Sed and sed FAQ.

Spreadsheets are useful programs for performing small analyses and testing ideas that involve a bit of computation. The problem comes when one wants to handle a dataset that has about one million rows. (One never has to handle files THAT big? How long is an hourly rainfall series covering 100 years? 876,600 values. That's pretty close to one million.) Many spreadsheets have strict limits on the number of rows and columns they can handle. (Microsoft Excel, up to the 2003 version, can have 65,536 rows and 256 columns in one worksheet.)

Of course it is possible to break your data into parts of, say, 60,000 rows and process them in different worksheets. But, look. We are using a computer and this darn thing is supposed to save our time!

There are two extremely useful Unix utilities for processing long text files: awk and sed.

In the following brief introductions we use a real-world dataset to demonstrate some of the possibilities of those programs. See this section on how to download and prepare data, before proceeding.

Download and prepare test data

We use a rainfall dataset covering 1993-09-30 13:20 to 1999-12-08 10:39, downloaded from the National Technical University of Athens, Greece. The data is at a 10-minute time step, so the file has some 300,000 records. That's a good bit of data for us to work on. First, do the following:

The wget command can download anything on the web without using a browser, just from the command line. If you don't have the wget command, install the package using Cygwin setup.

  • Download data.
  • Expand the compressed file. A text file named 'Sample-rainfall' is the result.

Here are the commands needed to do that:

wget http://assela.pathirana.net/images/f/fe/Sample-rainfall.bz2
bunzip2 Sample-rainfall.bz2 

A sample of the content of the expanded file can be checked easily by

head -n5 Sample-rainfall #this command gives the first five lines of the file

or

tail -n10 Sample-rainfall #the last ten lines

The former command should output something like

1993-09-30 13:20        0.0
1993-09-30 13:30        0.0
1993-09-30 13:51        0.0
1993-09-30 14:01        0.0
1993-09-30 14:11        0.0


In comes awk

We shall calculate the sum of rainfall field, the whole three hundred thousand odd values.

cat Sample-rainfall |awk '{sum=sum+$3}END{print sum}'

Now, let's look at this line a bit closely.

  1. cat command just lists the file. (Try doing a cat Sample-rainfall and see what happens.)
  2. | is the pipe, which passes the output of the command on its left-hand side as input to the command on its right-hand side.
  3. Then comes awk. It is usual to surround the awk program (if you can call those few characters that!) with single quotes. The first braces ({}) within the quotes indicate the main part of the program, which is executed once for each input line (record). (i.e. once for the first line 1993-09-30 13:20 0.0 and once for the second: 1993-09-30 13:30 0.0, and so on ...) The $ notation is special in awk, which indicates the chunk (field) of input of the current record. For example, $1 is the first chunk, $2, the second and so on. The default field-separator in awk is whitespace (i.e. spaces or tabs). Therefore, $3 in our program above is the third field, namely, the rainfall value. (First field is the date and second is time in our sample file.)
    • So what happens here is the value of the third field (i.e. rainfall in this case) at each record is added to the previous value of the variable sum (which has initial value of 0).
    • the '{}' after the keyword END is executed only after all the input lines have been processed. It simply prints the final value of sum.
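
To watch the accumulation happen, here is a minimal sketch on three made-up records (they are not taken from the real file) that prints the running value of sum after every record and the final total from the END block:

```shell
# Three made-up records in the same layout as Sample-rainfall
# (date, time, rainfall), used only for this illustration.
sample='1993-09-30 13:20 0.0
1993-09-30 13:30 0.5
1993-09-30 13:40 1.0'

# Main block: runs once per record and prints record number, value, running sum.
# END block: runs once, after the last record.
result=$(printf '%s\n' "$sample" |
  awk '{sum = sum + $3; print NR, $3, sum} END {print "total", sum}')
echo "$result"
```

The last line of the output is "total 1.5", the sum of the three rainfall values.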

A few more examples

  • Total rainfall of year 1995:
 cat Sample-rainfall | awk '{if( $1 ~ /1995/ )sum=sum+$3}END{print sum}' 
  • The average September rainfall, i.e. the mean value per 10-minute record scaled up to a 30-day month (30*24*6 records):
 cat Sample-rainfall | awk '{if( $1 ~ /-09-/ ){sum=sum+$3;ct=ct+1}}END{print sum/ct*30*24*6}'
  • Maximum recorded rainfall value
 cat Sample-rainfall | awk 'BEGIN{max=-9999;maxrec=""}{if($3>max){max=$3;maxrec=$0}}END{print maxrec}'
gives
1997-10-25 14:18        20.5
as answer.
  • The amount of day-time rainfalls, night-time rainfall and their ratio.
cat Sample-rainfall | awk -F'[ \t:]' \
'{time=$2*60+$3;if(time>360 && time<1080) {day=day+$4}else{night=night+$4}}\
END{print day, night,day/night}'
will result in something like
1624 1377.9 1.17861
, meaning the day-time rainfall amount was 1.18 times the night-time amount. Here we perform some additional tricks. First, the
-F'[ \t:]'
option asks awk to treat space (' '), tab (\t) and colon (:) as field separators. As a result, there are now four fields in each record instead of the three of the previous examples (the time field is broken at the colon into two). Then, during processing, we compute the number of minutes since midnight by
time=$2*60+$3
and use it to decide whether the record falls in day time (between 360 and 1080, i.e. 06:00 to 18:00) or night time.
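
The field splitting is easy to check on a handful of records. The sketch below uses four made-up records (two in day time, two at night) rather than the real file; it first counts the fields awk sees with this separator and then repeats the day/night sum:

```shell
# Made-up records: two fall between 06:00 and 18:00, two do not.
sample='1995-01-01 05:50 1.0
1995-01-01 06:10 2.0
1995-01-01 17:50 3.0
1995-01-01 18:10 4.0'

# Splitting at space, tab and colon gives four fields per record:
# date, hour, minute, rainfall.
fields=$(printf '%s\n' "$sample" | awk -F'[ \t:]' 'NR == 1 {print NF}')

totals=$(printf '%s\n' "$sample" | awk -F'[ \t:]' '
  { time = $2 * 60 + $3                 # minutes since midnight
    if (time > 360 && time < 1080) day = day + $4
    else night = night + $4 }
  END { print day, night }')

echo "fields per record: $fields"
echo "day and night totals: $totals"
```

With these four records the field count is 4 and both totals are 5.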

In the next example, we use the date functions, which are unique to GNU awk (or gawk). Today the awk version present in Cygwin and most Linux distributions is gawk. See this article for more information on date functions.

  • Finding the missing data in the time series (this is a slightly more complicated example)
cat Sample-rainfall | awk -F'[ \t:-]' \
'{ct++;time=mktime($1" "$2" "$3" "$4" "$5" 00");\
if(NR>1&&time-oldtime>660)\
{print $0  oldfield,time-oldtime; sum=sum+(time-oldtime-600)};\
oldtime=time; oldfield=$0;}END{print sum/3600/24}' \
|sort -k6 -n > tmp
    • Here's how it works:
      1. We split records at space, tab, colon (:) and hyphen (-), so that each record results in six fields. For example the record
1993-09-30 13:20        0.0
is split as
1993 09 30 13 20 0.0
F1   F2 F3 F4 F5 F6
      2. The mktime function returns the number of seconds since 1970-01-01 00:00:00 UTC, based on the datespec given in string form as its argument. Typical usage of the function is
mktime("YYYY MM DD HH MM SS")
mktime("1993 09 30 13 20 00")
      3. At each row, we save the current time as oldtime, to be used in the next row.
      4. At each row, we compare the current time with the old time and, if the difference is larger than 660 seconds (11 minutes), we print the current line, the previous line and the difference in time in seconds.
      5. We add the number of missing seconds to the variable sum.
      6. After the last row, we print sum, converted from seconds to days.
      7. Finally, instead of just printing out the results, we
        1. filter the results through sort, so that the largest gaps come towards the end.
        2. redirect the output to a file, so that we can examine the result at leisure.
    • You may find that this time series has 135 days' worth of data missing, and that the largest gap is 3283440 seconds (38 days), between 1997-06-24 11:05 and 1997-08-01 11:09.

Perhaps this is a good time to mention that writing scripts to files can be a good practice to save your time. A parting word here: This is just a cursory introduction; there's no time or necessity to cover the elements of the language. For that have a look at the GNU awk Users Guide. Also try Google for good resources on awk.

Writing your scripts to files

Short scripts involving a few words (e.g. First example in the section on awk) can be written on-the-fly at the command prompt like:

$  cat Sample-rainfall | awk '{if( $1 ~ /1995/ )sum=sum+$3}END{print sum}' 
379.8

But, when it comes to a bit more involved examples like the one in the last example of the same section, it is best to develop your script in a file. That gives the additional advantage of being able to keep it saved for future use.

To do that simply open a text editor. (for the time-being use notepad, but later you will learn about a better editor, if you are up to it.) You can call notepad directly from cygwin!

notepad myfirstscript.bash &
The above command will open notepad with the blank file myfirstscript.bash and return control to cygwin without waiting for notepad to finish (& does this). The .bash part does not signify anything special, but I nevertheless use it since it 1) allows me to identify my scripts in the Windows file manager, 2) stops notepad from stupidly adding a .txt extension uninvited, and 3) makes it possible to assign a default editor (e.g. wordpad or notepad) in Windows to edit the files when we click on them. Then, as the first line of the file, write
#!/bin/bash

This is not so meaningful at the moment and will be explained in later sections.

Then simply write your script in the file. The content of the file would be (for the last example of the awk section)

#!/bin/bash
cat Sample-rainfall | awk -F'[ \t:-]' '{ct++;time=mktime($1" "$2" "$3" "$4" "$5" 00");\
if(NR>1&&time-oldtime>660)\
{print $0  oldfield,time-oldtime; sum=sum+(time-oldtime-600)};\
oldtime=time; oldfield=$0;}END{print sum/3600/24}' |sort -k6 -n > tmp

For information on the chmod command and file permissions in general read this article.

Then save the file, but don't exit notepad. At the cygwin prompt, type

chmod u+x myfirstscript.bash

Then, to run the script, simply execute the file by calling it by name:

./myfirstscript.bash
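
The whole edit, chmod, run cycle can also be tried without an editor. The following sketch works in a scratch directory with a three-line made-up data file standing in for Sample-rainfall; the script name total1995.bash is hypothetical, and the sketch assumes bash is installed as /bin/bash:

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Made-up data standing in for Sample-rainfall.
printf '1995-01-01 00:00\t1.5\n1995-01-01 00:10\t2.5\n1996-01-01 00:00\t9.0\n' > Sample-rainfall

# Write the script to a file (the quoted EOF keeps $1 and $3 literal).
cat > total1995.bash <<'EOF'
#!/bin/bash
awk '{if ($1 ~ /1995/) sum = sum + $3} END {print sum}' Sample-rainfall
EOF

chmod u+x total1995.bash       # make it executable
answer=$(./total1995.bash)     # run it by name
echo "$answer"

# Clean up the scratch directory.
cd / && rm -r "$workdir"
```

Only the two 1995 records are added, so the script prints 4.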

Let's BASH

Read this article for details on UNIX shells in general.

Any operating system has an outer layer, that is visible to the user, which acts as the agent between the internals of the OS and the user. This is called a shell. UNIX has several shells. We restrict our discussion to perhaps the most popular one among these, the Bourne-again shell.

Our objective here is to learn how to run a series of commands conveniently. Loosely, a file with such a series of commands is called a script and, when we deal with shell commands, we call it a shell script. In this section we use the simplest of BASH shell scripts to accomplish some of our tasks.

First example

In fact, we have already done our first scripting in a previous section. But let's examine a different shell script, a bit more closely.

#!/bin/bash
cd /tmp
ls 
This script does two things: First, change directory to /tmp and then list the files. So the files that are listed will be those in /tmp (not the files in the place where you ran the script).

Here's another example, which uses some concepts we have learned so far, plus several new ones:

#!/bin/bash
#program extract_month.bash
year="1995"
month="04"
outfile="DATA$year-${month}.txt"
echo "We will extract the time series for $year $month ..."
cat Sample-rainfall | awk -v yr=$year -v mon=$month -F'[-]' \
'{if($1==yr && $2==mon){print $0}}' > $outfile

There are a number of important things to notice here:

  • year="1995" (or var=value in general) is how a value is assigned in BASH. It says "Create a variable year and set the value of the variable to 1995". There must be no spaces around the = sign. Quotes are not a part of the value and are optional here. (Nevertheless they are important in some situations, so it is a good idea to write them anyway.)
  • $year or ${year} ($var or ${var} in general) is how we refer to the value of a variable in bash. In most circumstances just $var is adequate, but the strictly accurate form is ${var}, and it is mandatory in some situations.
  • " some string $var another string" (variables inside strings quoted in double quotes): in such situations bash substitutes the variable with its value. (However, single quotes, e.g. 'some string $var another string', will not work!)
  • We use the awk option -v, the purpose of which is to pass variables into awk.
    • for example: echo 'Hello!' |awk -v obj=world '{print $1, obj}' would result in Hello! world.
    • Here we pass the values of the bash variables year and month into awk, where they are known as yr and mon.
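
These points can be tried directly at the prompt. The following is only a small sketch; the variable names double, single, braces and greeting are made up for the illustration:

```shell
year="1995"
month="04"

double="Data for $year"        # double quotes: $year is replaced by its value
single='Data for $year'        # single quotes: the text is kept literally
braces="DATA${year}_$month"    # braces mark where the name ends; without them
                               # bash would look for a variable called year_

echo "$double"
echo "$single"
echo "$braces"

# Passing a shell value into awk with -v:
greeting=$(echo 'Hello!' | awk -v obj="$year" '{print $1, obj}')
echo "$greeting"
```

The four lines printed are "Data for 1995", "Data for $year", "DATA1995_04" and "Hello! 1995".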

Go west, young man!

There are a number of excellent resources, most of them free and online, for learning the art of BASH programming. Now's a good time to take a detour and try some of them out. In particular, there are two very good resources at The Linux Documentation Project: Bash Guide for Beginners by Machtelt Garrels and Advanced Bash-Scripting Guide by Mendel Cooper. If you want a printed book, Classic Shell Scripting (ISBN:0596005954) by Arnold Robbins and Nelson H.F. Beebe may be a good introductory book.

Do not plan to hone your BASH programming skills solely from these pages. You will definitely need supplementary material to make good progress.

Calculating monthly totals of a given year

This time we will really push it! We will write three small scripts that will together accomplish a task, namely calculating the total monthly rainfall of a given year, with indications of the 'quality' of those totals. In detail, we want to:

  1. Select a year from the time series
  2. For each month in that year, compute the total recorded rainfall
  3. Also compute number of records
  4. Comparing the number of records with the possible number of records of a full series (i.e. Time duration of the month divided by the time step of the series) we want to compute a percentage indicator of the quality of our estimate.

Part I

The first script computes the monthly total rainfall of a given year from our series. This time we want to be a bit general, so rather than hard coding things in the script, we write it so that the user (that's another big name for us!) can specify them when they run the script. Editing a file before each run is not a big deal, but this approach of generalizing a bit helps in some situations.

#!/bin/bash
#program monthly-total.bash (monthly totals)
if [ $# -ne 3 ]; then
  echo "$0 computes the monthly total rainfall of a given year "
  echo "from a text file containing rainfall records"
  echo "Usage: $0 <infile> <year> <tstep> "
  echo "<infile> should be a text file with rainfall records in the format:"
  echo "   YYYY-MM-DD HH24:MM VALUE"
  echo "<year> is the year to be computed" 
  echo "<tstep> is the timestep size of the series in minutes." 
  exit 1
fi
infile=$1
year=$2
tstep=$3
echo "   Doing file: $infile for year $year (timestep $tstep minutes)..." 1>&2
cat $infile|awk  -v yr=$year -v scale=$tstep 'BEGIN{FS="[\t -]"} $1 ~ "^"yr \
  {sum[$2]=sum[$2]+$5;ct[$2]++;mn[$2]=$2}  \
    END{for (item in mn){tmp=(ct[item]>0?scale/24/60*ct[item]:0);\
     printf "%s\t%5.3f\t%5.3f\n", item, sum[item], tmp}}' |sort -n
echo "   ... done" 1>&2

Let's go through the code

Always copy the values in positional parameters to other variables, without using them as they are. Command line arguments are not the only entities that create positional parameters -- they can be created by other commands.

  • The first if - fi block (fi is the bash word for 'end if'!) checks how the user has called the program (e.g. as ./monthly-total.bash, or ./monthly-total.bash foobar). In bash, all the command-line arguments (a fancy name for things like "foobar" in the second example in the previous sentence) are passed as $n (where n is an integer; so we may have $1, $2, ...). $# is a special variable that indicates the number of parameters passed, which can be zero or larger. In this example, we check whether exactly three parameters were passed; if not, the user is either testing the waters by calling ./monthly-total.bash (no arguments) or has given too few or too many arguments. Then, instead of continuing with the rest of the computations, the script just prints some helpful messages (echo commands). It says that we should give three arguments, namely the input file name, the year to be extracted and the time step of the series. Then the script exits.
  • When we call it as follows
./monthly-total.bash Sample-rainfall 1995 10
, we bypass the if trap and proceed further.
  • Then the command-line arguments (strictly, positional parameters) are copied to some descriptive variables (infile, year, tstep). This is not absolutely essential -- we could use $1, $2, etc. as they are throughout the script -- but it is still a good idea.

In the bash shell, it is possible to conveniently separate STDOUT and STDERR.

  • There is a special echo statement here:
echo "foo-bar" 1>&2
. There are two basic output streams in a computer program, namely standard-output (STDOUT) and standard-error (STDERR). Print commands (including echo) generally write to STDOUT. When something goes wrong, the error messages are written to STDERR. We denote STDOUT stream by 1 and STDERR by 2, and 1>&2 means, 'divert STDOUT to STDERR, for this time' -- in other words, output of echo is written to STDERR. The reason why we do this is: We write all the useful output values using print statements to STDOUT. So, if we write out diagnostic information also there, things get mixed-up!
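
To see the argument check and the separation of the two streams in action, here is a minimal sketch. The function report is hypothetical; it only mimics the structure of monthly-total.bash (check $#, copy the positional parameters, write diagnostics to STDERR and results to STDOUT):

```shell
# A hypothetical helper with the same structure as monthly-total.bash.
report() {
  if [ $# -ne 2 ]; then
    echo "Usage: report <name> <value>" 1>&2
    return 1
  fi
  name=$1
  value=$2
  echo "   Doing $name ..." 1>&2      # diagnostic, goes to STDERR
  echo "$name $value"                 # useful output, goes to STDOUT
}

# Only STDOUT is captured; the diagnostic is thrown away here.
captured=$(report rain 42 2>/dev/null)
echo "$captured"

# Called with the wrong number of arguments, the function fails.
if report only-one 2>/dev/null; then status=ok; else status=usage-error; fi
echo "$status"
```

Because the diagnostic went to STDERR, the captured text is just "rain 42"; the second call reports "usage-error".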

There is an important difference between arrays in, say, bash or the FORTRAN language and those of awk. Awk arrays are associative, meaning array indices are, in fact, strings. So things like the one below are quite legal:

echo|awk '{s["b"]="sh";\
 s["fo"]="ba"; \
 print s["b"],s["fo"]}'
  • Also, we do some serious awk processing here.
    • We do pattern matching:
$1 ~ "^"yr
matches the records that have field 1 starting with the value of yr (something like "1993" or "1995") and passes only those matching records to the heart of the awk processing within the next set of braces.
    • Awk has arrays: sum[i] in awk indicates the value of array sum at key i. It is not necessary to declare arrays before using them. We cheat our way through the task, exploiting these properties.
    • The for (var in array) construct can be used to go through the elements of the array, one by one. However, it should be noted that there is no guarantee about the order in which awk visits the elements of the array.
    • Because of the above issue we later pipe the awk output through sort command.
    • We use the printf function to create a bit neater output.
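
The same pattern, reduced to its essentials, can be tried on a few made-up records (not the real data). The sketch keeps one array element per month, keyed by the month string, and relies on sort to put the output in order:

```shell
# Made-up records (YYYY-MM-DD HH:MM VALUE); one belongs to another year.
sample='1995-02-01 00:00 1.0
1995-01-01 00:00 2.0
1995-02-01 00:10 3.0
1994-02-01 00:00 9.0'

monthly=$(printf '%s\n' "$sample" |
  awk -v yr=1995 'BEGIN {FS = "[\t -]"}
    $1 ~ "^"yr {sum[$2] = sum[$2] + $5; ct[$2]++}
    END {for (m in sum) printf "%s\t%5.3f\t%d\n", m, sum[m], ct[m]}' |
  sort -n)
echo "$monthly"
```

The output has one line per month of 1995: month 01 with total 2.000 from 1 record, and month 02 with total 4.000 from 2 records.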

Part II

There are a few new concepts in the second script, days-in-month.bash, which is listed in full after these notes:

  • [ -f "$2" ] checks whether the file given in the second positional argument exists.
  • [ test1 ] && [ test2 ] is the logical AND operator, returns true only if both tests are true. (similarly [ test1 ] || [ test2 ] is the OR operator).
  • Then we learn the concept of functions here. dyn is a function that takes two arguments: an integer representing a month (Jan=1, Dec=12) and a string giving the year in YYYY form.
    • First, it tests whether argument 1 or 2 is empty ([ -z "$1" ] is true when $1 is an empty string).
    • Then we use the UNIX date command to get the number of days since the start of the year to the first of the month.

Backquotes, "`", (this key is usually located at the top-left corner of the keyboard next to number 1) can be used to capture the output of a command. e.g.

 ls -l `which bash`
The output of the which command is captured and used as the argument for the ls command.
    • We capture the output of the date command in the variable ds by using the funny quotes called backquotes (`cmd`).
  • We have used the for var in val1 val2 val3 ... construct here. It gives the variable var the values val1, val2, val3 ... in turn and executes the stuff within the do ... done block.

#!/bin/bash
#program days-in-month.bash (days of each month in a given year)
if [ $# -lt 1 ]; then
  echo "$0 computes the  number of days in each month of a given year"
  echo "Usage: $0  <year> [<filename>]  "
  echo "<year> is the year to be computed in YYYY form " 
  echo "If the optional argument file is given. "
  echo "The files first column is read as month numbers to be computed. "
  echo "Instead of total 12 months, only those months are computed"
  exit 1
fi
year=$1
echo "      computing with year : $1 " 1>&2
if [ ! -z "$2" ] && [ -f "$2" ]; then  # we have a valid file name
  echo "     also with $2 as month list " 1>&2
  file=$2
fi 
## define some useful functions
dyn(){
  #computes the day of the year (1-366) of the first day of the given month. 
  # usage: dyn month year
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo -1
    return 1
  fi
  local ds=`date -d"$2-$1-01" +%j`
  #strip leading zeros, so that bash does not read the number as octal
  local tmp=${ds#0}
  echo ${tmp#0}
}
#list of months to be computed
mnths="1 2 3 4 5 6 7 8 9 10 11 12"
#however, if there's a file specified as optional argument,
# then take the values from the first column of the file, not the above. 
if [ "$file" ]; then
  mnths=`cat $file|awk '{printf "%i ",$1}'`
  #give the user a bit of a information on what is happening. 
  echo 'computing restricted to only the months: ' $mnths 1>&2 
fi
### now compute
for  mn  in $mnths ; do  
  if [ "$mn" -ne "12" ]; then 
    mtemp=$(( $mn + 1 ))
    ytemp=$year
    sd=`dyn $mn $year` 
    ed=`dyn $mtemp $ytemp` 
#days in this month = (day of year of next month's first day) - (that of this month's)
    echo "$mn    $(( $ed - $sd ))"
  else
#december !
    echo "12    31"
  fi
done
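
One detail of dyn that is easy to miss is the stripping of leading zeros from the output of date +%j. The sketch below illustrates why it is done; it uses fixed strings in place of calls to date, so it does not depend on GNU date being available, and the variable names day, next and march are made up for the illustration:

```shell
# date +%j prints the day of the year padded to three digits, e.g. "008".
# A fixed string is used here instead of calling date.
ds="008"

# Remove up to two leading zeros, as dyn() does.
tmp=${ds#0}
day=${tmp#0}

# With the zeros gone the value is safe in shell arithmetic; a number with
# a leading zero would be read as octal, where 8 is not a valid digit.
next=$(( day + 1 ))
echo "$day $next"

# The same idea gives the length of a month from two day-of-year values
# (060 and 091 are March 1st and April 1st of a non-leap year).
sd="060"; ed="091"
sd=${sd#0}; sd=${sd#0}
ed=${ed#0}; ed=${ed#0}
march=$(( ed - sd ))
echo "days in March: $march"
```

The first line printed is "8 9" and the second reports 31 days in March.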

Part III

#!/bin/bash
#program mon-tot-with-qual.bash (monthly totals)
if [ $# -lt 3 ]; then
  echo "$0 computes the monthly total rainfall of a given year from a text file containing rainfall records"
  echo "Usage: $0 <infile> <year> <tstep> "
  echo "<infile> should be a text file with rainfall records in the format:"
  echo "   YYYY-MM-DD HH24:MM VALUE"
  echo "<year> is the year to be computed" 
  echo "<tstep> is the timestep size of the series in minutes." 
  echo "The output is in the following form "
  echo "<month> <totalrainfall> <approx %of missing data> "
  exit 1
fi
#call monthly total 
echo "calling totalling script.." 1>&2 
./monthly-total.bash $1 $2 $3 > $$.tmp.1
echo "done." 1>&2 
echo "Computing the dates in each month" 1>&2 
./days-in-month.bash  $2 $$.tmp.1 > $$.tmp.2
echo "done"  1>&2 
echo "Combining results.."  1>&2 
paste $$.tmp.1 $$.tmp.2 |awk '{printf "%i\t%6.2f\t%4.1f\n", $1, $2, ($5-$3)/$5*100}'
echo "..done." 1>&2 
rm -f $$.tmp.?
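
The final combining step is perhaps easier to follow on a tiny example. In this sketch the two intermediate files are replaced by made-up stand-ins (the file names totals and days and all the numbers are invented); paste glues them side by side and awk computes the percentage of missing data from the third and fifth columns:

```shell
workdir=$(mktemp -d)

# Made-up stand-ins for the two intermediate files:
# month, total rainfall, days covered by records ...
printf '01\t55.000\t31.000\n02\t20.500\t21.000\n' > "$workdir/totals"
# ... and month, number of days in that month.
printf '1    31\n2    28\n' > "$workdir/days"

# paste joins the files line by line; awk then sees five fields per line.
quality=$(paste "$workdir/totals" "$workdir/days" |
  awk '{printf "%i\t%6.2f\t%4.1f\n", $1, $2, ($5 - $3) / $5 * 100}')
echo "$quality"

rm -r "$workdir"
```

January is fully covered (0.0 % missing), while February has records for 21 of its 28 days, so 25.0 % is reported missing.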

Sed

Sed is another tool, somewhat similar to awk. It is probably possible to do everything sed does using only awk, but certain tasks are so much easier if you know a little bit of sed. Here, a little means, perhaps, only the famous s command. Let's start directly with an example. (Try sed at your bash prompt. If it does not give the help text of sed, maybe you don't have sed installed in your Cygwin setup. Please install it before continuing.)

Let's say we want to separate the YYYY, MM and DD of each line by replacing the hyphens (-) with spaces. We can do this with awk, by changing the field separators. However, it is much easier with sed:

cat Sample-rainfall|sed 's/-/ /g' > new-rainfall 
#remember, there is a space between the second and third slash
and the file new-rainfall will have YYYY MM DD, instead of the original YYYY-MM-DD. Pretty boring, eh? Well, let's go for something less boring in the next example. But, before that, let's try to understand how this works.
  1. As in awk, we enclose sed commands in single quotes ('), so that our bash shell does not jump in and try to interpret them.
  2. s means 'substitute'.
  3. s/foo/bar/ means 'substitute (the first occurrence of) foo in each line with bar'.
  4. When we add a g, instead of the 'first occurrence', sed will replace ALL occurrences of foo with bar.
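
The difference the g makes is easiest to see on a single line. A minimal sketch, using one record typed in by hand:

```shell
line='1993-09-30 13:20 0.0'

first=$(echo "$line" | sed 's/-/ /')    # only the first hyphen is replaced
all=$(echo "$line" | sed 's/-/ /g')     # every hyphen is replaced

echo "$first"
echo "$all"
```

The first command gives "1993 09-30 13:20 0.0" and the second "1993 09 30 13:20 0.0".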

As always, we don't want to reinvent the wheel here. So, here is a list of references for sed. But, remember, sed is not as well documented as awk. In this case, you may be better off by buying a book.

Sed resources on the web are not as good as those for other GNU programs like, say awk. If you like to go beyond the basics, perhaps you should get hold of a book (e.g. ISBN:1565922255 ).

  1. Wikipedia article on sed. (Has some examples and useful links to other sites.)
  2. One of the many sed FAQs.
  3. Or buy a book: sed & awk (2nd Edition) (ISBN:1565922255) by Dale Dougherty and Arnold Robbins is a good one.

Now to the example: We want to replace all the delimiters (-, : and the 'tab' character) in the file with a single space each. However, we want to take care of something else -- we don't want to remove any minus signs associated with values. (Of course, in this example there are no negative values: no negative rainfalls, or negative dates!)

cat Sample-rainfall |sed 's/\([0-9]\)-\([0-9]\)/\1 \2/g; s/[\t:]/ /g' 
How this works
  • First, some regular expression notation:
.*
means zero or more (*) of any character (.). And
[abc]
means any one of the characters a, b or c;
(foo|bar) 
means either foo or bar. Further, parentheses ( ) keep the part matching the regular expression inside them in memory so that we can use it later in the expression.
      • Some of these patterns need to be 'escaped' in sed. For example, if we write (foo|bar) as in the last example, sed will interpret the parentheses and the pipe (|) literally. To tell sed that they are metacharacters, we have to write the expression as \(foo\|bar\).

Lack of documentation, combined with the necessity of escaping some metacharacters while others don't need it, has made sed look a bit like a form of black art. The best way to write accurate sed scripts is perhaps to test them with simple input.

Now let's try to understand the example
  • [0-9] says "any character from 0 to 9". Parentheses ( ), written here as \( \), keep whatever we caught in memory so that we can use it later.
  • Then look at the replacement part. \1 means the stuff caught by the first pair of parentheses and \2 that caught by the second.
  • After we close the replacement part come the options. Here we have g, which means: do that for all occurrences of matching patterns. If this is left out, only the first occurrence will be replaced.

All in all our script says:

Find all the occurrences of a hyphen sandwiched between two digits and replace those hyphens with a space.

A legitimate minus sign cannot come sandwiched between two numbers. Hence, if we find n-m, where n and m are digits, then what is in the middle should be a hyphen, not a minus sign.
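
Since our sample file has no negative values, the effect is best checked on a made-up record that does have one. This sketch uses the same first expression as above; for simplicity the second expression only handles the colon, since the test line contains no tab:

```shell
# A made-up record with a negative value in the last column.
line='1993-09-30 13:20 -4.5'

# Hyphens between two digits become spaces; the colon becomes a space;
# the minus sign (preceded by a space, not a digit) is left alone.
clean=$(echo "$line" | sed 's/\([0-9]\)-\([0-9]\)/\1 \2/g; s/:/ /g')
echo "$clean"
```

The result is "1993 09 30 13 20 -4.5": the date and time are split up, but the minus sign of the value survives.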

There is another way to write the hyphen rule, which makes it much more general (i.e. applicable to a wider range of files).

 cat Sample-rainfall  |sed 's/\(\w\)-\(\w\)/\1 \2/g'
Here \w means a 'word' character (a letter, a digit or an underscore; this is a GNU sed extension). So we ask sed to replace each hyphen sandwiched between two word characters with a space.
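
A quick check of the \w version, again on a made-up line, this time with a month name that the [0-9] version would not catch. It assumes GNU sed, since \w is not part of standard sed.

```shell
# The hyphen between "Jan" and "05" is replaced; the minus sign after the space is kept.
result=$(echo 'Jan-05 -3.5' | sed 's/\(\w\)-\(\w\)/\1 \2/g')
echo "$result"    # prints: Jan 05 -3.5
```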

That's all we cover on sed.

Handling Two Dimensional Arrays

Datasets with more than two dimensions that we encounter in practice are often too heavy to be handled by scripts or spreadsheet programs. It is often much more efficient to write a small computer program in a language like FORTRAN or Java, or to use a specialist program.

Often we are faced with datasets that vary in more than one dimension. In this section we cover some techniques for extending what we have learned so far to 2-dimensional arrays. The obvious question: what should we do for 3-D (or higher) arrays? Except for the very simplest cases, don't bother with awk or sed! (Never dream of using Excel either!) There are two ways to attack d>2. One is to write your own program: even code written by the clumsiest programmer will usually run many times faster than our shell scripts, awk, sed or Excel. The other is to use a specialized (often commercial) tool like MATLAB, Mathematica, etc.

Now back to track!

Two dimensional data sample

Read this article for an application with demonstration on real-time plotting of GFAS data

The Global Flood Alert System (GFAS) produces daily rainfall maps of the world on a real-time basis. We shall use one of these datasets to exercise our battery of tools. Download the sample dataset from here. Alternatively, you can download a more current dataset from the GFAS trial site. (Please note that this data is subject to the terms and conditions of NASA.)

bunzip2  Sample-spatial.bz2 #will create a file named Sample-spatial

This file is a daily rainfall 'map' covering the region from 90E to 150E and 60S to 60N at 0.25 degree resolution. There are 240x480 points (values): 240 values in each of 480 rows. The first value is at 90.25E, 59.75S and the last value is at 149.75E, 59.75N.

This dataset has redundant field separators (a comma in addition to a space) that can be confusing. Let's remove the commas and make a new dataset with only space-separated values.

cat Sample-spatial|sed 's/,//g' > Sample-spatial.2

Now let's try a very simple operation. Suppose we want to create a single-column dataset by 'unfolding' this 2-D dataset row by row, from top to bottom. We can easily do this with awk.

cat Sample-spatial.2 | \
awk  '{for(i=1;i<=NF; i++){print $i}}' > Sample-spatial.txt

The only thing that is new to us here is the use of the special variable NF. In awk, NF means the number of fields in the current row, and NR means the number of records (or 'rows') read so far; inside an END block, NR is therefore the total number of rows in the file.
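
To see NF and NR at work, the following sketch runs awk on three made-up rows of different lengths:

```shell
# Three made-up rows with 3, 2 and 4 fields.
report=$(printf '1 2 3\n4 5\n6 7 8 9\n' |
  awk '{print "row", NR, "has", NF, "fields"} END{print "total rows:", NR}')
echo "$report"
# prints:
# row 1 has 3 fields
# row 2 has 2 fields
# row 3 has 4 fields
# total rows: 3
```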

It is always a good idea to check the length of the resulting file. (What if our original file has a short row with fewer than 240 values?) One way is to count the number of values:

cat -n Sample-spatial.txt| tail -n1 
#should give 115200 with the value in the last row. (something like 115200  xx.xx)
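
A count of 115200 tells us that nothing is missing, but not which row is wrong when the count is off. A possible way to find short or long rows directly is sketched below; check_rows is a hypothetical helper name, and the demonstration uses a small made-up file rather than the real dataset.

```shell
# Hypothetical helper: list the rows whose field count differs from the expected one.
check_rows() {   # usage: check_rows EXPECTED_FIELDS FILE
  awk -v n="$1" 'NF != n {print "row " NR " has " NF " fields"}' "$2"
}

# Demonstration on a small made-up file (expected: 3 values per row).
tmp=$(mktemp)
printf '1 2 3\n4 5\n6 7 8\n' > "$tmp"
bad_rows=$(check_rows 3 "$tmp")
echo "$bad_rows"    # prints: row 2 has 2 fields
rm -f "$tmp"
```

For the real data the call would be check_rows 240 Sample-spatial.2, which should print nothing if every row is complete.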

Regridding data

There are more versatile tools for advanced 2-D data operations than shell scripts with awk and sed. One such example is GMT (the Generic Mapping Tools), which we will cover in a later section.

One of the common operations on 2-D data is regridding the data to a coarser resolution (say, averaging nx x ny pixels into a single pixel). We shall write a script that can do this operation in a fairly general way. We shall also learn how to write our awk scripts in a much cleaner and more maintainable way than squeezing them onto the command line.

  1. First save the following code in a file named spatialavg.awk
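#PROGRAM spatialavg.awk
#Averages each block of nx x ny values into a single value.
BEGIN{
  if (nx<2 || ny<2){
    print "wrong values for nx,ny",nx,ny
    bad=1 #END still runs after exit, so remember the failure
    exit 1
  }
}
{
  #now we are processing rows
  #first read all the values in to memory
  for(xx=1;xx<=NF;xx++){
    val[NR,xx]=$(xx)
  }
  ncols=NF #remember the row length; NF is not reliable inside END in every awk
}
END{ # now we are done reading
  if (bad) exit 1 #avoid dividing by a zero nx or ny
  for(ly=0;ly<=NR/ny-1;ly++){
    for(lx=0;lx<=ncols/nx-1;lx++){
      out=0
      #sum the ny x nx values that fall in this coarse pixel
      for(sy=0;sy<ny;sy++){
        for(sx=0;sx<nx;sx++){
          out+=val[(ly*ny+sy+1),(lx*nx+sx+1)]
        }
      }
      printf "%5.3f\t", out/nx/ny
    }
    printf "\n"
  }
}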
  2. We can run the script as follows
 
awk -v nx=2 -v ny=2 -f spatialavg.awk Sample-spatial.2 \
       > Sample-spatial.3 # average each 2x2 pixels 
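
Before trusting the result on 115200 values, it helps to try the same averaging on a grid small enough to verify by hand. The sketch below is a condensed inline version of the block-averaging idea, not the spatialavg.awk file itself; the 4x4 grid is made up, and the output format (one decimal place, space separated) is chosen only to keep the example readable.

```shell
# Made-up 4x4 grid holding the numbers 1..16.
grid=$(printf '1 2 3 4\n5 6 7 8\n9 10 11 12\n13 14 15 16\n')

coarse=$(echo "$grid" | awk -v nx=2 -v ny=2 '
  { for (x = 1; x <= NF; x++) val[NR, x] = $x; ncols = NF }   # keep every value in memory
  END {
    for (ly = 0; ly < int(NR / ny); ly++) {          # each coarse row
      for (lx = 0; lx < int(ncols / nx); lx++) {     # each coarse column
        sum = 0
        for (sy = 1; sy <= ny; sy++)
          for (sx = 1; sx <= nx; sx++)
            sum += val[ly * ny + sy, lx * nx + sx]
        printf "%s%.1f", (lx ? " " : ""), sum / (nx * ny)
      }
      printf "\n"
    }
  }')
echo "$coarse"
# prints:
# 3.5 5.5
# 11.5 13.5
```

The top-left value, for instance, is (1+2+5+6)/4 = 3.5, which is easy to confirm by eye.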


When to & when NOT to

Use the correct tool for the task at hand!

  • If you have a dataset with only a hundred rows, and you need to check fifty different types of graphs with it, by all means use a spreadsheet.
  • To do an involved numerical calculation related to atmospheric physics, write a program in a 'proper' language like FORTRAN, or better still, find an existing one that suits your needs and start modifying it. The GNU awk user's guide says the following about the appropriateness of awk:
If you find yourself writing awk scripts of more than, say, a few hundred lines, you might consider using a different programming language. Emacs Lisp is a good choice if you need sophisticated string or pattern matching capabilities. The shell is also good at string and pattern matching; in addition, it allows powerful use of the system utilities. More conventional languages, such as C, C++, and Java, offer better facilities for system programming and for managing the complexity of large programs. Programs in these languages may require more lines of source code than the equivalent awk programs, but they are easier to maintain and usually run more efficiently.

Error handling in computer scripts is something that we often forget, and mostly get away with. But occasionally it leads to total disaster. Read this article on error handling in bash before attempting anything more substantial than processing some data.

  • To process 2-d geographical data, a GIS system is more appropriate.
  • But, having said all that, I have found that there are hundreds of day-to-day tasks that fall between the above standard tools, and that awk, sed and shell scripting, taken together, are a powerful framework for handling many of them.
