CSI2131 File Management Winter 2003

Assignment #1 (worth 8%, 80 marks, due date: Tuesday February 11, 10:30 a.m.)

PART I: PROGRAMMING PART  (70 marks)
========================

Overview 
-------- 
In this assignment, you will  apply your knowledge of fundamental file
operations in  C++ and data  compression to write two  simple programs
for compressing and uncompressing images.  The program for compressing
images takes a text file containing  an "image" and outputs a new file
which  is a compressed  version of  the input  file.  The  program for
uncompressing  images takes  the  compressed file  and reproduces  the
original image.


Run-Length Encoding
------------------- 

Run-Length Encoding (RLE) is a simple technique for compressing files,
which works  as follows.  First, we choose  one special  (unused) byte
value as run-length  code indicator. Then we search  for repeated runs
of a single byte value in an  input file, and replace them by a single
instance of the code indicator, the  byte value, and a run count.  You
may consult Section 6.1.2 in our textbook to revise the algorithm.  To
uncompress  an encoded  file, it  is enough  to sequentially  read the
compressed file,  searching for the run-length  code indicators.  Each
time you encounter such an indicator, expand the repeated value by the
number  of time  it  has to  be  repeated.  For  this assignment,  use
character "*" as the run-length code indicator.

Reading Material (for further reading but not essential to your task)
----------------
http://www.rasip.fer.hr/research/compress/algorithms/fund/rl/index.html


Representing Images
-------------------
Images files that you will use are just ASCII files. The following is 
an example of such an "image":

         ooooooxxoooooooooooooaaooooooooooooooooo
         ooooooxxxxooooooooooaaaaaooooooooooooooo
         ooooooxxxxxxxoooooooooaaoooooxxxoooooooo
         oooooooooxxxoooooooooooaoxxxxxxxxooooooo
         ooooooooooooooxxooooooooooxxxxxooooooooo
         oooooooooooxxxxxxoooooooooooxxoooooooooo
         ooooooooxxxxxxxxxoooooooooaaoooooooooooo
         ooooooooooooxxxxxoooooooooaaaooooooooooo
         ooooooaaaoooooxxxooooooooaaaaooooooooooo
         ooooaaaaaoooooooooooooooooaaoooooooooooo
         oooooooaaoooooooooooooooxxxxxxxxoooooooo
         ooooooooooooooxxooooooooooxxxxxooooooooo
         aoooooooooooxxxxxxoooooooooooxxooooooooo
         aaaaooooooooxxxxxxxxxooooooooooooooooooo
         oaaoooooooooooxxxxxooooooooooaaaaooooooo
         ooooooooooooooxxxoooooooooaaaaaaaooooooo

This image contains 3 sorts of characters which are a's, o's, and x's.
Imagine that the image represents a space image where the o's are some
background colour, the  a's are certain colour brighter  than the o's,
and the  x's are the brightest  colour available.  In  fact, one could
use blanks instead  of the o's to represent  the background colour. At
the end of each line there  are characters representing a new line (in
DOS, these are 2 characters). You  can create such an image by using a
text  editor such as  notepad.  Try  to create  both dense  and sparse
images  to  test  your  algorithm   and  verify  how  it  affects  its
compression ratio.

Command Line Arguments
----------------------

The input  file name for  your programs will  be given on  the command
line.   This  means  that  after  your  program  is  compiled  and  an
executable is created,  you are going to run it on  the DOS prompt and
specify some arguments which will  represent options and file names to
be  used in  your  program.  Details  on  how to  access command  line
arguments inside your program will  be explained in a separate page in
the web. In  short, the main mechanism to do that  involves the use of
arguments "argc" and "argv" in the main function:
   int main (int argc, char* argv[]) { ... }
You may look for information on  the use of these arguments on the web
or on a C++ book.


Programs to be created
----------------------
1) compress

Create a  project "compress"  with a single  program ("compress.cpp");
the executable program will be called "compress.exe" by default, since
"compress" is the project name.

This program (compress.exe)  will be executed at the  DOS prompt where
the  user  can specify  command  line  arguments  using the  following
format:
               compress.exe [-v] <image>
or alternatively (the user may omit the extension .exe):
               compress [-v] <image>   

where <image> is an existent input file and -v (for "verbose mode") is
an optional flag. If verbose mode is selected, your program must print
a  message  to  the  standard  error  telling the  user  what  is  the
compression  ratio achieved, which  is the  ratio: size  of compressed
file divided by the size of original file.

The resulting  program output (compressed  file) must be written  to a
file  with the  same name  as the  input file,  but extended  with the
suffix ".rle". 

Examples: If you run:  compress.exe galaxyNGC891 
          the compressed output file should be "galaxyNGC891.rle"  
          If you  run: compress.exe galaxyNGC891.txt
          the compressed output file should be "galaxyNGC891.txt.rle"

2) uncompress

Similarly to compress,  you will create a project  "uncompress" with a
single    program     "uncompress.cpp"    creating    an    executable
"uncompress.exe".  This program  will receive  a file  created  by the
program "compress" and re-create the original, uncompressed file.
The following format will be used for uncompress:

               uncompress <compressed image>   

where  <compressed   image>  is  your   existent  input  file   to  be
uncompressed.  The  program should  check that <compressed  image> has
the extension ".rle"  and if it doesn't it  must append such extension
to it.  In this  way, the user  is given  the option of  providing the
suffix or not.

So a command such as "uncompress galaxyNGC891" should be understood by
the program as "uncompress  galaxyNGC891.rle". The resulting output of
the program must be written to a  file with the same name as the input
file,  but without the  extension ".rle".  So if  your input  file was
"galaxyNGC891.rle", the outputfile must be named "galaxyNGC891".
 
Handling Erroneous Inputs
-------------------------
Your programs must handle the following erroneous cases:
  - incorrect use of command line arguments
  - compressing or uncompressing a non-existent file
  - creating a compressed or uncompressed file when another file  with
    the same name already exists.

Make things very simple and do not prompt the user for an alternate 
input. 

Documentation and Style
-----------------------

Your program must be well-documented and written in good style.  Part
of  the  marks will  be  dedicated  to  documentation and  style.   To
document your program, write comments  explaining what is done in the
following  lines  and  add  comments  at  the  end  of  certain  lines
explaining what  has been  done on that  line.  For each  function and
procedure, add  a full explanation  before it, explaining  its purpose
and what  its parameters  represent.  For each  class created,  add an
explanation before it.  Good style includes good documentation, choice
of clear names (for  variables, classes, functions, etc), good program
organization and  design (like creating functions that  deal with each
part of the program and/or creating classes where appropriate).
NOTE: Although it is a  preferable style to place different classes in
different files, if you create  any classes for this program, they all
must be included in the same  program file in order to comply with the
standards described below.

Testing
-------
You must test your programs with many different command line arguments
covering:
* invalid inputs (testing every possible user error);
* valid inputs (using various different image files).
We shall provide one or more image files for you to test your program;
they will be included in  the webCT assignment page. You should create
various image  files for  testing, one of  which you will  submit.  We
will test your program with the  files provided as well as other files
of our choice; we will also test a wide range of user errors.

What to Submit
--------------

Hand in the following files: 
compress.cpp   (program for compression)
uncompress.cpp (program for uncompression)
image1.txt     (file with a 20x20 ASCII image created by you)
image1.txt.rle (the compressed file produced by running your program 
                on image1.txt)  
written.txt    (textfile containing your answers to written questions)

EVERY FILE  MUST CONTAIN A  HEADER WITH Student Name,  Student Number,
Course and Section. Programs that do not comply with this standards or
with the standard file names above will receive mark deductions.

Create a directory/folder as  explained in Assignment#0, copy the five
files  you  have  created  into  that directory/folder,  and  zip  the
directory. Similarly  to what you  were asked in  Assignment#0, submit
the zipped file as your assignment#1 submission.

Your program must meet the specifications given in this handout and in
any further clarification given later on the web page.
======================================================================

PART II: WRITTEN QUESTIONS
==========================
Answer these questions in paper first  and then type your answers in a
text file called "written.txt" (use a text editor such as notepad).

1) Magnetic Disks (6 marks)

Given a file with the following characteristics:
  length: 1,000,000 bytes
  organization: 2,000 records with 500 bytes per records  

and a disk drive with the following characteristics:
  Number of Bytes per Sectors = 600
  Number of Sectors per Track = 100
  Number of Tracks per Cylinder = 4
  Average seek time = 2 msec
  Average rotational delay = .5 msec
  Transfer time = .01 msec/sector = 1 msec/track

We further assume the following regarding our disk organization:

1-a) Configuration 1:

 .Records are allowed to span two sectors.
 .Consecutive file records are stored in consecutive disk sectors in a
  track.
 .The tracks are spread out randomly across the disk.

1-b) Configuration 2:

 .Only one record is stored in each sector.
 .Consecutive file records are stored in consecutive disk sectors  and
  consecutive disk tracks within each cylinder.
 .No rotational delay is incurred in going from one track to the  other
  within a cylinder during sequential access (i.e.,  the  beginnings of
  the  tracks are appropriately staggered within a cylinder).
 .The cylinders are spread out randomly.

For each of the configurations above, calculate:
A) The time it would take to read this file completely SEQUENTIALLY. 
B) The time it would take to read this file completely using RANDOM 
   ACCESS (traversing each record once in random order).
C) The amount of internal fragmentation present in the file due to 
   unused space in a sector.
For each calculation, show all the steps in your derivation.

2) Magnetic Tapes (4 marks)

Consider the following tape description:
   density = 3,000 bpi
   interblock gap = 0.2 inch
   Speed = 200 ips

and the following file characteristics:
   number of records = 1,000
   record size = 30

Using blocking factors 50 and 100, calculate the following quantities:
A) Amount of tape (in feet) required to store the file 
   (note that 1 foot = 12 inches).
B) Effective recording density.
C) Effective transmission rate.
For each calculation, show all the steps in your derivation.
======================================================================