CSCI 340: Operating Systems
Fall 2007
Program 1: Linux locate utility
Code due: Midnight Thursday September 13
  (copied to your ecst turn-in directory, see below)  
Grading Weight: 1 (programs will have a weight between 1 and 5)


Overview:
The goal of this program is to introduce you to UNIX/Linux system calls.  The program is simple (only about 150 lines of code), but you will need to learn about several system calls, and the syntax of system calls can be tricky.

Implement a rudimentary version of the Linux updatedb & locate utilities.  updatedb creates a database of all the file and directory names on a filesystem.  locate allows the user to search this database for a given filename.

updatedb and locate are separate utilities.  In order to simplify this assignment, your version of locate will (1) build the database, (2) allow the user to enter regular expressions, and (3) print all files in the database that match the entered regular expressions.

Your program must work like this (I use the name rlocate so my utility is not confused with /usr/bin/locate):

$ rlocate myfiles
enter target> ted
myfiles/ted
enter target> t.d
myfiles/dir/longname_with_tud_in_middle
myfiles/dir/tid.cpp
myfiles/tad
myfiles/ted
myfiles/tod
enter target> ^D

^D means the user holds the <ctrl> key while pressing the <d> key.  It is the UNIX/Linux end-of-file character.

The only command line argument (which is required) is the path (path is a Linux term for directory) that rlocate will recursively search (myfiles in the above example).  After the path has been searched and the database of all filenames and directory names has been created, rlocate prompts the user for a regular expression.  In the above example the first regular expression is <ted>.  Of all the files in myfiles and its subdirectories, there is only one that matches: <myfiles/ted>.

The second regular expression is <t.d> (in regular expressions the "." is a wildcard that stands for any single character).  Of all the files in myfiles and its subdirectories, there are 5 that have a <t.d> in their names.

When the user enters ^D the program terminates.
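
The prompt-and-read behavior described above can be sketched as follows.  This is only an illustration, not the reference solution: the name prompt_loop and its handler parameter are hypothetical, and the sketch assumes the prompt text is exactly "enter target> " as in the sample session.  When the user types ^D, getline() fails and the loop ends.

```cpp
#include <iostream>
#include <sstream>   // only needed if you drive the loop from a string, as in the note below
#include <string>

// Hypothetical helper (not part of the assignment): prompt on `out`, read one
// line from `in`, and pass it to `handle`.  Returns the number of targets read.
// The loop stops when getline() fails, which is what happens at end-of-file (^D).
template <typename Handler>
int prompt_loop(std::istream &in, std::ostream &out, Handler handle)
{
    std::string target;
    int count = 0;
    while (true) {
        out << "enter target> ";
        if (!std::getline(in, target))
            break;                  // EOF (^D) or read error: stop prompting
        handle(target);             // e.g. search the database for this target
        ++count;
    }
    return count;
}
```

In a real program you would pass std::cin and std::cout.  For a quick experiment you can pass a std::istringstream holding a few lines of input instead of typing them.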



Program Requirements:

After you traverse all the files in the given path and all its sub-directories, sort them alphanumerically.  This way your output will exactly match my sample output.  

Don't traverse symbolic links (it makes the program harder if you follow symbolic links).

Use regcomp() and regexec() to implement regular expressions.

When traversing subdirectories, ignore (skip) the directories "." and ".." ("." or ".." would be a legal command line argument).

In order to make testing and grading easier, use the following code for your error messages:


cerr << "illegal regular expression: <" << target << ">" << endl;
cerr << "must specify path on command line" << endl;
cerr << "permission denied <" << pathname << ">" << endl;
cerr << "directory does not exist <" << pathname << ">" << endl;

When the user enters an illegal regular expression, print the above error message and then prompt the user for the next regular expression.  Do not terminate your program.


Testing Your Program:

A few days before the due date I will provide sample tests for your program in the tests/p1 directory.

Each test consists of a directory (t01), a shell script to execute your program (t01.cmd), the correct output (t01.out), and the correct error output (t01.err).  I will test your programs as follows:

$ cd to student's directory
$ ~tyson/340/tests/p1/t01.cmd > t01.out 2> t01.err
$ diff t01.out ~tyson/340/tests/p1/t01.out
$ diff t01.err ~tyson/340/tests/p1/t01.err

If there are any differences between the output of your program and the posted correct output, diff will print the difference, you fail this particular test, and you will lose points.  Thus the output of your program must be identical to the output of my program.

I will use tests that I don't post, so it is a good idea to develop some of your own tests.  My goal will be to exercise all the possible errors.

If you are programming on Windows, watch out for the extra hidden character at the end of each line (DOS ends each line with a carriage return followed by a line feed, while Linux uses a line feed only).  I will only post Linux files.  However, you can convert your DOS files to Linux files using the command dos2unix.


Hints:

When working with system calls, it is very helpful to program incrementally.  Get a little piece of the program working and tested before you move on.

YOU MUST ALWAYS CHECK THE RETURN VALUE OF A SYSTEM CALL.   When a student comes to me for help on similar assignments, the problem is almost always caused by ignoring the value returned from a system call.
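
As an illustration of checking a return value, here is a minimal sketch built around opendir().  The helper name open_dir_checked is hypothetical.  The mapping from errno values to the required error messages (EACCES for "permission denied", ENOENT for "directory does not exist") is my assumption about how those messages are meant to be triggered, and the fallback branch prints text that is not part of the specified output.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <dirent.h>
#include <iostream>

// Hypothetical helper: open a directory and report a failure instead of
// silently continuing.  Returns NULL if opendir() failed.
DIR *open_dir_checked(const char *pathname)
{
    DIR *dir = opendir(pathname);
    if (dir == NULL) {
        // opendir() sets errno on failure.  Which errno value goes with
        // which required message is an assumption of this sketch.
        if (errno == EACCES)
            std::cerr << "permission denied <" << pathname << ">" << std::endl;
        else if (errno == ENOENT)
            std::cerr << "directory does not exist <" << pathname << ">" << std::endl;
        else  // not a message from the assignment; shown only for debugging
            std::cerr << "opendir: " << std::strerror(errno) << std::endl;
    }
    return dir;
}
```

The caller still has to test the returned pointer before using it, and has to call closedir() on success.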

The logic of this program is simple, but figuring out how to use the system calls can be tedious.

It is much easier to implement this program recursively.  Specifically, write a build_database function that finds all the files in a directory, and when you find a directory inside of the one currently being considered, call the build_database function on this entry:

bool build_database(char *pathname)
{
    for all files F in the current directory
        put pathname/F in the list of files
        if F is a directory, call build_database on pathname/F
}

If you use an STL vector to store the names you can sort it using the STL sort function.
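
A minimal sketch of sorting with the STL follows.  The function names sort_names and sort_c_names are hypothetical.  The first version assumes the names are stored as std::string; the second assumes they are stored as c-style strings, in which case sort needs a comparison function, because comparing two char pointers with < compares addresses, not text.

```cpp
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

// Sort std::string names in place.  std::string's operator< compares
// character by character, so "myfiles/dir/x" sorts before "myfiles/tad".
void sort_names(std::vector<std::string> &names)
{
    std::sort(names.begin(), names.end());
}

// Comparison function for c-style strings: order by text, not by address.
static bool cstr_less(const char *a, const char *b)
{
    return std::strcmp(a, b) < 0;
}

// Sort c-style string names in place using the comparison above.
void sort_c_names(std::vector<const char *> &names)
{
    std::sort(names.begin(), names.end(), cstr_less);
}
```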

Directories can be opened using opendir() and the files/directories in the directory can be extracted using readdir().

Directories can be closed using closedir().
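
The three calls fit together roughly as in the sketch below.  The helper name list_directory is hypothetical, it handles only a single directory (no recursion), and it stores the bare entry names rather than full pathnames.

```cpp
#include <cstddef>
#include <cstring>
#include <dirent.h>
#include <string>
#include <vector>

// Hypothetical helper: append the entry names of one directory to `names`,
// skipping "." and "..".  Returns false if the directory could not be
// opened or closed.
bool list_directory(const char *pathname, std::vector<std::string> &names)
{
    DIR *dir = opendir(pathname);
    if (dir == NULL)
        return false;               // caller can inspect errno for the reason

    struct dirent *entry;
    // readdir() returns NULL when there are no more entries (or on error).
    while ((entry = readdir(dir)) != NULL) {
        if (std::strcmp(entry->d_name, ".") == 0 ||
            std::strcmp(entry->d_name, "..") == 0)
            continue;               // skip the self and parent entries
        names.push_back(entry->d_name);
    }
    return closedir(dir) == 0;      // closedir() returns 0 on success
}
```

Note that readdir() does not return entries in sorted order, which is why the names need to be sorted afterwards.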

The function stat() and the macro S_ISDIR() can be used to determine if a pathname refers to a regular file or a directory.  There is a similar function and macro that can be used to determine if an entry is a symbolic link (I'll leave it to you to find out what they are).
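
A minimal sketch of the stat() and S_ISDIR() combination follows.  The helper name is_directory and its three-way return value are hypothetical choices made for this illustration.  The sketch does not cover the symbolic link check, which is left to you as described above.

```cpp
#include <sys/stat.h>
#include <sys/types.h>

// Hypothetical helper: returns 1 if pathname is a directory, 0 if it exists
// but is not a directory, and -1 if stat() failed (errno says why).
int is_directory(const char *pathname)
{
    struct stat info;
    if (stat(pathname, &info) != 0)
        return -1;                          // always check the return value
    return S_ISDIR(info.st_mode) ? 1 : 0;   // st_mode holds the file type bits
}
```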

The functions regcomp() and regexec() can be used to implement regular expressions.  Write a small program so that you can learn how regcomp() and regexec() work before you put them in your rlocate program.  Make sure your rlocate program correctly reads filenames before you add regular expressions.    Here is an introduction to regular expressions.
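
Here is the kind of small experiment suggested above.  The helper name matches is hypothetical.  The sketch assumes POSIX basic regular expressions (no REG_EXTENDED flag), since the assignment does not say which syntax to use, and it recompiles the pattern on every call for simplicity; a real program would compile each target once and reuse it for every name in the database.

```cpp
#include <cstddef>
#include <regex.h>
#include <sys/types.h>

// Hypothetical helper: returns 1 if `text` contains a match for `pattern`,
// 0 if it does not, and -1 if `pattern` is not a legal regular expression.
int matches(const char *pattern, const char *text)
{
    regex_t re;
    // REG_NOSUB: we only need a yes/no answer, not the match position.
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;                          // illegal regular expression

    int rc = regexec(&re, text, 0, NULL, 0); // 0 means a match was found
    regfree(&re);                            // release what regcomp() allocated
    return rc == 0 ? 1 : 0;
}
```

Because regexec() searches for the pattern anywhere in the string, the pattern t.d matches myfiles/dir/tid.cpp without any extra wildcards.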

Since system functions use c-style strings, it is much easier to write your entire program using c-style strings instead of C++ strings.  The following c-style string functions may be helpful:  strcpy(), strcat(), and strcmp().
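
For example, strcpy() and strcat() can build the pathname of a directory entry from its parent directory and its name.  The sketch below is one way to do it; the helper name join_path and the caller-supplied buffer are assumptions of this illustration.  The length check matters because strcpy() and strcat() do not check the size of the destination.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: write "dir/name" into buf, which holds `size` bytes.
// Returns false (and leaves buf untouched) if the result would not fit.
bool join_path(char *buf, std::size_t size, const char *dir, const char *name)
{
    // dir + '/' + name + terminating '\0'
    std::size_t needed = std::strlen(dir) + 1 + std::strlen(name) + 1;
    if (needed > size)
        return false;

    std::strcpy(buf, dir);      // buf = "dir"
    std::strcat(buf, "/");      // buf = "dir/"
    std::strcat(buf, name);     // buf = "dir/name"
    return true;
}
```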


General Requirements:
I will deduct 10 percent if your program does not compile using the command "g++ -o rlocate rlocate.cpp" on a Linux machine. 

I will grade your program using another program, so if your program does not work exactly as specified you will lose points.  For example, if you put an extra space at the end of the line, or a blank line at the end of the output, you will lose points.  Test your program using the sample tests in the test directory (see above).

Please put your name at the top of all your files.  If you are working with someone, put both your names at the top of all files. 


How to turn in code:
You must turn in the following file:

rlocate.cpp

Copy the above file into the directory:
/user/projects/csci340/p1/USERNAME
where USERNAME is your ecst username.

If you are working with someone, only turn in one copy of the assignment.  It is very helpful if people working together always turn their assignments in under the same name.

Note:   You will not be able to access your turn-in directory once the deadline has passed.



Late assignments:
You will lose 10% if your program is 1-24 hours late.
You will lose 20% if your program is 24-48 hours late.
Programs will not be accepted 48 hours past the deadline.

E-mail late assignments to me (avoid .zip files because they are usually removed by virus scanners).

It may take me much, much longer to grade late assignments (if I have already graded the other assignments, grading late assignments gets a low priority).

I use the time I receive your e-mail, not the time you send your e-mail, as the turn-in time (sometimes e-mail from an ISP takes a day or two to be delivered).  If you want to be absolutely sure your assignment gets to me immediately, e-mail it from your csuchico or ecst account (you can use your browser (webmail.ecst.csuchico.edu) to e-mail the file to me).