CSCI399
Autumn Session, 2009

Assignment 2: Perl

You should complete exercise 2 before atempting this assignment.


Aims

This assignment aims to introduce you to the Perl language and some of its applications.


Objectives

The objectives are for you to:


Resources

Perl

We will use the standard perl system available on Ubuntu. This is in /usr/bin/perl; it is a Version 5.8 level Perl. The perldoc utility has also been installed; you can get documentation on a module such as the DBI (database access) module via the command perldoc DBI while information on standard perl functions can be obtained via perldoc perlfunc. (The /share/cs-pub/399 directory on the main banshee server contains ActivePerl's HTML version of the Perl documentation. This is a more convenient format that the normal perldoc documentation. It includes all the basic chapters on the language, operators, standard functions, and many of the tutorials.)

Oracle database

The second and third parts of the assignment use tables in the Oracle database. All students who were enrolled in CSCI399 at the end of the official enrollment period will have Oracle accounts; late enrollees will have to apply for accounts.

The Oracle client software has been installed on the Ubuntu machines; the client software includes the sqlplus utility program that is convenient for creating tables and performing other simple database administration tasks.

Oracle uses a number of environment variables. These must be set correctly before you can use sqlplus or use the DBD-Oracle driver in Perl. You will need to modify your ".profile" file on the Ubuntu systems as follows:

  1. Create a file with the following commands:
    PATH=$PATH:/usr/local/instantclient_10_2
    SQLPATH=/usr/local/instantclient_10_2
    LD_LIBRARY_PATH=/usr/local/instantclient_10_2
    ORACLE_HOME=/usr/local/instantclient_10_2
    TNS_ADMIN=/usr/local/instantclient_10_2
    export PATH
    export SQLPATH
    export LD_LIBRARY_PATH
    export TNS_ADMIN
    export ORACLE_HOME
    
  2. Append that file to your profile file (.profile or .bashrc or .bash_profile or whatever appropriate).

When using sqlplus to connect to Oracle, you must specify your userid and the Oracle server name:

sqlplus hsimpson@csci

You should get a connection and be prompted for your password.

(As of January, there was some problem relating to the LD_LIBRARY_PATH environment variable. It was then necessary to define LD_LIBRARY_PATH and export this environment variable from within the current shell session prior to invoking sqlplus or using a Perl program that accessed the database. The setting in the .profile file was not being honored. This may have been fixed by the time you are attempting the assignment.)

Web server

You will use the Zend/Oracle/Apache web server that was used in assignment 1. Your Perl CGI scripts will be placed in your public_html directory (or a cgi-bin subdirectory). They will need to be renamed to have the .cgi file extension. Connection to Oracle will require that the program set some environment variables as detailed below.


Tasks

Extraction and Reporting

Image link reporting

This exercise involves "regexs" and simple data processing.

A program, "imagelink.pl", is to be developed that will extract information about all image links in a HTML page.

The HTML page is to have been saved, as source, to a local file. The imagelink.pl program is to be invoked with a command line argument identifying the file that is to be processed. It is to find all image links (<img ... >) and to print in tabular form details of the image source (src attribute) and alternative text (alt attribute); if an image has no alt attribute, the string --- should be printed for its alternative text.

The program is also to keep counts of the number of references to each different image type - gif, png, jpg, etc. A summary showing the frequency of each different type is to be printed before the program terminates.

The program is to extract the required data using regex matching. It will probably be easiest to work by first extracting the contents of a complete <img ... > element, and then using other regexs extract src and alt data. (Note that src attributes can occur in tags other than img tags.)

The directory /share/cs-pub/399/A2 contains a couple of data files (simpleimagelinks.txt and pbaseexample.txt) that can be used when testing. The output for the example file pbaseexample.txt should appear similar to the following:

http://i.pbase.com/site/m_pbase.gif                                    photo sharing and upload
http://i.pbase.com/site/m_g.gif                                        picture albums
http://i.pbase.com/site/m_f.gif                                        photo forums
http://i.pbase.com/site/m_s.gif                                        search pictures
http://i.pbase.com/site/m_sp.gif                                       popular photos
http://i.pbase.com/site/m_h.gif                                        photography help
http://i.pbase.com/site/m_l.gif                                        login
http://thump01.pbase.com/t3/17/869017/4/108734989.EtdWiXLY.jpg         suitcase
http://i.pbase.com/t1/24/407324/4/109094680.dsXX9iTN.jpg               4th month
http://i.pbase.com/t1/04/283704/4/109094453.TLdualn5.jpg               wedding
http://i.pbase.com/v3/43/37043/4/50765370.PA151794.jpg                 Fighters For Sale
http://thump01.pbase.com/t1/41/570641/4/109094669.cKfCRPsY.jpg         Photo A Day 2009
http://thump01.pbase.com/t3/51/896551/4/109094668.tc3Fvzgx.jpg         halloween 2008
http://i.pbase.com/t1/04/389604/4/108747357.e4eaLBRf.jpg               Ho Chi Minh City Sketchbook
http://thump01.pbase.com/t1/34/778634/4/108630378.RXdsSQj9.jpg         Theodore Roosevelt National Park
http://i.pbase.com/t1/11/80511/4/109025666.qj6wdoiN.jpg                Wild Thing ~ You make my heart sing   '09
http://thump01.pbase.com/t1/05/677405/4/103909998.rGtQhn3X.jpg         Misc
http://i.pbase.com/t1/75/283775/4/109094821.yVJpKHeE.jpg               More San Francisco Treats
http://i.pbase.com/t1/32/183732/4/109092717.QSzYPxtF.jpg               Misc. Photos
Frequency of different image types
gif     7
jpg     12
Enrolment files

It is often easy to get data in tabular format - but not quite the data or the format that is required. The /share/cs-pub/399/A2 directory contains files enrol_inf.txt and subjects_inf.txt that contain details of enrolments in Informatics subjects and subject titles etc.

The enrolment data has the following form

Class Id	Year	Campus	Delivery	Session	Unit	Faculty	Subject Code	Class Name	Enr	Wtd	Total
158142	2009	Woll	On Campus	Spring	MAAS	Inf	MATH161	Class 1	26	4	30
158143	2009	Woll	On Campus	Summer 2009/2010	MAAS	Inf	MATH162	Class 1	0	0	0
158155	2009	Woll	On Campus	Spring	MAAS	Inf	MATH212	Class 1	9	2	11

The fields in this file are separated by tabs. The first line contains column identification data; the remaining lines relate to individual subjects. The "Class Id" field is a unique primary key identifier. The fields Enr, and Wtd give the number of enrolled and withdrawn students in a class. The other fields have obvious interpretations. These enrolment data were recorded early in 2009. Most students had not enrolled in Spring session subjects so these have lower than expected enrolments. Many post-graduate students had not yet enrolled and consequently there are quite a few subjects, mainly at graduate level, with zero enrolments.

The subjects_inf.txt file contains data like:

ISIT925	Strategic Network Design	6
ISIT929	Concepts and Issues in Healthcare Computing	6
ISIT930	Introduction to Health Informatics	6

The fields are again separated by tab characters; they contain the subject code, name, and number of credit points. It is possible that the file may contain duplicate entries and there may be omissions where a "Subject Code" that appears in the enrolment file does not have an entry in the subject information file.

You are to implement a program 'enrolmentreport.pl' that:

  1. Opens and reads an enrolment file, and a subject information file, both filenames being provided as command line arguments;
  2. Identifies any subject with a non-zero enrolment that has no entry in the subject information file;
  3. Prints a report listing unique subject identifier, subject code, enrolments, and subject title for the fifty subjects having highest enrolments (largest enrolment subject listed first).

The following fragment shows the expected output for the given data files:

Don't know about GCT9301
...
Don't know about GCT9344
158161  MATH283 395     Mathematics IIE for Engineers Part 1
158136  MATH141 297     Foundations of Engineering Mathematics
158138  MATH142 244     Essentials of Engineering Mathematics
158169  STAT151 235     Fundamentals of Biostatistics
162209  ISIT102 173     Information Systems
158134  MATH121 144     Discrete Mathematics
158146  MATH187 128     Mathematics 1: Algebra and Differential Calculus
161743  CSCI212 121     Interacting Systems
...
162373  ISIT218 51      Systems Design and Human Computer Interaction

Using the DBI module

Perl is very useful when you require a program to perform a few simple database operations. Generally, you can "whip up" a Perl program for DB access much faster than an equivalent Java program (C/C++? - forget it!). Database usage is similar to Java's JDBC, though the names of the various components are changed. You get a "database handle" (~java.sql.Connection), use this to prepare statement handles (~java.sql.Statement or java.sql.PreparedStatment) that have SQL that can be executed. A "select" operation returns a form of iterator that allows you to obtain successive rows.

Almost all web applications involve database access. It is often worthwhile writing some Perl scripts to check out the operations and get the SQL queries right before you implement the main web application.

In this part of the assignment you create and populate tables for the subject information and enrolment data and perform some simple selection operations on these tables. Your table definition can be as follows (these are for Oracle):

DROP TABLE INFO;
DROP TABLE ENROLMENT;

CREATE TABLE INFO 
(
	SUBJECTCODE		VARCHAR2(10) primary key,
	TITLE			VARCHAR2(256),
	CREDITPTS		NUMBER(4)
);

CREATE TABLE ENROLMENT 
(
	SUBJECTID		VARCHAR2(10) primary key,
	SCODE			VARCHAR2(10),
	ENROLMENT		NUMBER(4),
	SESSIONON		VARCHAR2(32),
	UNIT			VARCHAR2(8),
	FACULTY		VARCHAR2(8),
	CONSTRAINT 	fk_subject FOREIGN KEY(SCODE) 
		REFERENCES INFO(SUBJECTCODE)
);

COMMIT;

Your Perl application will:

  1. Optionally take two filenames as command line arguments; if these are both given they identify files with additional data that are to be added to tables. Your program will process all the data in the files, creating new rows in the tables.
  2. The program will then loop, prompting for and processing SQL queries. The user should enter a syntactically correct SQL select statement; the "prepare statement" step in the program will fail if the SQL is invalid - in this case the program prints the error message from Oracle and reprompts the user. If the statement is successfully prepared, it is executed and the contents of each row in the result set are presented in a report to the user.
  3. The program terminates when the user enters the word "quit" in response to a prompt for an SQL query.

The directory /share/cs-pub/399/A2 contains a number of subject and enrolment files for different faculties - use these to populate your tables.

The following outputs provide an illustration of the operation of your Perl DBI program:

$ ./PerlDBI.pl enrol_law.txt subjects_law.txt 
Connected to oracle
Uploading data
Enrolments file enrol_law.txt
Subject info file subjects_law.txt
Don't know about CMP 905
Don't know about LLB 337
Running search queries
Enter SQL select query or Quit : select count(*) from enrolment
1003
Enter SQL select query or Quit : select max(enrolment) from enrolment
438
Enter SQL select query or Quit : select max(enrolment) from enrolment where faculty='Edu'

Enter SQL select query or Quit : quit
Disconnected from oracle
$ ./PerlDBI.pl enrol_edu.txt subjects_edu.txt 
Connected to oracle
Uploading data
Enrolments file enrol_edu.txt
Subject info file subjects_edu.txt
Don't know about EDGS920
Don't know about EDGC802
Running search queries
Enter SQL select query or Quit : select max(enrolment) from enrolment
438
Enter SQL select query or Quit : select max(enrolment) from enrolment where faculty='Edu'
320
Enter SQL select query or Quit : select avg(enrolment) from enrolment where faculty='Edu'
40.90625
Enter SQL select query or Quit : select title from infor where subjectcode in (select scode from enrolment where enrolment>80 and faculty='Edu')
DBD::Oracle::db prepare failed: ORA-00942: table or view does not exist (DBD ERROR: OCIStmtExecute/Describe) at ./PerlDBI.pl line 69,  line 5.
Invalid sql query; try again
Enter SQL select query or Quit : select title from info where subjectcode in (select scode from enrolment where enrolment>80 and faculty='Edu')
Aboriginal Education
Classroom Management: Creating positive learning environments
Educational Research and Action Learning
Education Foundations 1: Learning and Development
...
Professional Studies 2
Teaching for Diversity
Enter SQL select query or Quit : 

(The numbers that you get for your queries will naturally depend on how many of the enrolment files you have loaded into the data tables.)


Simple CGI in Perl

It is OK for developers to type in SQL select statements - but this is unacceptable to real users. Users, e.g. administrative staff trying to access data on student enrolments, require simple forms that they can fill in with requests for data, and they need well formatted outputs. These days, a web-based solution is almost always the most appropriate.

So in this final part of the assignment you will develop a small web application that will allow such an administrative staff user to access the enrolment data constructed in the preceding part of the assignment. (This exercise is simplified - only read access is required to the data.)

The web application consists of a single static HTML form page (with a little Javascript) and a Perl CGI program that will process submitted requests.

The form will have input fields for faculty, unit, and "year" (subject level):

A simple request form

Each of these inputs has an "Any" option; the faculty input has options for the various faculties; the unit input has the teaching units; the year input has options 100-level, 200-level, 300-level, 400-level, and 900-level.

The form is to have two Javascript functions associated with it. An "on submit" function will check that some selection has been made in the input fields - they cannot all be "Any". If no selections were made, this Javascript checking function puts up an alert dialog and prevents submission of invalid data.

Error report if no selection made

A second Javascript function will be invoked for an "onChanged" event in the faculty selection - this function will adjust the contents of the unit selection to include only appropriate teaching units.

Unit options adjusted to match faculty selection

The Perl processing script referenced by the form should be placed in your public_html directory along with the HTML form page; its file extension should be .cgi not .pl.

Your script should validate the inputs - you must learn to be paranoid, some hacker may submitting hand crafted Post requests with data intended to crash your application. (Hackers apparently have infinite time to waste on such things.) The checks should include a further test that the inputs do not all have the value "Any" (you cannot rely on client side checks - they reduce innocent errors but don't prevent deliberate attacks). A simple error report page should be generated if the request appears invalid (or if an attempt to connect to the database fails). An SQL query should be composed based on the validated input data; this should retrieve subject names and credit points from the info table, and enrolment details from the enrolment table (this can be done as a single query - ask Google about SQL select join statements). If no data are retrieved for the query (e.g. nothing loaded into database for the chosen faculty)then a simple error report should be generated. If data are retrieved, they should be presented to the user in tabular form as illustrated in the following image:

Results of search

(The EFTS column represents the teaching load for a subject = number-enrolled*credit-points/48.)

Since your script is a CGI program, it will run in some odd environment defined by the web server. The program will need to set the environment variables needed by Oracle. The following environment variable settings appear to be needed:

$ENV{"ORACLE_HOME"} = "/usr/local/instantclient_10_2";
$ENV{"LD_LIBRARY_PATH"} = "/usr/local/instantclient_10_2";
$ENV{"TNS_ADMIN"} = "/usr/local/instantclient_10_2";
$ENV{"ORACLE_SID"} = "csci";
$ENV{"TWO_TASK"} = "csci";

You will also have to remember to set the appropriate usage permissions on the files in your public_html directory (and remember that you should disable access when you have finished testing lest lazy colleagues try to steal your script code!)


Submission

You are to prepare a report detailing your work in a word processor and convert this to a PDF file for submission.

For each section, you will include your Perl code and evidence (captured console I/O, screen shots, etc) for the correct operation of your code. You should include the source HTML of your form page as used in the third part.

The due date for submission will be announced in lectures; currently the submission date is set to be April 24th.

For CSCI399, asignments are submitted electronically via the turnin system. For this assignment you submit your report via the command:

turnin -c csci399 -a 2 A2.pdf

Remember, turnin only works when you are logged in to the main banshee undergraduate server machine.


Marking

This information is primarily for the tutors who will be marking the assignment; but some of you might be interested.

Mark to 0.5 divisions, no 0.1 fractional marks.

Overall report
1 markQuality of presentation and clear organization.
Image link extraction
1 markIt has to use regexs to extract required fields.
Enrolment files
2 marksSensible Perl coding, not too long!
DBI exercise
3 marksThe usual - check the database access code and SQL statements. Look at quality of code - deduct marks if you find it difficult to follow flow of control.
CGIProgram and HTML
3 marksForm and Javascript 1.5, Perl and database query 1.5.