CSCI399
Autumn Session, 2009
Assignment 2: Perl
You should complete exercise 2 before atempting this assignment.
This assignment aims to introduce you to the Perl language and some of its applications.
The objectives are for you to:
Resources
We will use the standard perl system available on Ubuntu. This
is in /usr/bin/perl; it is a Version 5.8 level Perl.
The perldoc utility has also been installed; you can get documentation
on a module such as the DBI (database access) module via the command perldoc DBI
while information on standard perl functions can be obtained via perldoc perlfunc.
(The /share/cs-pub/399 directory on the main banshee server contains ActivePerl's HTML
version of the Perl documentation.
This is a more convenient format that the normal perldoc
documentation. It includes all the basic chapters on
the language, operators, standard functions, and many of the tutorials.)
The second and third parts of the assignment use tables in the Oracle database. All students who were enrolled in CSCI399 at the end of the official enrollment period will have Oracle accounts; late enrollees will have to apply for accounts.
The Oracle client software has been installed on the Ubuntu machines; the client software includes the sqlplus utility program that is convenient for creating tables and performing other simple database administration tasks.
Oracle uses a number of environment variables. These must be set correctly before you can use sqlplus or use the DBD-Oracle driver in Perl. You will need to modify your ".profile" file on the Ubuntu systems as follows:
PATH=$PATH:/usr/local/instantclient_10_2 SQLPATH=/usr/local/instantclient_10_2 LD_LIBRARY_PATH=/usr/local/instantclient_10_2 ORACLE_HOME=/usr/local/instantclient_10_2 TNS_ADMIN=/usr/local/instantclient_10_2 export PATH export SQLPATH export LD_LIBRARY_PATH export TNS_ADMIN export ORACLE_HOME
When using sqlplus to connect to Oracle, you must specify your userid and the Oracle server name:
sqlplus hsimpson@csci
You should get a connection and be prompted for your password.
(As of January, there was some problem relating to the LD_LIBRARY_PATH environment variable. It was then necessary to define LD_LIBRARY_PATH and export this environment variable from within the current shell session prior to invoking sqlplus or using a Perl program that accessed the database. The setting in the .profile file was not being honored. This may have been fixed by the time you are attempting the assignment.)
You will use the Zend/Oracle/Apache web server that was used in assignment 1. Your Perl CGI scripts will be placed in your public_html directory (or a cgi-bin subdirectory). They will need to be renamed to have the .cgi file extension. Connection to Oracle will require that the program set some environment variables as detailed below.
This exercise involves "regexs" and simple data processing.
A program, "imagelink.pl", is to be developed that will extract information about all image links in a HTML page.
The HTML page is to have been saved, as source, to a local file. The imagelink.pl program is to be invoked with a command line argument identifying the file that is to be processed. It is to find all image links (<img ... >) and to print in tabular form details of the image source (src attribute) and alternative text (alt attribute); if an image has no alt attribute, the string --- should be printed for its alternative text.
The program is also to keep counts of the number of references to each different image type - gif, png, jpg, etc. A summary showing the frequency of each different type is to be printed before the program terminates.
The program is to extract the required data using regex matching. It will probably be easiest to work by first extracting the contents of a complete <img ... > element, and then using other regexs extract src and alt data. (Note that src attributes can occur in tags other than img tags.)
The directory /share/cs-pub/399/A2 contains a couple of data files (simpleimagelinks.txt and pbaseexample.txt) that can be used when testing. The output for the example file pbaseexample.txt should appear similar to the following:
http://i.pbase.com/site/m_pbase.gif photo sharing and upload http://i.pbase.com/site/m_g.gif picture albums http://i.pbase.com/site/m_f.gif photo forums http://i.pbase.com/site/m_s.gif search pictures http://i.pbase.com/site/m_sp.gif popular photos http://i.pbase.com/site/m_h.gif photography help http://i.pbase.com/site/m_l.gif login http://thump01.pbase.com/t3/17/869017/4/108734989.EtdWiXLY.jpg suitcase http://i.pbase.com/t1/24/407324/4/109094680.dsXX9iTN.jpg 4th month http://i.pbase.com/t1/04/283704/4/109094453.TLdualn5.jpg wedding http://i.pbase.com/v3/43/37043/4/50765370.PA151794.jpg Fighters For Sale http://thump01.pbase.com/t1/41/570641/4/109094669.cKfCRPsY.jpg Photo A Day 2009 http://thump01.pbase.com/t3/51/896551/4/109094668.tc3Fvzgx.jpg halloween 2008 http://i.pbase.com/t1/04/389604/4/108747357.e4eaLBRf.jpg Ho Chi Minh City Sketchbook http://thump01.pbase.com/t1/34/778634/4/108630378.RXdsSQj9.jpg Theodore Roosevelt National Park http://i.pbase.com/t1/11/80511/4/109025666.qj6wdoiN.jpg Wild Thing ~ You make my heart sing '09 http://thump01.pbase.com/t1/05/677405/4/103909998.rGtQhn3X.jpg Misc http://i.pbase.com/t1/75/283775/4/109094821.yVJpKHeE.jpg More San Francisco Treats http://i.pbase.com/t1/32/183732/4/109092717.QSzYPxtF.jpg Misc. Photos Frequency of different image types gif 7 jpg 12
It is often easy to get data in tabular format - but not quite the data or the format that is required. The /share/cs-pub/399/A2 directory contains files enrol_inf.txt and subjects_inf.txt that contain details of enrolments in Informatics subjects and subject titles etc.
The enrolment data has the following form
Class Id Year Campus Delivery Session Unit Faculty Subject Code Class Name Enr Wtd Total 158142 2009 Woll On Campus Spring MAAS Inf MATH161 Class 1 26 4 30 158143 2009 Woll On Campus Summer 2009/2010 MAAS Inf MATH162 Class 1 0 0 0 158155 2009 Woll On Campus Spring MAAS Inf MATH212 Class 1 9 2 11
The fields in this file are separated by tabs. The first line contains column identification data; the remaining lines relate to individual subjects. The "Class Id" field is a unique primary key identifier. The fields Enr, and Wtd give the number of enrolled and withdrawn students in a class. The other fields have obvious interpretations. These enrolment data were recorded early in 2009. Most students had not enrolled in Spring session subjects so these have lower than expected enrolments. Many post-graduate students had not yet enrolled and consequently there are quite a few subjects, mainly at graduate level, with zero enrolments.
The subjects_inf.txt file contains data like:
ISIT925 Strategic Network Design 6 ISIT929 Concepts and Issues in Healthcare Computing 6 ISIT930 Introduction to Health Informatics 6
The fields are again separated by tab characters; they contain the subject code, name, and number of credit points. It is possible that the file may contain duplicate entries and there may be omissions where a "Subject Code" that appears in the enrolment file does not have an entry in the subject information file.
You are to implement a program 'enrolmentreport.pl' that:
The following fragment shows the expected output for the given data files:
Don't know about GCT9301 ... Don't know about GCT9344 158161 MATH283 395 Mathematics IIE for Engineers Part 1 158136 MATH141 297 Foundations of Engineering Mathematics 158138 MATH142 244 Essentials of Engineering Mathematics 158169 STAT151 235 Fundamentals of Biostatistics 162209 ISIT102 173 Information Systems 158134 MATH121 144 Discrete Mathematics 158146 MATH187 128 Mathematics 1: Algebra and Differential Calculus 161743 CSCI212 121 Interacting Systems ... 162373 ISIT218 51 Systems Design and Human Computer Interaction
Perl is very useful when you require a program to perform a few simple database operations. Generally, you can "whip up" a Perl program for DB access much faster than an equivalent Java program (C/C++? - forget it!). Database usage is similar to Java's JDBC, though the names of the various components are changed. You get a "database handle" (~java.sql.Connection), use this to prepare statement handles (~java.sql.Statement or java.sql.PreparedStatment) that have SQL that can be executed. A "select" operation returns a form of iterator that allows you to obtain successive rows.
Almost all web applications involve database access. It is often worthwhile writing some Perl scripts to check out the operations and get the SQL queries right before you implement the main web application.
In this part of the assignment you create and populate tables for the subject information and enrolment data and perform some simple selection operations on these tables. Your table definition can be as follows (these are for Oracle):
DROP TABLE INFO; DROP TABLE ENROLMENT; CREATE TABLE INFO ( SUBJECTCODE VARCHAR2(10) primary key, TITLE VARCHAR2(256), CREDITPTS NUMBER(4) ); CREATE TABLE ENROLMENT ( SUBJECTID VARCHAR2(10) primary key, SCODE VARCHAR2(10), ENROLMENT NUMBER(4), SESSIONON VARCHAR2(32), UNIT VARCHAR2(8), FACULTY VARCHAR2(8), CONSTRAINT fk_subject FOREIGN KEY(SCODE) REFERENCES INFO(SUBJECTCODE) ); COMMIT;
Your Perl application will:
The directory /share/cs-pub/399/A2 contains a number of subject and enrolment files for different faculties - use these to populate your tables.
The following outputs provide an illustration of the operation of your Perl DBI program:
$ ./PerlDBI.pl enrol_law.txt subjects_law.txt Connected to oracle Uploading data Enrolments file enrol_law.txt Subject info file subjects_law.txt Don't know about CMP 905 Don't know about LLB 337 Running search queries Enter SQL select query or Quit : select count(*) from enrolment 1003 Enter SQL select query or Quit : select max(enrolment) from enrolment 438 Enter SQL select query or Quit : select max(enrolment) from enrolment where faculty='Edu' Enter SQL select query or Quit : quit Disconnected from oracle $ ./PerlDBI.pl enrol_edu.txt subjects_edu.txt Connected to oracle Uploading data Enrolments file enrol_edu.txt Subject info file subjects_edu.txt Don't know about EDGS920 Don't know about EDGC802 Running search queries Enter SQL select query or Quit : select max(enrolment) from enrolment 438 Enter SQL select query or Quit : select max(enrolment) from enrolment where faculty='Edu' 320 Enter SQL select query or Quit : select avg(enrolment) from enrolment where faculty='Edu' 40.90625 Enter SQL select query or Quit : select title from infor where subjectcode in (select scode from enrolment where enrolment>80 and faculty='Edu') DBD::Oracle::db prepare failed: ORA-00942: table or view does not exist (DBD ERROR: OCIStmtExecute/Describe) at ./PerlDBI.pl line 69,line 5. Invalid sql query; try again Enter SQL select query or Quit : select title from info where subjectcode in (select scode from enrolment where enrolment>80 and faculty='Edu') Aboriginal Education Classroom Management: Creating positive learning environments Educational Research and Action Learning Education Foundations 1: Learning and Development ... Professional Studies 2 Teaching for Diversity Enter SQL select query or Quit :
(The numbers that you get for your queries will naturally depend on how many of the enrolment files you have loaded into the data tables.)
It is OK for developers to type in SQL select statements - but this is unacceptable to real users. Users, e.g. administrative staff trying to access data on student enrolments, require simple forms that they can fill in with requests for data, and they need well formatted outputs. These days, a web-based solution is almost always the most appropriate.
So in this final part of the assignment you will develop a small web application that will allow such an administrative staff user to access the enrolment data constructed in the preceding part of the assignment. (This exercise is simplified - only read access is required to the data.)
The web application consists of a single static HTML form page (with a little Javascript) and a Perl CGI program that will process submitted requests.
The form will have input fields for faculty, unit, and "year" (subject level):
Each of these inputs has an "Any" option; the faculty input has options for the various faculties; the unit input has the teaching units; the year input has options 100-level, 200-level, 300-level, 400-level, and 900-level.
The form is to have two Javascript functions associated with it. An "on submit" function will check that some selection has been made in the input fields - they cannot all be "Any". If no selections were made, this Javascript checking function puts up an alert dialog and prevents submission of invalid data.
A second Javascript function will be invoked for an "onChanged" event in the faculty selection - this function will adjust the contents of the unit selection to include only appropriate teaching units.
The Perl processing script referenced by the form should be placed in your public_html directory along with the HTML form page; its file extension should be .cgi not .pl.
Your script should validate the inputs - you must learn to be paranoid, some hacker may submitting hand crafted Post requests with data intended to crash your application. (Hackers apparently have infinite time to waste on such things.) The checks should include a further test that the inputs do not all have the value "Any" (you cannot rely on client side checks - they reduce innocent errors but don't prevent deliberate attacks). A simple error report page should be generated if the request appears invalid (or if an attempt to connect to the database fails). An SQL query should be composed based on the validated input data; this should retrieve subject names and credit points from the info table, and enrolment details from the enrolment table (this can be done as a single query - ask Google about SQL select join statements). If no data are retrieved for the query (e.g. nothing loaded into database for the chosen faculty)then a simple error report should be generated. If data are retrieved, they should be presented to the user in tabular form as illustrated in the following image:
(The EFTS column represents the teaching load for a subject = number-enrolled*credit-points/48.)
Since your script is a CGI program, it will run in some odd environment defined by the web server. The program will need to set the environment variables needed by Oracle. The following environment variable settings appear to be needed:
$ENV{"ORACLE_HOME"} = "/usr/local/instantclient_10_2";
$ENV{"LD_LIBRARY_PATH"} = "/usr/local/instantclient_10_2";
$ENV{"TNS_ADMIN"} = "/usr/local/instantclient_10_2";
$ENV{"ORACLE_SID"} = "csci";
$ENV{"TWO_TASK"} = "csci";
You will also have to remember to set the appropriate usage permissions on the files in your public_html directory (and remember that you should disable access when you have finished testing lest lazy colleagues try to steal your script code!)
You are to prepare a report detailing your work in a word processor and convert this to a PDF file for submission.
For each section, you will include your Perl code and evidence (captured console I/O, screen shots, etc) for the correct operation of your code. You should include the source HTML of your form page as used in the third part.
The due date for submission will be announced in lectures; currently the submission date is set to be April 24th.
For CSCI399, asignments are submitted
electronically via the turnin system. For
this assignment you submit your report via the command:
turnin -c csci399 -a 2 A2.pdf
Remember, turnin only works when you are logged in to the
main banshee undergraduate server machine.
This information is primarily for the tutors who will be marking the assignment; but some of you might be interested.
Mark to 0.5 divisions, no 0.1 fractional marks.