Common Gateway Interface - Oakland University



Common Gateway Interface


Overview


The Common Gateway Interface (CGI) is a standard for interfacing external applications with information servers, such as HTTP or Web servers. A plain HTML document that the Web daemon retrieves is static, which means it exists in a constant state: a text file that doesn't change. A CGI program, on the other hand, is executed in real-time, so that it can output dynamic information.

For example, let's say that you wanted to "hook up" your Unix database to the World Wide Web, to allow people from all over the world to query it. To do this, you create a CGI program that the Web daemon executes to transmit information to the database engine, receive the results, and display them to the client. This is an example of a gateway, and it is where CGI, currently version 1.1, got its origins.

The database example is a simple idea, but most of the time rather difficult to implement. There really is no limit as to what you can hook up to the Web. The only thing you need to remember is that whatever your CGI program does, it should not take too long to process. Otherwise, the user will just be staring at their browser waiting for something to happen.


Specifics


Since a CGI program is executable, it is basically the equivalent of letting the world run a program on your system, which isn't the safest thing to do. Therefore, there are some security precautions that need to be implemented when it comes to using CGI programs. Probably the one that will affect the typical Web user the most is the fact that CGI programs need to reside in a special directory, so that the Web server knows to execute the program rather than just display it to the browser. This directory is usually under direct control of the webmaster, prohibiting the average user from creating CGI programs. There are other ways to allow access to CGI scripts, but it is up to your webmaster to set these up for you. At this point, you may want to contact them about the feasibility of allowing CGI access.

If you have a version of the NCSA HTTPd server distribution, you will see a directory called /cgi-bin. This is the special directory mentioned above where all of your CGI programs currently reside. A CGI program can be written in any language that allows it to be executed on the system, such as:

• C/C++

• Fortran

• PERL

• TCL

• Any Unix shell

• Visual Basic

• AppleScript

It just depends what you have available on your system. If you use a programming language like C or Fortran, you know that you must compile the program before it will run. If you look in the /cgi-src directory that came with the server distribution, you will find the source code for some of the CGI programs in the /cgi-bin directory. If, however, you use one of the scripting languages instead, such as PERL, TCL, or a Unix shell, the script itself only needs to reside in the /cgi-bin directory, since there is no associated source code. Many people prefer to write CGI scripts instead of programs, since they are easier to debug, modify, and maintain than a typical compiled program.

After reading this document, you should have an overall idea of what a CGI program needs to do to function.


How do I get information from the server?

Each time a client requests the URL corresponding to your CGI program, the server will execute it in real-time. The output of your program will go more or less directly to the client.

A common misconception about CGI is that you can send command-line options and arguments to your program, such as

command% myprog -qa blorf

CGI uses the command line for other purposes and thus this is not directly possible. Instead, CGI uses environment variables to send your program its parameters. The two major environment variables you will use for this purpose are:

• QUERY_STRING

QUERY_STRING is defined as anything which follows the first ? in the URL. This information could be added either by an ISINDEX document, or by an HTML form (with the GET method). It could also be manually embedded in an HTML anchor which references your gateway. This string will usually be an information query, for example what the user wants to search for in the archie databases, or perhaps the encoded results of your feedback GET form.

This string is encoded in the standard URL format of changing spaces to +, and encoding special characters with %xx hexadecimal encoding. You will need to decode it in order to use it.

If your gateway is not decoding results from a FORM, you will also get the query string decoded for you onto the command line. This means that each word of the query string will be in a different element of argv. For example, the query string "forms rule" would be given to your program with argv[1]="forms" and argv[2]="rule". If you choose to use this, you do not need to do any processing on the data before using it.

• PATH_INFO

CGI allows for extra information to be embedded in the URL for your gateway which can be used to transmit extra context-specific information to the scripts. This information is usually made available as "extra" information after the path of your gateway in the URL. This information is not encoded by the server in any way.

The most useful example of PATH_INFO is transmitting file locations to the CGI program. To illustrate this, let's say I have a CGI program on my server called /cgi-bin/foobar that can process files residing in the DocumentRoot of the server. I need to be able to tell foobar which file to process. By including extra path information to the end of the URL, foobar will know the location of the document relative to the DocumentRoot via the PATH_INFO environment variable, or the actual path to the document via the PATH_TRANSLATED environment variable which the server generates for you.
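As a rough sketch of the variables described above, the following Python fragment collects QUERY_STRING, PATH_INFO and PATH_TRANSLATED from an environment mapping and decodes an ISINDEX-style query into words. The function name read_request and the sample values are made up for illustration; only the environment variable names come from the text.

```python
import os
from urllib.parse import unquote


def read_request(environ=os.environ):
    """Collect the CGI parameters discussed above from the environment.

    `environ` defaults to the real process environment; a plain dict can be
    passed instead, which makes the function easy to try outside a server.
    """
    raw_query = environ.get("QUERY_STRING", "")
    return {
        # Still URL-encoded: '+' for spaces, %xx for special characters.
        "raw_query": raw_query,
        # Words of an ISINDEX-style query, split on '+' and then %xx-decoded.
        "words": [unquote(word) for word in raw_query.split("+") if word],
        # Extra path after the script name, relative to the DocumentRoot.
        "path_info": environ.get("PATH_INFO", ""),
        # The same path mapped onto the file system by the server.
        "path_translated": environ.get("PATH_TRANSLATED", ""),
    }


# Hypothetical request for /cgi-bin/foobar/docs/index.html?forms+rule%21
sample = {"QUERY_STRING": "forms+rule%21", "PATH_INFO": "/docs/index.html"}
info = read_request(sample)
```

Passing a dictionary instead of the live environment is only a convenience for experimenting; under a real server the default argument would be used.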


How do I send my document back to the client?

I have found that the most common error in beginners' CGI programs is not properly formatting the output so the server can understand it.

CGI programs can return a myriad of document types. They can send back an image to the client, an HTML document, a plaintext document, or perhaps even an audio clip. They can also return references to other documents. The client must know what kind of document you're sending it so it can present it accordingly. In order for the client to know this, your CGI program must tell the server what type of document it is returning.

In order to tell the server what kind of document you are sending back, whether it be a full document or a reference to one, CGI requires you to place a short header on your output. This header is ASCII text, consisting of lines separated by either linefeeds or carriage returns (or both) followed by a single blank line. The output body then follows in whatever native format.

• A full document with a corresponding MIME type

In this case, you must tell the server what kind of document you will be outputting via a MIME type. Common MIME types are things such as text/html for HTML, and text/plain for straight ASCII text.

For example, to send back HTML to the client, your output should read:

Content-type: text/html

<HTML><HEAD>
<TITLE>output of HTML from CGI script</TITLE>
</HEAD><BODY>
<H1>Sample output</H1>
What do you think of this?
</BODY></HTML>

• A reference to another document

Instead of outputting the document, you can just tell the browser where to get the new one, or have the server automatically output the new one for you.

For example, say you want to reference a file on your Gopher server. In this case, you should know the full URL of what you want to reference and output something like:

Content-type: text/html
Location: gopher://httprules.0

Sorry...it moved

Go to gopher instead

Now available at

a new location

on our gopher server.

However, today's browsers are smart enough to automatically take you to the new document without ever showing the above. If you get lazy and don't want to output the above HTML, NCSA HTTPd will output a default one for you to support older browsers.

If you want to reference another file (not protected by access authentication) on your own server, you don't have to do nearly as much work. Just output a partial (virtual) URL, such as the following:

Location: /dir1/dir2/myfile.html

The server will act as if the client had not requested your script, but instead requested /dir1/dir2/myfile.html. It will take care of almost everything, such as looking up the file type and sending the appropriate headers. Just be sure that you output the second blank line.

If you do want to reference a document that is protected by access authentication, you will need to have a full URL in the Location: header, since the client and the server need to re-transact to establish that you have access to the referenced document.

Advanced usage: If you would like to output headers such as Expires or Content-encoding, you can if your server is compatible with CGI/1.1. Just output them along with Location or Content-type and they will be sent back to the client.
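The output format described in this section can be sketched in a few lines of Python. The helper names (cgi_response, cgi_document, cgi_redirect) are invented for this illustration; the only things taken from the text are the Content-type and Location headers and the rule that a blank line separates the header from the body.

```python
def cgi_response(headers, body=""):
    """Join header lines, add the mandatory blank line, then the body."""
    header_block = "\n".join(f"{name}: {value}" for name, value in headers)
    return header_block + "\n\n" + body


def cgi_document(body, content_type="text/html"):
    """A full document with a corresponding MIME type."""
    return cgi_response([("Content-type", content_type)], body)


def cgi_redirect(location):
    """A reference to another document; no body is needed."""
    return cgi_response([("Location", location)])


page = cgi_document("<H1>Sample output</H1>\n")
moved = cgi_redirect("/dir1/dir2/myfile.html")
```

A script would write one of these strings to standard output, for example with print(page, end=""). Extra headers such as Expires could be passed to cgi_response in the same list, assuming the server is compatible with CGI/1.1.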

Where do I get the form data from?

As you now know, there are two methods which can be used to access your forms. These methods are GET and POST. Depending on which method you used, you will receive the encoded results of the form in a different way.

• The GET method

If your form has METHOD="GET" in its FORM tag, your CGI program will receive the encoded form input in the environment variable QUERY_STRING.

• The POST method

If your form has METHOD="POST" in its FORM tag, your CGI program will receive the encoded form input on stdin. The server will NOT send you an EOF at the end of the data. Instead, you should use the environment variable CONTENT_LENGTH to determine how much data to read from stdin.
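A minimal sketch of handling both methods in Python follows. It assumes the server sets REQUEST_METHOD, a standard CGI/1.1 variable that is not discussed in the text above; the function name read_form_data and the simulated request are hypothetical.

```python
import io
import os
import sys


def read_form_data(environ=os.environ, stdin=None):
    """Return the still-encoded form data for a GET or POST request."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # The server sends no EOF, so read exactly CONTENT_LENGTH bytes.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        stream = stdin if stdin is not None else sys.stdin.buffer
        return stream.read(length).decode("ascii", errors="replace")
    # GET: the encoded form input arrives in QUERY_STRING.
    return environ.get("QUERY_STRING", "")


# Simulated POST request: only the first 9 bytes belong to the form data.
fake_env = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": "9"}
fake_stdin = io.BytesIO(b"name=mike<unrelated bytes>")
posted = read_form_data(fake_env, fake_stdin)
```

Reading a fixed number of bytes, rather than reading until end of file, is the point of the example: a read that waits for EOF may never return.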


But what does it all mean? How do I decode the form data?

When you write a form, each of your input items has a NAME attribute. When the user places data in these items in the form, that information is encoded into the form data. Whatever the user enters into an input item is called its value.

Form data is a stream of name=value pairs separated by the & character. Each name=value pair is URL encoded, i.e. spaces are changed into plus signs and some characters are encoded into hexadecimal.

Because others have been presented with this problem as well, there are already a number of programs which will do this decoding for you. The following software packages are available from the CGI archive.

• The Bourne Shell: The AA archie gateway. Contains calls to sed and awk which convert a GET form data string into separate environment variables.

• C: The default scripts for NCSA httpd. While I won't win any awards for verbosity in documenting my code, there are C routines and example programs you can use to translate the query string into a group of structures.

• PERL: The PERL CGI-lib. This package contains a group of useful PERL routines to decode forms.

• PERL5: CGI.pm A perl5 library for handling forms in CGI scripts. With just a handful of calls, you can parse CGI queries, create forms, and maintain the state of the buttons on the form from invocation to invocation.

• TCL: TCL argument processor. This is a set of TCL routines to retrieve form data and place it into TCL variables.

The basic procedure is to split the data by the ampersands. Then, for each name=value pair you get from this, you should URL decode the name, and then the value, and then do what you like with them.
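The procedure just described might look like this in Python. This is only a sketch: decode_form_data is a made-up name, and the URL decoding is delegated to the standard library function urllib.parse.unquote_plus rather than written by hand.

```python
from urllib.parse import unquote_plus


def decode_form_data(data):
    """Split form data on '&', then URL decode each name and value."""
    pairs = []
    for item in data.split("&"):
        if not item:
            continue  # skip empty pieces such as a trailing '&'
        # Split on the first '=' only; a value may itself contain '='.
        name, _, value = item.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs


decoded = decode_form_data("name=Your+name&action=%2B10%25")
```

The result is kept as a list of pairs rather than a dictionary so that repeated names, which forms are allowed to produce, are not lost.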

Writing secure CGI scripts

Any time that a program is interacting with a networked client, there is the possibility of that client attacking the program to gain unauthorized access. Even the most innocent looking script can be very dangerous to the integrity of your system.

With that in mind, we would like to present a few guidelines to making sure your program does not come under attack.


• Beware the eval statement

Languages like PERL and the Bourne shell provide an eval command which allows you to construct a string and have the interpreter execute that string. This can be very dangerous. Observe the following statement in the Bourne shell:

eval `echo $QUERY_STRING | awk 'BEGIN{RS="&"} {printf "QS_%s\n",$1}' `

This clever little snippet takes the query string and converts it into a set of variable set commands. Unfortunately, this script can be attacked by sending it a query string which starts with a ;. See what I mean about innocent-looking scripts being dangerous?

• Do not trust the client to do anything

A well-behaved client will escape any characters which have special meaning to the Bourne shell in a query string and thus avoid problems with your script misinterpreting the characters. A mischievous client may use special characters to confuse your script and gain unauthorized access.

• Be careful with popen and system.

If you use any data from the client to construct a command line for a call to popen() or system(), be sure to place backslashes before any characters that have special meaning to the Bourne shell before calling the function. This can be achieved easily with a short C function.

• Turn off server-side includes

If your server is unfortunate enough to support server-side includes, turn them off for your script directories! Server-side includes can be abused by clients which prey on scripts that directly output things they have been sent.
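To illustrate the advice about popen() and system(), here is a hedged Python sketch. It swaps in a different technique from the backslash escaping described above: the first function avoids the shell entirely by passing an argument list, and the second uses the standard library's shlex.quote, which wraps the value in single quotes. The function names and the sample "attack" string are invented for the example.

```python
import shlex
import subprocess
import sys


def run_with_client_data(fixed_args, client_value):
    """Run a command with one client-supplied argument, bypassing the shell.

    Because the arguments are passed as a list, characters such as ';' or '|'
    in client_value are never interpreted by a shell.
    """
    result = subprocess.run(
        list(fixed_args) + [client_value],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def quoted_for_shell(client_value):
    """Quote a value for the rare case where a shell command line is needed."""
    return shlex.quote(client_value)


# The child process simply prints back the argument it received.
echo_back = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
hostile = "; echo hacked"
echoed = run_with_client_data(echo_back, hostile)
```

Here the hostile string reaches the child program as ordinary data. Neither function is a substitute for validating what the client sent.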

For a more comprehensive summary of security and the World-Wide Web, see the WWW Security FAQ.

Call of a CGI script file

An anchor tag to execute the CGI script dynamic_page on the server is:

|Dynamic page |

When the web server processes a request to fetch a file, if the requested file is in the server's nominated cgi-bin directory, then as long as this file is marked as being executable the script will be run on the server. If the file is not executable then an error will be reported.

The script eventually returns an HTML page or image to be displayed as the result of its execution. When a CGI script file executes it may access environment variables to discover additional information about the process that it is to perform. The first line of the returned data must be:

|Type of returned data |Text                     |
|An HTML page          |Content-type: text/html  |
|A gif image           |Content-type: image/gif  |

A simple CGI script on a Unix-based web server to return a list of the current users who are logged onto the web server is as follows:

|#!/bin/sh |
|echo Content-type: text/html |
|echo |
|echo |
|echo "<HTML>" |
|echo "<HEAD>" |
|echo "</HEAD>" |
|echo "<BODY>" |
|echo "Users logged in are:" |
|echo "<PRE>" |
|who |
|echo "</PRE>" |
|echo "</BODY>" |
|echo "</HTML>" |

Remember:

• The "'s around text with a < or > character.

• On a Unix system, the first line is #!/bin/sh.

• The file has executable permission set.

Note:

The shell command echo writes the rest of the line to the standard output.

The shell command who lists the current users who are logged onto the system.

Allowing users to create their own CGI scripts can lead to security problems on the server.

The major environment variables that can be accessed by the CGI script when it executes are:

|Environment variable |Contains |
|QUERY_STRING |Data sent to the CGI script by its caller. This may be the output from a form, or other dynamically or statically generated data. |
|REMOTE_ADDR |The Internet address of the host machine making the request. |
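As a small illustration of reading these two variables, the Python sketch below builds an HTML page that reports them. The function name environment_page and the sample values are assumptions made for the example. The values are passed through html.escape before being output, since they come from the client.

```python
import html
import os


def environment_page(environ=os.environ):
    """Build a CGI response listing the two variables from the table above."""
    rows = []
    for name in ("QUERY_STRING", "REMOTE_ADDR"):
        # Escape <, > and & so client data cannot inject markup.
        value = html.escape(environ.get(name, ""))
        rows.append(f"{name} = {value}")
    body = "<HTML><BODY><PRE>\n" + "\n".join(rows) + "\n</PRE></BODY></HTML>\n"
    # Header line, blank line, then the document itself.
    return "Content-type: text/html\n\n" + body


if __name__ == "__main__":
    # When run by a web server as a CGI script, print the page to stdout.
    print(environment_page(), end="")
```

Calling environment_page with a dictionary such as {"QUERY_STRING": "q=<b>", "REMOTE_ADDR": "192.0.2.1"} lets the output be inspected without a server.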

A C++ program mas_env.cpp (Other files required: mas_cvo.cpp, t99_type.h ) when run prints many of the environment variables available to a CGI script.

CGI scripts can be written in any language. For example, a CGI script to return the contents of the environment variable QUERY_STRING can be written in Ada 95.

Note:

I used the gcc compiler version 2.7.0 to compile this source code. In particular this compiler recognises the new data type bool.


Decoding data sent to a CGI script

When a form is used, the information collected in the form is sent to the CGI script for processing. When the form uses the GET method, this information is placed in the environment variable QUERY_STRING.

To pass information explicitly to the environment variable QUERY_STRING a modified form of an anchor tag is used. In this modified anchor tag, the data to be sent to the environment variable QUERY_STRING is appended after the URL which denotes the CGI script. The character ? is used to separate the URL denoting the CGI script and the data that is to be sent to the script. For example:

| Link |

The data "name=Your+name&action=find" is placed in the environment variable QUERY_STRING and the CGI script named script is executed.

A C++ class, composed of the specification parse.h and the implementation parse.cpp, is used to extract the individual components of the QUERY_STRING. The header file t99_type.h contains definitions for C++ features not implemented in some compilers. The members of this class are:

|Method |Responsibility |
|Parse |Set the string that will be parsed. |
|set |Set a different string to be parsed. |
|get_item |Return the string associated with the keyword passed as a parameter. If no data, return NULL. |
|get_item_n |Return the string associated with the keyword passed as a parameter. If no data, then return the null string. |

When using the member functions get_item and get_item_n the optional second parameter specifies which occurrence of the string associated with a keyword to return. This is to allow the recovery of information attached to identical keywords. In addition the returned string will have had the following substitutions made on it.

• +

Will be converted to a space.

• %HH

Will be converted to the character whose hexadecimal value is HH.

• ~user

Will be replaced by the full path to the user's home directory, but only if the optional third parameter is true.

Note:

Defining NO_MAP causes the code for ~username processing to be left out. This allows the code to be compiled on machines that do not support the system function map_uname defined in the header file cgi/pwd.h.
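The lookup behaviour described above can be approximated in Python as follows. This is not the C++ Parse class: get_item here is a stand-in built on the standard library's urllib.parse.parse_qs, the occurrence parameter is assumed to count from 1, and the ~user expansion is deliberately left out. The sample query string is made up.

```python
from urllib.parse import parse_qs


def get_item(query, keyword, occurrence=1):
    """Return the decoded value for the given occurrence of keyword, or None.

    parse_qs performs the '+' and %HH substitutions and collects the values
    of identical keywords into a list, in the order they appear.
    """
    values = parse_qs(query, keep_blank_values=True).get(keyword, [])
    if 1 <= occurrence <= len(values):
        return values[occurrence - 1]
    return None


query = "tag=one&action=%2B10%25&tag=two"
first_tag = get_item(query, "tag")
second_tag = get_item(query, "tag", 2)
action = get_item(query, "action")
```

Returning None when nothing is found mirrors the get_item row of the table; a variant that returns an empty string instead would correspond to get_item_n.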

For example, if the QUERY_STRING contained:

|tag=one&name=mike&action=%2B10%25&tag=two&log=~mas/log&tag=three |

Then the following program when compiled and run:

|enum bool { false, true }; |

| |

|#include |

|#include |

| |

|#include "parse.h" |

|#include "parse.cpp" |

| |

|void main() |

|{ |

|char *query_str = getenv("QUERY_STRING"); |

| |

|Parse list( query_str ); |

| |

|cout ................