Configuring a WWW Site
Just about everyone on the planet knows about the World Wide Web. It's
the most talked-about aspect of the Internet. With the WWW's
popularity, more system users are getting into the game by setting up
their own WWW servers and home pages. Sophisticated packages now act as
Web servers for many operating systems, although UNIX users have
always done it from scratch. Linux, based on UNIX, has the software
necessary to provide a Web server readily available.
You don't need fancy software to set up a Web site, only a little time
and the correct configuration information. That's what this chapter is
about. The chapter looks at how you can set up a World Wide Web server
on your Linux system, whether for friends, your LAN, or the Internet
as a whole.
The major aspect of the Web that attracts users and makes it so
powerful, aside from its multimedia capabilities, is the use of
hyperlinks. A hyperlink lets you move with only one mouse click from
document to document, site to site, graphic to movie, and so on. All
the instructions of the move are built into the Web code.
There are two aspects to the World Wide Web: server and client. Client
software is the best known, such as Mosaic and Netscape. However,
there are many different Web client packages available other than
these two, some specifically for X or Linux.
Web Server Software
There are three primary Web server packages that will run under
Linux. They are from NCSA, CERN, and Plexus. The most readily
available system is from NCSA, which also provides Mosaic. NCSA's Web
system is fast and quite small, can run under inetd or as a
stand-alone daemon, and provides pretty good security. This chapter
uses NCSA's Web software, although you can easily use either of the
other two packages instead (some of the configuration information will
be different, of course).
______________________________________________________________
NOTE: The Web server software is available via anonymous FTP or WWW
from one of the three sites listed below, depending on the type
of server software you want:
CERN:   ftp://info.cern.ch/pub/www/bin (FTP)
NCSA:   ftp.ncsa.uiuc.edu (FTP)
        http://hoohoo.ncsa.uiuc.edu (WWW)
Plexus: ftp://austin.bsdi.com/plexus/2.2.1/dist/Plexus.html (WWW)
______________________________________________________________
The NCSA Web software is available for Linux in both compiled and
source code forms. Using the compiled version is much easier because
you don't have to configure and compile the source code for the PC and
Linux platforms. The binaries are often provided compressed and
tarred, so you will have to uncompress and then extract the tar
library. Alternatively, many CD-ROMs provide the software ready-to-go.
If you do obtain the compressed form of the Web server software,
follow the installation or readme files to place the Web software in
the proper location.
Unpacking the Web Files
If you have obtained a library of source code or binaries from an FTP
or BBS site, you will probably have to untar and uncompress them
first. (Check with any README files before you do this, if there are
any; otherwise, you may be doing this step for nothing.) Usually, you
proceed by creating a directory for the Web software, then changing
into it and expanding the library with a command like this:
zcat httpd_X.X_XXX.tar.Z | tar xvf -
The software is often named by the release and target platform, such
as httpd_1.5_linux.tar.Z. Use whatever name your tar file has in the
above line. Installation instructions are sometimes in a separate
compressed file, such as Install.txt.Z, which you will have to obtain
and uncompress with the following command:
zcat Install.txt.Z
Make sure you are in the target directory when you issue the commands
above, though, or you will have to move a lot of files. You can place
the files anywhere, although it is often a good idea to create a
special area for the Web software that can have its permissions
controlled, such as /usr/web, /var/web, or a similar name.
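The unpacking steps above can be sketched as a small shell function.
This is only an illustration, not part of the NCSA distribution: the
function name unpack_web_archive and the example paths are
hypothetical, and the sketch uses gzip -dc, which on most Linux
systems reads the same .Z files that zcat does.

```shell
# Hypothetical helper: create the target directory, change into it,
# and expand a compressed tar archive there.
# The archive must be given as an absolute path, because the function
# changes directory before reading it.
unpack_web_archive() {
    archive=$1      # for example /tmp/httpd_1.5_linux.tar.Z
    target=$2       # for example /usr/web
    mkdir -p "$target" || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$target" && gzip -dc "$archive" | tar xf - )
}
```

A call such as unpack_web_archive /tmp/httpd_1.5_linux.tar.Z /usr/web
would leave the distribution's subdirectories under /usr/web.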
Once you have extracted the contents of the Web server distribution
and the library files are in their proper directories, you can look at
what has been created automatically. You should have the following
subdirectories:
cgi-bin   Common gateway interface binaries and scripts
conf      Configuration files
icons     Icons for home pages
src       Source code and (sometimes) executables
support   Support applications
Compiling the Web Software
If you don't have to modify the source code and recompile it under
Linux, you can skip the configuration details mentioned in the rest of
this section. On the other hand, you may want to know what is
happening in the source code anyway, because you can better understand
how Linux works with the Web server code. If you obtained a generic,
untailored version of the NCSA Web server, you will have to configure
the software.
Begin by editing the src/Makefile file to specify your platform. You
have to check several variables for proper information:
AUX_CFLAGS   Uncomment the entry for Linux (usually identified by
             comment lines and symbols)
CC           Specify the name of the C compiler (usually cc or gcc)
EXTRA_LIBS   Add any extra libraries that need to be linked in (none
             are required for Linux)
LFLAGS       Add any flags you need for linking (none are required
             for most Linux linkers)
Finally, look for the CFLAGS variable. Some of the values for CFLAGS
may be set already. Valid values for CFLAGS are as follows:
-DSECURE_LOGS   Prevents CGI scripts from interfering with any log
                files written by the server software
-DMAXIMUM_DNS   Provides a more secure resolution system at the cost
                of performance
-DMINIMAL_DNS   Doesn't allow reverse name resolution, but speeds up
                performance
-DNO_PASS       Prevents multiple children from being spawned
-DPEM_AUTH      Enables PEM/PGP authentication schemes
-DXBITHACK      Provides a service check on the execute bit of an
                HTML file
-O2             Is an optimizing flag
It is unlikely that you will need to change any of the flags in the
CFLAGS section, but at least you now know what they do. Once you have
checked the src/Makefile for its contents, you can compile the server
software. Change into the src directory and issue the command:
make
If you see error messages, check the Makefile carefully. The
most common problem is the wrong platform (or multiple platforms)
selected in the file.
Once the Web server software has been compiled, you have to compile
the support applications, too. Change into the support directory and
check the Makefile there. Once it is correct, issue the make command
again. Then, change to the cgi-src directory and repeat the process.
______________________________________________________________
NOTE: Some versions of NCSA Web server software (notably releases
1.4 or later) enable you to compile all three sets of source code
with the command make sgi from the Web directory.
______________________________________________________________
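The three-directory build sequence described above can be sketched as
a loop. This is a minimal sketch under stated assumptions: the
function name build_web_server is hypothetical, the MAKE override is
a convenience added here, and each Makefile is assumed to have been
edited for your platform already.

```shell
# Hypothetical helper: run make in each source directory in turn and
# stop at the first failure.
# MAKE can be set to another make program; it defaults to plain make.
build_web_server() {
    webroot=$1      # top of the unpacked distribution, e.g. /usr/web
    for dir in src support cgi-src; do
        echo "Building in $webroot/$dir"
        ( cd "$webroot/$dir" && ${MAKE:-make} ) || {
            echo "Build failed in $dir; check $dir/Makefile" >&2
            return 1
        }
    done
}
```

Stopping at the first failure keeps the error messages for the
directory that needs attention at the bottom of the screen.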
Configuring the Web Software
Once the software is in the proper directories and compiled for your
platform, it's time to configure the system. Begin with the
httpd.conf-dist file. This file handles the httpd server daemon.
Before you edit the file, you have to decide whether you will install
the Web server software to run as a daemon, or whether it will be
started by inetd. If you anticipate a lot of use, run the software as
a daemon. For occasional use, either is acceptable.
Several variables in httpd.conf-dist need to be checked or have values
entered for them. All the variables in the configuration file use
the following syntax:
variable value
Note that there is no equal sign or special symbol between the
variable name and the value assigned to it. For example, a few lines
would look like this:
FancyIndexing on
HeaderName Header
ReadmeName README
Where pathnames or filenames are supplied, they are usually relative
to the Web server directory, unless explicitly declared as a full
pathname. The variables you need to supply in httpd.conf-dist are as
follows:
* The AccessConfig variable is the location of the access.conf
configuration file. The default value is conf/access.conf. You can
use either absolute or relative pathnames.
* The AgentLog variable is the log file to record details of
transactions. The default value is logs/agent_log.
* The ErrorLog variable is the name of the file to record errors in.
The default is logs/error_log.
* The Group variable is the Group ID the server should run as (used
only when server is running as a daemon). It can be either a group
name or group ID number. If it is a number, it must be preceded by
#. The default is #-1.
* The IdentityCheck variable is used to verify that a remote user
has logged in as himself/herself. Not many systems support this
variable. The default is Off.
* The MaxServers variable is the maximum number of children allowed.
* The PidFile variable is the file in which you want to record the
process ID of each httpd copy. The default is logs/httpd.pid.
Used only when the server is in daemon mode.
* The Port variable is the port number httpd should listen to for
clients. Default port is 80. If you don't want the Web server to
be generally available, choose another number.
* The ResourceConfig variable is the path to the srm.conf file,
usually conf/srm.conf.
* The ServerAdmin variable is the e-mail address of the
administrator.
* The ServerName variable is the domain name of the server.
* The ServerRoot variable is the path above which users cannot move
(usually the Web server top directory or /usr/local/etc/httpd).
* The ServerType variable is either standalone (daemon) or inetd.
* The StartServers variable is the number of server processes
started when the daemon comes up (MaxServers sets the upper limit).
* The TimeOut variable is the amount of time in seconds to wait for
a client request, after which it is disconnected (default is 1800,
which should be reduced).
* The TransferLog variable is the path to the location of the logs.
The default is logs/access_log.
* The TypesConfig variable is the path to the location of the MIME
configuration file. The default is conf/mime.types.
* The User variable defines the user ID the server should run as
(only valid if running as daemon). It can be a name or number, but
it must be preceded by # if it is a number. The default is #-1.
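To make the variable-and-value syntax concrete, the following sketch
writes a small httpd.conf into a scratch directory and reads one
value back. The values are illustrative only: the wizard.tpci.com
names, the nobody user, the TimeOut of 600, and the conf_value helper
are assumptions for this example, and the ServerType keyword is
written as the single word standalone, which you should check against
your own httpd.conf-dist.

```shell
# Scratch directory standing in for the Web server directory.
webroot=$(mktemp -d)
mkdir -p "$webroot/conf"

# Every line is "variable value" with no equal sign between them.
cat > "$webroot/conf/httpd.conf" <<'EOF'
ServerType standalone
Port 80
User nobody
Group #-1
ServerAdmin webmaster@wizard.tpci.com
ServerRoot /usr/local/etc/httpd
ServerName www.wizard.tpci.com
TimeOut 600
ErrorLog logs/error_log
TransferLog logs/access_log
PidFile logs/httpd.pid
EOF

# Hypothetical helper: print the value assigned to one variable.
conf_value() {
    awk -v name="$2" '
        $1 == name { $1 = ""; sub(/^ +/, ""); print; exit }
    ' "$1"
}

conf_value "$webroot/conf/httpd.conf" Port    # prints 80
```

Reading a value back this way is a quick check that a hand-edited
file still follows the one-variable-per-line layout.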
The next configuration file to check is srm.conf, which is used to
handle the server resources. The variables that have to be checked or
set in the srm.conf file are as follows:
* The AccessFileName variable is the file that gives access
permissions (default is .htaccess).
* The AddDescription variable provides a description of a type of
file. For example, an entry could be AddDescription "PostScript
file" *.ps. Multiple entries are allowed.
* The AddEncoding variable indicates that filenames with a specified
extension are encoded somehow, such as AddEncoding compress Z.
Multiple entries are allowed.
* The AddIcon variable gives the name of the icon to display for
each type of file.
* The AddIconByEncoding variable is the same as AddIcon, but it
chooses the icon from the file's encoding.
* The AddIconByType variable uses MIME type to determine the icon to
use.
* The AddType variable overrides MIME definitions for extensions.
* The Alias variable substitutes one pathname for another, such as
Alias data /usr/www/data.
* The DefaultType variable is the default MIME type, usually
text/html.
* The DefaultIcon variable is the default icon to use when
FancyIndexing is on (default is /icons/unknown.xbm).
* The DirectoryIndex variable is the filename to return when the URL
names only your server or a directory rather than a specific file.
The default value is index.html.
* The DocumentRoot variable is the absolute path to the httpd
document directory. The default is /usr/local/etc/httpd/htdocs.
* The FancyIndexing variable adds icons and filename information to
the file list for indexing. The default is on. (This option is for
backward compatibility with the first release of HTTP.)
* The HeaderName variable is the filename used at the top of a list
of files being indexed. The default is HEADER.
* The IndexOptions variable specifies the indexing parameters
(including FancyIndexing, IconsAreLinks, ScanHTMLTitles,
SuppressLastModified, SuppressSize, and SuppressDescription).
* The OldScriptAlias variable is the same as Alias. It is included
for backward compatibility with HTTP 1.0.
* The ReadmeName variable is the footer file attached to directory
indexes. The default is README.
* The Redirect variable maps a path to a new URL.
* The ScriptAlias variable is similar to Alias, but it's for
scripts. The default is /usr/local/etc/httpd/cgi-bin.
* The UserDir variable is the directory users can use for httpd
access. The default is public_html. This variable is usually set
to a user's home page directory, or you can set it to DISABLED.
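A short srm.conf built from the defaults listed above might look like
the file this sketch writes. It is an illustration under stated
assumptions: the two-argument form of ScriptAlias (a URL prefix
followed by a directory) and the document_root_is_absolute helper are
not taken from the list above, so compare them with your own
srm.conf-dist before relying on them.

```shell
# Scratch file standing in for conf/srm.conf.
conf=$(mktemp)

cat > "$conf" <<'EOF'
DocumentRoot /usr/local/etc/httpd/htdocs
UserDir public_html
DirectoryIndex index.html
FancyIndexing on
HeaderName HEADER
ReadmeName README
AccessFileName .htaccess
DefaultType text/html
ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-bin/
EOF

# Hypothetical check: DocumentRoot is described as an absolute path,
# so complain if the configured value does not start with a slash.
document_root_is_absolute() {
    root=$(awk '$1 == "DocumentRoot" { print $2; exit }' "$1")
    case $root in
        /*) return 0 ;;
        *)  echo "DocumentRoot '$root' is not absolute" >&2
            return 1 ;;
    esac
}

document_root_is_absolute "$conf" && echo "DocumentRoot looks fine"
```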
The third file to examine and modify is access.conf-dist, which
defines the services available to WWW browsers. Usually, everything is
accessible to a browser, but you may want to modify the file to
tighten security or disable some services not supported on your Web
site. The format of the access.conf-dist file is different from the two
configuration files you saw above. It uses a set of sectioning
directives delineated by angle brackets. The general format of an
entry is:
<Directive Option>
...
</Directive>
Any items between the beginning and ending delimiters (<Directive
Option> and </Directive>, respectively) are directives. It's not quite
that easy
because several variations can exist in the file. The best way to
customize the access.conf-dist file is to follow these steps for a
typical Web server installation:
1. Locate the Options directive and remove the Indexes option. This
step prevents users from browsing the httpd directory. Valid
Options entries are discussed shortly.
2. Locate the first Directory directive and check the path to the
cgi-bin directory. The default path is
/usr/local/etc/httpd/cgi-bin.
3. Locate the second Directory directive, which covers the document
directory named in the srm.conf file, and verify the path. The
default is /usr/local/etc/httpd/htdocs.
4. Find the AllowOverride variable and set it to None (this setting
prevents others from changing the settings). The default is All.
Valid values for the AllowOverride variable are discussed shortly.
5. Find the Limit directive and set to whichever value you want (see
the next list).
The Limit directive controls access to your server. The valid values
for the Limit directive are:
allow     Permits specific hostnames following the allow keyword to
          access the service
deny      Denies specific hostnames following the deny keyword from
          accessing the service
order     Specifies the order in which allow and deny directives are
          evaluated (usually set to deny,allow but can also be
          allow,deny)
require   Requires authentication through a user file specified in
          the AuthUserFile entry
The Options directive can have several entries, all of which have a
different purpose. The default entry for Options is:
Options Indexes FollowSymLinks
The first step of the customization procedure removed the Indexes
entry from the Options directive. These entries all apply
to the directory the Options field appears in. The valid entries for
the Options directive are as follows:
All                    Enables all features
ExecCGI                Specifies that CGI scripts can be executed in
                       this directory
FollowSymLinks         Enables httpd to follow symbolic links
Includes               Enables include files for the server
IncludesNoExec         Enables include files for the server but
                       disables the exec option
Indexes                Enables users to retrieve indexes (doesn't
                       affect precompiled indexes)
None                   No features are enabled
SymLinksIfOwnerMatch   Follows symbolic links only if the user ID
                       matches
The AllowOverride variable is set to All by default, and you should
change this setting. There are several valid values for AllowOverride,
but the recommended setting for most Linux systems is None. The valid
values for AllowOverride are as follows:
* A value of All means unrestricted access.
* The AuthConfig value enables some authentication routines. Valid
values are AuthName (sets the authorization name of the directory),
AuthType (sets the authorization type of the directory, although
there is only one legal value: Basic), AuthUserFile (specifies a file
containing user names and passwords), and AuthGroupFile (specifies
a file containing group names).
* The FileInfo value enables AddType and AddEncoding directives.
* The Limit value enables the Limit directive.
* A value of None means that no access files are allowed.
* The Options value enables the Options directive.
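Putting the Directory, Options, AllowOverride, and Limit pieces
together, a tightened access.conf could look like the file written by
this sketch. Treat it as an example rather than the shipped default:
the .tpci.com domain, the "deny from all" and "allow from" lines, the
GET argument to Limit, and the sections_balanced helper are
assumptions made for illustration.

```shell
# Scratch file standing in for conf/access.conf.
conf=$(mktemp)

cat > "$conf" <<'EOF'
<Directory /usr/local/etc/httpd/cgi-bin>
Options FollowSymLinks
</Directory>

<Directory /usr/local/etc/httpd/htdocs>
Options FollowSymLinks
AllowOverride None
<Limit GET>
order deny,allow
deny from all
allow from .tpci.com
</Limit>
</Directory>
EOF

# Hypothetical check: every opening section needs a matching close.
sections_balanced() {
    awk '
        /^<Directory[ >]/ { d++ }
        /^<\/Directory>/  { d-- }
        /^<Limit[ >]/     { l++ }
        /^<\/Limit>/      { l-- }
        END { exit (d != 0 || l != 0) }
    ' "$1"
}

sections_balanced "$conf" && echo "sections are balanced"
```

With the order set to deny,allow, the deny line is evaluated first
and the allow line then readmits hosts in the named domain.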
After you have done all that, your configuration files should be
properly set. Although the syntax is a little confusing, reading the
default values will show you the proper format to use when changing
entries. Next, you can start the Web server software.
Starting the Web Software
Begin by copying all your *.conf-dist files (modified in the previous
section) to *.conf (a change in the extension only). Copy the files
instead of renaming them so that you have the original .conf-dist file
for future modifications. The server looks for files with the .conf
extension and will ignore .conf-dist files.
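The copy step can be done in one pass with a short loop. This is a
sketch only; the function name copy_dist_files is hypothetical, and
it overwrites any .conf file of the same name that already exists.

```shell
# Hypothetical helper: copy every *.conf-dist file in a directory to
# the matching *.conf name, leaving the originals in place.
copy_dist_files() {
    for dist in "$1"/*.conf-dist; do
        [ -f "$dist" ] || continue    # nothing matched the pattern
        cp "$dist" "${dist%-dist}"    # strip the trailing -dist
    done
}
```

For example, copy_dist_files /usr/web/conf would produce httpd.conf
next to httpd.conf-dist, and likewise for the other two files.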
When your configuration is complete, it's time to try out the Web
server software. In the configuration files, you made a decision as to
whether the Web software will run as a daemon (stand-alone) or be
started from inetd. The startup procedure is a little different for
each method (as you would expect), but both startup procedures can use
one of the following three options on the command line:
* The -d option specifies the absolute path to the server root
directory (used only if the default location is not valid).
* The -f option lists the configuration file to read if it is
different from the default value of httpd.conf.
* The -v option displays the version number.
If you are using inetd to start your Web server software, you need to
make a change to the /etc/services file to enable the Web software.
Add a line like this to the /etc/services file:
http port/tcp
In this line, port is the port number used by your Web server software
(usually 80).
Next, modify the /etc/inetd.conf file to include the startup commands
for the Web server:
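http stream tcp nowait nobody /usr/web/httpd httpd
The first entry must match the service name you added to
/etc/services; the last two entries are the path to the httpd binary
and the name it runs under. Once this is done,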
restart inetd by killing the inetd process or by rebooting your
system, and the service should be available through whatever port you
specified in /etc/services.
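Before restarting inetd, it can help to confirm that the services
file really carries the entry you expect. The following is a minimal
sketch; the function name service_port is hypothetical, and it only
understands the simple "name port/tcp" layout shown above.

```shell
# Hypothetical helper: print the TCP port that a services-format file
# assigns to a service name, or print nothing if there is no entry.
service_port() {
    awk -v svc="$2" '
        /^#/ { next }                       # skip comment lines
        $1 == svc && $2 ~ /\/tcp$/ {
            split($2, parts, "/")
            print parts[1]
            exit
        }
    ' "$1"
}
```

Running service_port /etc/services http should print the port number
you entered; no output means the line is missing or mistyped.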
If you are running the Web server software as a daemon, you can start
it at any time from the command line with the following command:
httpd
Even better, add the startup commands to the proper rc startup files.
The entry usually looks like this:
# start httpd
if [ -x /usr/web/httpd ]
then
/usr/web/httpd
fi
You should substitute the proper paths for the httpd binary, of
course. Rebooting your machine should start the Web server software on
the default port number.
To test the Web server software, use any Web browser and enter a URL
like this:
http://machinename
Replace machinename with the name of your Web server. If you see the
contents of the root Web directory or the index.html file, all is
well. Otherwise, check the log files and configuration files for clues
as to the problem.
If you haven't loaded a Web browser yet, you can still check whether
the Web server is running by using telnet. Issue a command like this:
telnet www.wizard.tpci.com 80
Substitute the name of your server (and your Web port number if
different from 80). Once the connection is made, type the HEAD
request shown in the third line below and press Enter twice. You
should see output similar to this if the Web server is responding
properly:
Connected to wizard.tpci.com
Escape character is '^]'.
HEAD / HTTP/1.0
HTTP/1.0 200 OK
You should also get some more lines showing details about the date and
content. You may not be able to access anything, but this shows that
the Web software is responding properly.
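If you want to script this check, the status line is the only part
you need to inspect. The sketch below is an assumption-laden
illustration: the function name response_ok is hypothetical, and it
expects the server's reply on standard input from whatever client you
have available.

```shell
# Hypothetical helper: read an HTTP response on standard input and
# succeed only if the first line reports status code 200.
response_ok() {
    IFS= read -r status_line || [ -n "$status_line" ] || return 1
    # Servers end header lines with CR LF; drop the carriage return.
    status_line=$(printf '%s' "$status_line" | tr -d '\r')
    case $status_line in
        HTTP/*" 200 "*|HTTP/*" 200") return 0 ;;
        *) return 1 ;;
    esac
}
```

One possible use, assuming a client such as nc is installed, is
printf 'HEAD / HTTP/1.0\r\n\r\n' | nc www.wizard.tpci.com 80 | response_ok
followed by a test of the exit status.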
Setting Up Your Web Site
Having a server with nothing for content is useless, so you need to
set up the information you will share through your Web system. This
begins with Uniform Resource Locators (URLs), which are search paths
for data files. Anyone using your service only has to know the URL.
You don't need to have anything fancy. If you don't have a special
home page, anyone connecting to your system will get the contents of
the Web root directory's index.html file, or failing that, a directory
listing of the Web root directory. That's pretty boring, though, and
most users want fancy home pages. To write a home page, you need to
use HTML (HyperText Markup Language).
A home page is like a main menu. Many users may not ever see it
because they can enter any of the subdirectories on your system or
obtain files from another Web system through a hyperlink, without ever
seeing your home page. Many users, however, want to start at the top,
and that's where your home page comes in. A home page file is usually
called index.html (or home.html if an index file exists). It usually
is at the top of your Web source directories.
Writing an HTML document is not too difficult. The language uses a set
of tags to indicate how the text is to be treated (such as headlines,
body text, figures, and so on). The tricky part of HTML is getting the
tags in the right place, without extra material on a line. HTML is
rather strict about its syntax, so errors must be avoided to prevent
problems.
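As an illustration of how the tags fit together, the following sketch
writes a very small home page into a scratch directory. The page
text, the docs/index.html link target, and the scratch location are
all made up for the example; on a real server the file would go at
the top of your Web document directory.

```shell
# Scratch directory standing in for the Web document directory.
docroot=$(mktemp -d)

# A minimal home page: a title, one headline, and two paragraphs,
# the second of which contains a hyperlink.
cat > "$docroot/index.html" <<'EOF'
<HTML>
<HEAD>
<TITLE>Welcome to wizard.tpci.com</TITLE>
</HEAD>
<BODY>
<H1>Welcome</H1>
<P>This server is still under construction.</P>
<P>See the <A HREF="docs/index.html">document list</A> for what is
available so far.</P>
</BODY>
</HTML>
EOF

echo "Wrote $docroot/index.html"
```

Each opening tag has a matching closing tag, which is the habit that
prevents most layout problems.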
In the early days of the Web, all documents were written with simple
text editors. As the Web expanded, dedicated Web editors that
understand HTML and the use of tags began to appear. Their popularity
has driven developers to produce dozens of editors, filters, and
utilities, all aimed at making a Web documenter's life easier (and
ensuring that HTML is properly used). HTML editors are
available for many operating systems.
HTML Authoring Tools
You can write HTML documents in many ways: you can use an ASCII
editor, a word processor, or a dedicated HTML tool. The choice of
which you use depends on personal preference and your confidence in
HTML coding, as well as which tools you can obtain easily. Because
many HTML-specific tools have checking routines or filters to verify
that your documents are correctly laid out and formatted, they can be
appealing. They also tend to be more friendly than non-HTML editors.
On the other hand, if you are a veteran programmer or writer, you may
want to stick with your favorite editor and use a filter or syntax
checker afterwards.
One of the best sites to look for new editors and filters is CERN.
Connect to http://info.cern.ch/WWW/Tools and check the document
Overview.html. Also check the NCSA site, accessible at
http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs where the document
faq-software.html contains an up-to-date list of offerings.
You can use any ASCII editor to write HTML pages, including
general-purpose editors such as vi or Emacs. They all enable you to
enter tags into a page of text, but the tags are treated as words with
no special meaning. There is no validity checking performed by simple
editors, as they simply don't understand HTML. There are some
extensions for Emacs and similar full-screen editors that provide a
simple template check, but they are not rigorous in enforcing HTML
styles.
If you want to use a plain editor, you should carefully check your
document for valid use of tags. One of the easiest methods of checking
a document is to import it into an HTML editor that has strong syntax
checking. Another easy method is to simply call up the document on
your Web browser and carefully study its appearance.
You can obtain a dedicated HTML authoring package from some sites,
although they are not as common for Linux as for DOS and Windows. If
you are running both operating systems, you can always develop your
HTML documents in Windows, then import them to Linux. Several popular
HTML tools for Windows are available, such as HTML Assistant, HTMLed,
and HoTMetaL. A few of the WYSIWYG editors are also available for X,
and hence run under Linux, such as HoTMetaL. Some HTML authoring tools
are fully WYSIWYG, and others are character-based. Most offer strong
verification systems for generated HTML code.
For the latest Linux or Windows version of HoTMetaL, try the FTP site:
ftp://ftp.ncsa.uiuc.edu/Web/html/hotmetal.
An alternative to using a dedicated editor for HTML documents is to
enhance an existing WYSIWYG word processor to handle HTML properly.
The most commonly targeted word processors for these extensions are
Word for Windows, WordPerfect, and Word for DOS. Several extension
products are available, of varying degrees of complexity. Most run
under Windows, although a few have been ported to Linux.
The advantage to using one of these extensions is that you retain a
familiar editor and make use of the near-WYSIWYG features it can
provide for HTML documents. Although it can't show you the final
document in Web format, it can be close enough to prevent all but the
most minor problems.
CU_HTML is a template for Microsoft's Word for Windows that gives an
almost WYSIWYG view of HTML documents. CU_HTML is a template, meaning
that it adds its own DLLs to Word to enhance the system. Graphically,
it looks much the same as Word, but with a new toolbar and pull-down
menu item. CU_HTML provides a number of different styles and a toolbar
of often-used tasks. Tasks like linking documents are easy, as are
most tasks that tend to worry new HTML document writers. Dialog boxes
are used for many tasks, simplifying the interface considerably.
The only major disadvantage to CU_HTML is that it can't be used to
edit existing HTML documents because they are not in Word format. When
CU_HTML creates an HTML document, two versions are produced, one in
HTML and the other as a Word .DOC file. Without both, the document
can't be edited. An existing document can be imported, but it loses
all the tags.
Like CU_HTML, ANT_HTML is an extension to Word. ANT_HTML has some
advantages and disadvantages over CU_HTML. The documentation and help
are better with ANT_HTML, and the toolbar is much better. There's also
automatic insertion of opening and closing tags as needed.
However, ANT_HTML requires that any inline GIF images be inserted
instead of using a DLL. This means that you may have to hunt for a
suitable filter. Also, like CU_HTML, ANT_HTML can't handle documents
that were not produced with ANT_HTML.
One system that has gained popularity among Linux users is tkWWW. A
tool for the Tcl language and its Tk extension for X, tkWWW is a
combination of a Web browser and a near-WYSIWYG HTML editor. Although
originally UNIX-based, tkWWW has been ported to several other
platforms, including Windows and Macintosh.
______________________________________________________________
NOTE: tkWWW can be obtained through anonymous ftp to
ftp.aud.alcatel.com in the directory /pub/tcl/extensions. Copies of
Tcl and Tk can be found in several sites depending on the platform
required, although most versions of Linux have Tcl and Tk included
in the distribution set. As a starting point, try anonymous FTP to
ftp.cs.berkeley.edu in the directory /ucb/tcl.
______________________________________________________________
When you create a Web page with tkWWW in editor mode, you can then
flip modes to browser to see the same page properly formatted. In
editor mode, most of the formatting is correct, but the tags are left
visible. This makes for fast development of a Web page.
Unfortunately, tkWWW must rely on Tk for its windowing, which tends to
slow things down a bit on average processors. Also, the browser aspect
of tkWWW is not impressive, using standard Tk frames. However, as a
prototyping tool, tkWWW is very attractive, especially if you know
the Tcl language.
Another option is to use an HTML filter. An HTML filter is a tool that
lets you take a document produced with any kind of editor (including
ASCII text editors) and convert the document to HTML. Filters are
useful when you work in an editor that has its own proprietary format,
such as Word or nroff.
HTML filters are attractive if you want to continue working in your
favorite editor and simply want a utility to convert your document
with tags to HTML. Filters tend to be fast and easy to work with
because they take a filename as input and generate an HTML output
file. The degree of error checking and reporting varies with the tool.
Filters are available for most types of documents, many of which are
available directly for Linux, or as source code that can be recompiled
without modification under Linux. Word for Windows and Word for DOS
documents can be converted to HTML with the CU_HTML and ANT_HTML
extensions mentioned earlier. A few stand-alone conversion utilities
have also begun to appear. The utility WPTOHTML converts WordPerfect
documents to HTML. WPTOHTML is a set of macros for WordPerfect
versions 5.1, 5.2, and 6.0. The WordPerfect filter can also be used
with other word processor formats that WordPerfect can import.
FrameMaker and FrameBuilder documents can be converted to HTML format
with the tool FM2HTML. FM2HTML is a set of scripts that converts Frame
documents to HTML while preserving hypertext links and tables. It also
handles GIF files without a problem. Because Frame documents are
platform-independent, Frame documents developed on a PC or Macintosh
could be moved to a Linux platform and FM2HTML executed there.
______________________________________________________________
NOTE: A copy of FM2HTML is available by anonymous FTP from
bang.nta.no in the directory /pub. The UNIX set is called
fm2-html.tar.v.0.n.m.Z.
______________________________________________________________
LaTeX and TeX files can be converted to HTML with several different
utilities. Quite a few Linux-based utilities are available, including
LATEXTOHTML, which can even handle in-line LaTeX equations and links.
For simpler documents, the utility VULCANIZE is faster but can't
handle mathematical equations. Both LATEXTOHTML and VULCANIZE are Perl
scripts.
______________________________________________________________
NOTE: LATEXTOHTML is available through anonymous FTP from
ftp.tex.ac.uk in the directory pub/archive/support as the file
latextohtml. VULCANIZE can be obtained from the Web site
http://www.cis.upenn.edu in the directory mjd as the file
vulcanize.html.
______________________________________________________________
RTFTOHTML is a common utility for converting RTF format documents to
HTML. Many word processors handle RTF formats, so you can save an RTF
document from your favorite word processor and then run RTFTOHTML
against it.
______________________________________________________________
NOTE: RTFTOHTML is available through anonymous FTP from
ftp.cray.com in the directory src/WWWstuff/RTF. Through the Web,
try http://info.cern.ch/hypertext/WWW/Tools and look for the file
rtftohtml-2.6.html (or a later version).
______________________________________________________________
Maintaining HTML
Once you have written a Web document and it is available to the world,
your job doesn't end. Unless your document is a simple text file, you
will have links to other documents or Web servers embedded. You must
verify these links at regular intervals. Also, the integrity of your
Web pages should be checked at intervals, to ensure that the flow of
the document from your home page is correct.
Several utilities are available to help you check links and to scan
the Web for other sites or documents you may want to provide a
hyperlink to. These utilities tend to go by a number of names, such as
robot, spider, or wanderer. They are all programs that move across
the Web automatically, creating a list of Web links that you can
access. (Spiders are similar to the Archie and Veronica tools for the
Internet, although neither of these covers the Web.)
Although they are often thought of as utilities for users only (to get
a list of sites to try), spiders and their kin are useful for document
authors, too, as they show potentially useful and interesting links.
One of the best known spiders is the World Wide Web Worm, or WWWW.
WWWW enables you to search for keywords or create a Boolean search,
and it can cover titles, documents, and several other search types
(including a search of all known HTML pages).
A similarly useful spider is WebCrawler, which is similar to WWWW
except that it can scan entire documents for matches of any keywords
and display the result in an ordered list from closest match to least
match.
______________________________________________________________
NOTE: A copy of World Wide Web Worm can be obtained from
http://www.cs.colorado.edu/home/mcbryan/WWWW.html. WebCrawler is
available from
http://www.biotech.washington.edu/WebCrawler/WebCrawler.html.
______________________________________________________________
A common problem with HTML documents as they age is that links may
point to files or servers that no longer exist (because either the
locations or the documents have changed). Therefore, it is good
practice to validate the hyperlinks in a document on a regular basis.
A popular hyperlink analyzer is HTML_ANALYZER. It examines each
hyperlink and the contents of the hyperlink to ensure that they are
consistent. HTML_ANALYZER functions by examining a document for all
links, then creating a text file that has a list of the links in it.
HTML_ANALYZER uses the text files to compare the actual link content
to what it should be.
HTML_ANALYZER actually does three tests: it validates the availability
of the documents pointed to by hyperlinks (called validation); it
looks for hyperlink contents that occur in the database but are not
themselves hyperlinks (called completeness); and it looks for a
one-to-one relation between hyperlinks and the contents of the
hyperlink (called consistency). Any deviations are listed for the
user.
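The first of these tests, checking that link targets exist, can be
approximated for local files with a few lines of shell. This sketch
is not HTML_ANALYZER and does not reproduce its method: the function
names list_links and missing_local_links are hypothetical, only
double-quoted HREF attributes are recognized, and links that carry a
scheme (such as http: or ftp:) or begin with a slash are skipped
rather than fetched.

```shell
# Hypothetical helper: list the HREF targets found in an HTML file.
# Putting each tag on its own line first lets sed pick out one HREF
# per line.
list_links() {
    tr '>' '\n' < "$1" |
        sed -n 's/.*[Hh][Rr][Ee][Ff]="\([^"]*\)".*/\1/p'
}

# Hypothetical helper: print each relative link whose target file is
# missing, resolving links against the directory holding the page.
missing_local_links() {
    dir=$(dirname "$1")
    list_links "$1" | while IFS= read -r link; do
        case $link in
            *:*|"#"*|/*) continue ;;    # remote, in-page, or rooted
        esac
        # Ignore any #fragment when looking for the file.
        [ -e "$dir/${link%%#*}" ] || printf '%s\n' "$link"
    done
}
```

Running missing_local_links on each page at regular intervals gives a
quick list of relative links that need repair; remote links still
have to be checked with a tool that can reach the other servers.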
HTML_ANALYZER users should have a good familiarity with HTML, their
operating system, and the use of command-line driven analyzers. The
tool must be compiled using the "make" utility prior to execution.
There are several directories that must be created prior to running
HTML_ANALYZER, and it creates several temporary files when it runs
that are not cleaned up, so this is not a good utility for a novice.