htmlsrpl version 1.11, January 22 1995

Name:

     htmlsrpl.pl - HTML-aware search-and-replace program, with
     either literal strings or regular expressions.  Acts either
     only outside HTML/SGML tags, or only within tags; can be
     restricted to operate only within and/or only outside
     specified elements; can also upper-case tag names.  Runs
     under perl.


Typical use:

   perl htmlsrpl.pl [options] infile.html > outfile.html

  Command-line options have the form "option=value" (without whitespace on
  either side of the `=' character), and all options should precede filename
  arguments on the command line.


Basic command-line options:

  old="..."         String or expression to be replaced.  Must be defined and
                non-null (unless the upcase=1 option is specified).

  new="..."         The new replacement string or expression.  If new= is
                absent or null, the old="..." string is deleted.

  intags=1          If this option is specified on the command line, strings
                within tags are changed, but not text outside of tags.  (The
                default action, if this option is absent, is to only replace
                text outside of tags.)
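
   The difference between the default behavior and intags=1 can be sketched
  in a few lines of Python.  This is only an illustration of the behavior
  described above, not the code of htmlsrpl.pl itself; the function name
  replace_literal and its parameters are hypothetical.

```python
import re

def replace_literal(text, old, new, intags=False):
    """Replace the literal string `old` with `new`, either outside tags
    (the default) or only inside tags (intags=True)."""
    # Splitting on a capturing group yields text at even indices and
    # complete <...> tags at odd indices.
    pieces = re.split(r"(<[^<>]*>)", text)
    out = []
    for i, piece in enumerate(pieces):
        is_tag = (i % 2 == 1)
        if is_tag and intags:
            # Only the tag contents are searched, never the angle brackets.
            out.append("<" + piece[1:-1].replace(old, new) + ">")
        elif not is_tag and not intags:
            out.append(piece.replace(old, new))
        else:
            out.append(piece)
    return "".join(out)
```

   For example, with the input '<a href="x.html">x</a>', replacing "x" by "y"
  changes only the link text by default, and only the href value when
  intags=True is given.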


Element inclusion/exclusion command-line options:

  inside=...        The value of this option is a tagname or a comma-separated
                list of tagnames (e.g. inside=A or inside=b,i).  Search and
                replace operations will only take place in material that is
                contained within all the specified elements.  So if inside=b,i
                has been specified on the command line, only "Text3" in the
                following input file would be subject to search and replace:
                "Text1<B>Text2<I>Text3</I></B>".  The order of inclusion makes
                no difference (so that <B> nested inside <I> would be treated
                exactly the same as <I> nested inside <B>).


  outside=...       Search and replace will only take place outside the tag or
                (comma-separated) list of tags specified with this option.  So
                if outside=b,i is specified, nothing contained within a
                <B>...</B> or <I>...</I> element will be subject to search and
                replace.

  inmost=...        The same as inside=, except that search and replace only
                occurs _immediately_ within the element specified (i.e.
                inmost=b would mean that only "Text2" would be subject to
                search and replace in "Text1<B>Text2<I>Text3</I></B>").

   If more than one of these options is specified, search-and-replace only
  takes place when all the conditions specified in the options are satisfied.
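
   How the three options combine can be pictured as a test applied to the
  list of elements that currently enclose a piece of text.  The following
  Python sketch is an illustration under stated assumptions, not the actual
  htmlsrpl.pl logic: the function name eligible is hypothetical, tag names
  are compared in upper case, and inmost is limited to a single tag name
  because the behavior for a list is not spelled out here.

```python
def eligible(stack, inside=(), outside=(), inmost=None):
    """Decide whether text enclosed by `stack` may be searched and replaced.

    stack   -- names of the enclosing elements, outermost first, upper-case
    inside  -- tag names that must ALL enclose the text
    outside -- tag names of which NONE may enclose the text
    inmost  -- single tag name that must be the immediate container
    """
    if any(name.upper() not in stack for name in inside):
        return False
    if any(name.upper() in stack for name in outside):
        return False
    if inmost is not None and (not stack or stack[-1] != inmost.upper()):
        return False
    return True
```

   In "Text1<B>Text2<I>Text3</I></B>", "Text3" is enclosed by B and I, so
  eligible(["B", "I"], inside=["b", "i"]) is true, while
  eligible(["B", "I"], inmost="b") is false.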

   This program uses a rather simple-minded algorithm for determining what
  is contained within an element.  There is a small list of known non-pairing
  tags (such as <IMG>, <BR>, etc.).  When any opening tag not on this list is
  encountered, it is pushed onto a stack of presently-containing elements.
  When any closing tag is encountered, the most-recently occurring matching
  tagname is removed from the stack, along with everything above it in the
  stack (if no matching opening tag has been encountered, htmlsrpl.pl exits
  with an error -- use the htmlchek program in this package to help find the
  HTML error).  This means, for example, that a <P> element unclosed by a </P>
  will often be considered to extend much farther than it should according
  to the HTML DTD; also, in a list such as "<DL><DT>Text1<DD>Text2</DL>",
  "Text2" is actually considered to be contained within a <DT> element.
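
   The stack behavior just described can be sketched in Python as follows.
  This is a minimal illustration, not the code of htmlsrpl.pl: the function
  name containing_elements is hypothetical, the NON_PAIRING set is an assumed
  subset of the real list (only IMG and BR are named above; HR is a guess),
  and a ValueError stands in for the program's error exit.

```python
import re

# Assumed subset of the known non-pairing tags; the real list lives in
# htmlsrpl.pl and is longer.
NON_PAIRING = {"IMG", "BR", "HR"}

def containing_elements(html):
    """Return a list of (text, stack) pairs, where stack holds the names of
    the elements considered to contain that text, outermost first."""
    stack = []
    result = []
    # Even indices are text between tags, odd indices are tag contents.
    for i, piece in enumerate(re.split(r"<([^<>]*)>", html)):
        if i % 2 == 0:
            if piece:
                result.append((piece, list(stack)))
            continue
        fields = piece.split()
        name = fields[0].upper() if fields else ""
        if name.startswith("/"):
            name = name[1:]
            if name not in stack:
                raise ValueError("closing tag </%s> has no opening tag" % name)
            # Remove the most recent matching element and everything above it.
            top = len(stack) - 1 - stack[::-1].index(name)
            del stack[top:]
        elif name and name not in NON_PAIRING:
            stack.append(name)
    return result
```

   Applied to "<DL><DT>Text1<DD>Text2</DL>", the sketch reports "Text2" as
  contained in DL, DT and DD, which reproduces the <DT> quirk mentioned
  above.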

   Note that when the inside=, inmost=, or outside= options are used
  together with the intags=1 option, a tag is never considered to be
  contained within the element which it itself delimits (i.e. the inclusion
  and exclusion relationships established by a tag come into force at the end
  of the tag if it is an opening tag, and at the beginning of the tag if it
  is a closing tag).  Also, inclusions and exclusions are always calculated
  from the unprocessed input, before any search and replace has taken place.


Regexp command-line options:

  regexp=1          If this option is specified, old="..." is used as a Perl
                 regular expression, rather than as a simple literal string
                 (the default is that both old="..." and new="..." are handled
                 as simple literal strings).  See the Perl documentation for
                 information on regular expressions.  Special characters that
                 are shell metacharacters will have to be quoted on the
                 command line, to protect them from interpretation by the
                 shell.  The `/' character should be escaped by a preceding
                 backslash, or should be written as "\057", since this
                 character is used as the delimiter in the Perl s/.../.../
                 construct.

  regeval=1         If this option is specified, old="..." is used as a
                 regular expression, and new="..." is a statement to be
                 evaluated, as in the Perl s/.../statement/e construct.
                 Special variables such as $`, $&, $', $1 etc. can be used as
                 part of such a statement (remember that the "." operator is
                 used to concatenate string values).  If you use an erroneous
                 expression, you will get a Perl error message (not an htmlsrpl
                 error message), which you will have to interpret using the
                 Perl manual.

  case=1            If this option is specified along with the regexp=1,
                 regeval=1, or delete=1 options, then old="..." is matched
                 without regard to alphabetic case.
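
   For readers who do not know Perl's s/.../.../e construct, a rough Python
  analogue of regeval=1 (with optional case=1) is a substitution whose
  replacement is computed by a function.  This is a different language and
  only an analogy; the name substitute_eval is hypothetical.  Inside the
  function, m.group(0) plays the role of Perl's $&, m.group(1) of $1,
  m.string[:m.start()] of $`, and m.string[m.end():] of $'.

```python
import re

def substitute_eval(text, pattern, func, ignore_case=False):
    """Replace every match of `pattern` in `text` with func(match)."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(pattern, func, text, flags=flags)

def renumber(m):
    # Build the replacement from the captured number, much as a Perl
    # expression would concatenate strings with the "." operator.
    return "Section " + str(int(m.group(1)) + 10)
```

   For instance, substitute_eval("Chapter 1, Chapter 2", r"chapter (\d+)",
  renumber, ignore_case=True) gives "Section 11, Section 12"; without
  ignore_case the lower-case pattern matches nothing and the text is
  returned unchanged.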


Command-line options that affect what is matched against:


  lines=1           If this option is specified, the chunks of the input file
                 that will be individually searched and replaced are those
                 that result when tag beginnings (`<') and tag endings (`>')
                 are boundaries; these chunks can contain embedded newlines.
                 (Remember that in Perl the regexp /./ does not match newline
                 ("\n"); you can use [^\000] instead.)
                     If the lines=1 option is not specified, then the default
                 behavior is that linebreaks are also boundaries; the chunks
                 then do not contain newlines.  The `<' and `>' characters
                 themselves are never part of the chunks matched against (they
                 can only be altered by use of the delete=1 option), except
                 for `>' characters outside of tags, which are treated as
                 ordinary text.
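
   The chunking rules can be illustrated with a short Python sketch.  It is
  an approximation of the behavior described above rather than the program's
  own code; the function name chunks and the ("text"/"tag", string) result
  format are made up for the illustration.

```python
import re

def chunks(html, lines=False):
    """Split `html` into the pieces that would be searched individually.

    Each result is ("text", s) or ("tag", s); the `<' and `>' that delimit
    a tag are never part of any chunk.  With lines=False (the default),
    linebreaks are chunk boundaries too."""
    result = []
    # Even indices are text between tags, odd indices are tag contents.
    for i, piece in enumerate(re.split(r"<([^<>]*)>", html)):
        kind = "tag" if i % 2 else "text"
        parts = [piece] if lines else piece.split("\n")
        result.extend((kind, part) for part in parts if part)
    return result
```

   With the input "a\nb<img\nsrc=x>c", the default yields five chunks ("a",
  "b", "img", "src=x", "c"), while lines=True yields three, two of which
  contain an embedded newline.  A loose `>' in text, as in "1 > 0<br>", stays
  inside its text chunk.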

  slash=1           If this option is specified, then the `/' slash character
                 immediately following the `<' character of a closing tag is
                 not matched against, and is not affected by any search-and-
                 replace operation (except, of course, tag deletion with
                 delete=1).  Implies intags=1.

  delete=1          If this option is specified, old="..." is treated as a
                regexp and is matched against tagnames (not against the entire
                contents of tags); where tagnames match, the entire tag,
                including the surrounding `<' and `>' characters, is deleted.
                This option implies intags=1 and slash=1, and is incompatible
                with regexp=1, regeval=1, or a non-null value of new=.
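
   Tag deletion can be pictured with the following Python sketch.  It is an
  illustration only; the name delete_tags is hypothetical, and the sketch
  assumes that the regexp is matched unanchored against the tagname (so
  "blink" would also match a tag named NOBLINK), which may differ from what
  htmlsrpl.pl actually does.

```python
import re

def delete_tags(html, old, case=False):
    """Delete every tag whose tagname matches the regexp `old`."""
    flags = re.IGNORECASE if case else 0

    def maybe_delete(m):
        fields = m.group(1).split()
        name = fields[0] if fields else ""
        # The leading slash of a closing tag is not matched against
        # (delete=1 implies slash=1).
        if name.startswith("/"):
            name = name[1:]
        # Assumption: unanchored match against the tagname only.
        if name and re.search(old, name, flags):
            return ""            # drop the whole tag, brackets included
        return m.group(0)        # keep the tag unchanged

    return re.sub(r"<([^<>]*)>", maybe_delete, html)
```

   With case=True, delete_tags("a<BLINK>b</blink>c<B>d</B>", "blink") returns
  "abc<B>d</B>"; without it, only the lower-case closing tag is removed.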


Uppercasing option:

  upcase=1          If this option is present, then tag names (the sequence of
                 non-whitespace immediately following a `<' character) are
                 upper-cased.  Does not upper-case tag options (attributes).
                 If old= is null or absent, then this is the only thing that
                 htmlsrpl.pl does, and any other command-line options are
                 ignored.  Otherwise, uppercasing is done first, before any
                 specified search-and-replace operation (and the intags=1
                 option is assumed).  Note that qualifiers like `inmost=' will
                 govern the scope of any search-and-replace operation that
                 accompanies uppercasing, but uppercasing itself always
                 affects all tags.
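
   The effect of upcase=1 on its own can be approximated by a single
  substitution, shown here in Python as an illustration (the function name
  upcase_tag_names is hypothetical, and this is not the program's own code).

```python
import re

def upcase_tag_names(html):
    """Upper-case the run of non-whitespace characters that immediately
    follows each `<'; attributes after the first whitespace are untouched."""
    return re.sub(r"<([^\s<>]+)",
                  lambda m: "<" + m.group(1).upper(),
                  html)
```

   For example, '<a href="x.html">t</a><br>' becomes
  '<A href="x.html">t</A><BR>': the tag names change, while the href
  attribute and the text keep their case.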


Final status message:

  At the end of processing, if no errors occurred, htmlsrpl.pl outputs a
  message to STDERR (either "Changed!" or "Unchanged"), indicating whether
  or not any substitutions were actually performed on the input.


Summary:

  You can do some cute things by playing around with these options.  For
  example, ``perl htmlsrpl.pl regexp=1 old=".*"'' deletes all text (except
  newlines) outside tags, while adding ``intags=1'' to this command line means
  that all text inside tags is deleted instead (leaving ghostly ``<>'' markers
  behind).  The command line ``perl htmlsrpl.pl delete=1 case=1 old="blink"''
  nukes any <BLINK> tags (yay!), while ``perl htmlsrpl.pl slash=1 case=1
  lines=1 regexp=1 old="^blink[^\000]*" new="I"'' will replace all BLINK tags,
  together with any accompanying attributes (possibly spread over multiple
  lines), with the appropriate opening <I> and closing </I> tags.  A command like ``perl
  htmlsrpl.pl outside=cite,h1,h2,h3,h4,h5,h6,title old="Pride and Prejudice"
  new="<cite>Pride and Prejudice</cite>"'' can be used to add mark-up in the
  appropriate places.


Limitations:

  A limitation of this program is that it always treats `<' and `>' in the
  input file as tag-beginning and tag-ending characters (even in comments),
  and terminates prematurely if `<' and `>' are found in inappropriate places
  (except that loose `>' characters outside tags are harmless).  In this case
  a "die" message will be output to STDERR, and the last line of the output
  will be "ERROR!".

  If you misspell an option name, then you'll either get an error when Perl
  tries to open a file with that name, or you'll get an indiscriminate
  "No `old=' string was specified" error message.

  The program processes all files on the command line to STDOUT; to process a
  number of files individually, use the iteration mechanism of your shell; for
  example:

 for a in *.html ; do perl htmlsrpl.pl old=ABC new=XYZ "$a" > "otherdir/$a" ; done

  in Unix sh, or:

 for %a in (*.htm) do call htmlsrpl %a otherdir\%a

  in MS-DOS, where htmlsrpl.bat is the following one-line batch file:

 perl htmlsrpl.pl old=ABC new=XYZ %1 > %2


Author:

  Copyright H. Churchyard 1994, 1995 -- freely redistributable.  This code is
  functional but not very well commented or aesthetic -- sorry!  If you find
  an error in this program, e-mail me at churchh@uts.cc.utexas.edu.

