Using rtftohtml and rtftoweb


(In the following, "rtftohtml" always means rtftohtml extended by rtftoweb.)

To convert a document from RTF (Rich Text Format) to HTML, rtftohtml requires the contents of the RTF file to be formatted with a certain set of paragraph styles. For example, headings at level 1 must be formatted with the paragraph style "heading 1" (which is the built-in default for headings anyway; German heading styles may be called "Überschrift xy", but they too appear in the RTF file as "heading xy"), and lists must be formatted with a paragraph style such as "numbered list". The reason is that rtftohtml needs to know which paragraph styles to map to which HTML tags. This mapping can be customized by editing the file html-trans in rtftohtml's library directory (see section html-trans for more), so that your own paragraph styles are mapped to HTML tags. Although this is not as complicated as it might seem, I personally prefer to adjust my Word documents to use only (or at least mostly) the paragraph styles recognized by rtftohtml by default. In this chapter I will stick to this strategy. See section "Adding paragraph styles" for a few words on how to customize rtftohtml to correctly interpret your own paragraph styles.

To make the creation and preparation of Word documents that are to be converted to HTML as easy as possible, I have included a style file for Microsoft Word 6.0, called rtftoweb.dot, in the rtftoweb tar file. Section "A .dot file for WinWord" describes the usage of this file in more detail.

Supplying a title

To determine the HTML-Title for the created HTML-Files (the text between the <title> and </title> tags), rtftohtml looks for the \title-token inside the \info-group of the RTF-File. Thus you should give your RTF-Documents a short, descriptive title in the respective dialog box of your word processor (should be called something like "File information").
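To check in advance whether an RTF file carries a title, a small script can look for the \title group. The following Python sketch is an illustration only and is not part of rtftohtml: the function name find_rtf_title is hypothetical, the search does not verify that the group sits inside \info, and the title text is assumed to contain no nested groups.

```python
import re

def find_rtf_title(rtf_text):
    """Return the text of the first {\\title ...} group, or None if absent."""
    # Matches a group of the form {\title Some text}; nested groups inside
    # the title are not handled by this simple pattern.
    match = re.search(r"\{\\title\s+([^{}]*)\}", rtf_text)
    if match is None:
        return None
    return match.group(1).strip()

sample = r"{\rtf1\ansi{\info{\title My work of art}{\author A. N. Author}}Hello}"
print(find_rtf_title(sample))
```

If the function returns None, the title has probably not been set in the word processor.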

Another way to specify the document title is via the -T command line option. For example:

	rtftohtml -T "My work of art" art.rtf

Note that rtftohtml also inserts this title into the first generated HTML file as a level 1 heading. You should therefore usually delete the very first heading from your RTF document (or at least assign a different paragraph style to that line) and use its text as the document title. Otherwise rtftoweb would interpret the headline of your RTF document as a level 1 heading at which it should split the output.

Character styles

rtftohtml automatically recognizes and converts bold, italic and underlined text. If a certain range of text is written using a monospaced font such as Courier, it also automatically creates monospaced HTML-output for that range. What fonts are considered to be monospaced can be configured in the file html-trans in section .TMatch ("monospace fonts -> tt"). By default the fonts "Courier", "Courier New" and "Palatino" are expected to be monospaced.

If you get warning messages such as "no output translation for ..." when running rtftohtml, you can either replace the offending character with a less exotic one in your RTF file or add a translation to the end of rtftohtml's library file html-map, such as "character translation".

The newline character (created by Shift-Return) will be automatically converted to the corresponding HTML tag, as will the non-breaking space (created by Control-Shift-Space).

Headings

Headings must be formatted with a paragraph style like "heading 1", "heading 2" etc. (or "Überschrift 1" etc. in German Word) to be automatically recognized by rtftohtml. rtftohtml uses these styles to determine when it should split the HTML file. The heading level at which splitting should take place can be configured by the command line switch -hlevel (see section Command line options). A heading that contains no text (i.e. an empty one) is ignored by rtftohtml.
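The splitting rule can be pictured with a short Python sketch. It is a simplified model and not rtftohtml's actual code: the document is represented as a list of (style, text) pairs, the function names are hypothetical, and it is an assumption that a new file starts at every heading whose level number is less than or equal to the configured level.

```python
import re

def heading_level(style):
    """Return N for a style named 'heading N', or None for any other style."""
    match = re.fullmatch(r"heading (\d+)", style)
    return int(match.group(1)) if match else None

def split_at_headings(paragraphs, split_level):
    """Group (style, text) pairs into chunks, one chunk per output file."""
    chunks = [[]]
    for style, text in paragraphs:
        level = heading_level(style)
        if level is not None and not text.strip():
            continue  # empty headings are ignored
        if level is not None and level <= split_level and chunks[-1]:
            chunks.append([])  # assumed rule: start a new file here
        chunks[-1].append((style, text))
    return chunks

document = [
    ("Normal", "Introduction"),
    ("heading 1", "First chapter"),
    ("Normal", "Some text"),
    ("heading 2", "A subsection"),
    ("heading 1", ""),
    ("heading 1", "Second chapter"),
]
for number, chunk in enumerate(split_at_headings(document, 1)):
    print(number, [text for _, text in chunk])
```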

If the -h switch was present when rtftohtml was invoked, a navigation panel will be inserted at the top and at the bottom of every generated HTML file. This navigation panel will contain the following elements:

rtftohtml will try to use the language of the RTF file for labelling the navigation panel. Currently there is support for English, Spanish, French and German. However, if you would like a fancier-looking panel, with buttons etc., you can tell rtftohtml (by writing a simple configuration file) what HTML code it should use for the individual panel elements. The creation of such configuration files is described in detail in section Navigation panels.

Lists

rtftohtml knows about the following lists (in parentheses is the name of the paragraph style it expects each kind of list to be formatted with):

numbered ("numbered list")
items start with a tab and end with a paragraph mark (numbers before the tab are ignored)
unnumbered ("bullet list")
items start with a tab and end with a paragraph mark (bullets etc. before the tab are ignored)
glossaries ("glossary")
term and definition are separated by a tab, glossary entries are separated by a paragraph mark

Nested lists can be created from an RTF document by using a different style for each level of indentation. The styles "bullet list 1", "numbered list 2", ... represent different levels of nesting, with "bullet list 1" being at nesting level 1. The only rule is that no level of nesting may be skipped. For example, a "numbered list 3" paragraph must not appear immediately after a "Normal" paragraph. It must follow a paragraph with a nesting level of 2 or higher.

An example sequence of paragraph styles to produce a nested list might look like this:

numbered list
	bullet list 1
		bullet list 2
		glossary 2
	bullet list 1
		numbered list 2
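The "no skipped levels" rule can be checked mechanically before running the converter. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, and styles without a trailing number (such as "Normal" or "numbered list") are treated as level 0, which the text above does not spell out.

```python
import re

LIST_STYLE = re.compile(r"(?:bullet list|numbered list|glossary) (\d+)")

def nesting_level(style):
    """Level from a trailing number, e.g. 'bullet list 2' -> 2; otherwise 0."""
    match = LIST_STYLE.fullmatch(style)
    return int(match.group(1)) if match else 0

def find_skipped_levels(styles):
    """Return the indexes of paragraphs that jump more than one level deeper."""
    problems = []
    previous = 0
    for index, style in enumerate(styles):
        level = nesting_level(style)
        if level > previous + 1:
            problems.append(index)
        previous = level
    return problems

example = [
    "numbered list",
    "bullet list 1",
    "bullet list 2",
    "glossary 2",
    "bullet list 1",
    "numbered list 2",
]
print(find_skipped_levels(example))
print(find_skipped_levels(["Normal", "numbered list 3"]))
```

An empty result means that no paragraph skips a level; any index in the result points at a paragraph whose style should be changed.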

Tables

rtftohtml is able to convert tables to HTML automatically by generating a range of preformatted text to keep the cells in their place. For this reason only plain text is allowed in tables. Bold and italic text in tables should be possible in the next release of the rtftoweb patches. Tables converted by rtftohtml look something like this:

Column 1, Row 1              Column 2, Row 1              Column 3, Row 1              
Column 1, Row 2              Column 2, Row 2              Column 3, Row 2              
Column 1, Row 3              Column 2, Row 3              Column 3, Row 3              
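The idea of keeping cells in place with preformatted text can be sketched in a few lines of Python. This is not rtftohtml's implementation: the function name, the use of a single common column width, and the padding of two spaces are assumptions made for the example.

```python
import html

def table_to_pre(rows, padding=2):
    """Render rows of plain-text cells as fixed-width columns inside <pre>."""
    # Every column gets the width of the longest cell plus some padding.
    width = max(len(cell) for row in rows for cell in row) + padding
    lines = []
    for row in rows:
        line = "".join(cell.ljust(width) for cell in row)
        # Escape after padding so that the column alignment is preserved.
        lines.append(html.escape(line, quote=False))
    return "<pre>\n" + "\n".join(lines) + "\n</pre>"

rows = [
    ["Column 1, Row 1", "Column 2, Row 1", "Column 3, Row 1"],
    ["Column 1, Row 2", "Column 2, Row 2", "Column 3, Row 2"],
]
print(table_to_pre(rows))
```

Because the alignment depends on every character having the same width, character formatting inside cells is hard to support with this approach.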

If I ever have a lot of time on my hands, I plan to add support for tables as defined by the upcoming HTML 3.0 specification. Of course this would require you to use an HTML 3.0 capable browser such as Arena or Netscape.

Images

Graphics are embedded in RTF in either a binary format or an (ASCII) hex dump of that binary. I have never seen a binary format graphic - I don't think that the filter will process binary correctly. It does handle the hex format of graphics, by converting the hex back into binary and writing the binary to a file. The file extension is chosen by looking at the original type of the graphic. The following list shows the file types and their extensions:

Macintosh PICT
.pict - also, 512 bytes of nulls are prepended to the graphic. This is to conform to the PICT file format.
Windows Meta-files
.wmf
Windows Bit-map
.bmp
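The conversion from a hex dump back to binary, together with the choice of extension, can be illustrated as follows. This Python sketch is not the filter's code: the dictionary keys and function names are hypothetical labels, and the null-byte padding for PICT files is left out.

```python
# File extension per graphic type, as listed above (keys are hypothetical).
EXTENSIONS = {
    "pict": ".pict",  # Macintosh PICT
    "wmf": ".wmf",    # Windows Meta-file
    "bmp": ".bmp",    # Windows Bit-map
}

def decode_hex_graphic(hex_dump):
    """Turn an ASCII hex dump (whitespace allowed) into raw bytes."""
    compact = "".join(hex_dump.split())
    return bytes.fromhex(compact)

def extension_for(kind):
    """Look up the file extension for one of the known graphic types."""
    return EXTENSIONS[kind]

data = decode_hex_graphic("424d\n0a00")
print(len(data), extension_for("bmp"))
```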

In addition, the filter produces a link to the file containing the graphic. Now, since the above graphic formats are not very portable, the filter assumes that you will convert these files to something more useful, like GIF. So the format of the link is:

<a href="basenameN.ext">Click here for a Picture</a>

where

Since most Web browsers only support images in GIF format, you will have to convert the generated PICT and WMF files to GIF. For PICT there is picttoppm/ppmtogif, but for WMF? I don't know of any WMF translators for Unix; for DOS there is wmf2bmp, whose output could then be converted to GIF via the pbmplus tools. From what I understand, WMF is not a pixel but a vector graphic format, so maybe it would be easier to translate WMF to Postscript and then let Ghostscript do the job of converting to GIF. Any volunteers for writing a wmftops utility?

You can also change the link to an IMG form. If you specify the -I command line option, all links to graphics will be of the form:

<IMG src="basenameN.ext">

There is one other special case. If a graphic is encountered when the filter is in the process of generating a link, the IMG form of the link is used even without the -I command line option.

Cross references

All kinds of cross references can be created from within the RTF-file. The reference itself must be formatted with the attributes "double-underline/hidden" and must follow the standard HTML-conventions, such as "http://www.w3.org" or "file.txt" or "#mark1". The "hot" text, that is the text that will appear "clickable" in your Web-browser, immediately follows the reference and must be double-underlined, but not hidden.

Anchors for internal cross references (such as "mark1", corresponding to the example above) must be formatted either with the attributes "hidden/outline" or "hidden/superscript". For example this link will bring you to the list of new features in rtftoweb 1.6.

If you just want to create a reference to a certain heading or section, it is sufficient to format the reference with the color red (when using rtftoweb.dot: mark the reference and press Control-Shift-r). The text of the reference must match the first n characters of the heading, so the references "Supplying" and "Supplying a title" point to the same section.
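The prefix rule for heading references can be modelled like this. The Python sketch is only an illustration: the function name is hypothetical, and it is an assumption that the comparison is case-sensitive and that the first matching heading wins.

```python
def find_heading(reference, headings):
    """Return the first heading that starts with the reference text, else None."""
    for heading in headings:
        if heading.startswith(reference):
            return heading
    return None

headings = ["Supplying a title", "Character styles", "Headings", "Lists"]
print(find_heading("Supplying", headings))
print(find_heading("Supplying a title", headings))
```

A practical consequence of prefix matching is that a very short reference may match a different heading than intended, so it is safer to use enough of the heading text to make it unique.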

If an email address such as bolik@irb.uni-hannover.de is colored red, rtftohtml will automatically produce a cross reference of type "mailto". Not all Web browsers support this type of reference (Netscape does).

The same works for all other kinds of URLs, so if the URL ftp://ftp.rrzn.uni-hannover.de/pub/ is colored red, rtftohtml will automatically produce a reference pointing to that URL.
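How red text might be sorted into the three kinds of reference can be sketched in Python. The patterns and the order of the checks are assumptions made for this example, and the function name is hypothetical; rtftohtml's real recognition rules may differ.

```python
import re

def red_text_to_href(text):
    """Return an href for red text, or None if it looks like a heading reference."""
    # Something like user@host is treated as an email address.
    if re.fullmatch(r"[^@\s]+@[^@\s]+", text):
        return "mailto:" + text
    # Anything starting with scheme:// is treated as a URL and used as-is.
    if re.match(r"[A-Za-z][A-Za-z0-9+.-]*://", text):
        return text
    return None

print(red_text_to_href("bolik@irb.uni-hannover.de"))
print(red_text_to_href("ftp://ftp.rrzn.uni-hannover.de/pub/"))
print(red_text_to_href("Supplying a title"))
```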

Index entries and footnotes

If your RTF document contains footnotes or endnotes, the filter will place the text of the footnote in a separate HTML document. At the footnote reference mark, the filter will generate a hypertext link to the text of the footnote. This works with either automatically numbered footnotes[1] or user-supplied footnote reference marks[+].

If you insert index entries into your RTF document and give rtftohtml the -x option, rtftohtml will generate a hypertext index for the generated HTML documents. Note that when using NCSA Mosaic as your Web browser you should also tell rtftohtml to insert some text into the generated anchors by using the command line switch -X text (see section Command line options).

Other features

Horizontal lines

The paragraph style "hr" can be used to produce a horizontal line in the HTML output (this will be translated to the <hr> tag).

Discarding Unwanted Text

If you have text that you do not want to appear in the HTML output, simply format the text as Hidden and Plain (that is, no underline, outline...).

If you wish to modify the formatting that discards text, you need to change the entry in html-trans that specifies "_Discard".

Embedding HTML in a Document

Normally, if your RTF document contained the text "<cite>hello</cite>", the translator would output this as: "&lt;cite&gt;hello&lt;/cite&gt;". This ensures that the text would appear in your HTML output exactly as it appeared in the original RTF document. If, however, you want the <cite></cite> to be interpreted as HTML markup, you must format the tags using Hidden and Shadow or Hidden and Strikethrough. The filter will then send the tags through without translation. It is also possible to use the paragraph style "HTML" to let rtftohtml interpret a whole paragraph as being literal HTML.
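The difference between translated and literal text can be shown with Python's standard html module. This is only an illustration of the behaviour described above, not the filter's code; the function name emit and its literal parameter are hypothetical.

```python
import html

def emit(text, literal=False):
    """Escape markup characters unless the text is marked as literal HTML."""
    if literal:
        return text
    return html.escape(text, quote=False)

print(emit("<cite>hello</cite>"))
print(emit("<cite>", literal=True) + emit("hello") + emit("</cite>", literal=True))
```

The first call shows what happens to ordinary text, the second what happens when only the tags carry the literal formatting.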

When the rtftohtml filter produces HTML markup, it keeps track of the nesting level of tags to ensure that you don't get something like <b><cite>hello</b></cite>, which would be incorrect markup. If you embed HTML markup in your document, the filter will NOT be aware of it. You must ensure that your markup is correctly nested.
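Since the filter does not check embedded markup, a small script can do so before conversion. The Python sketch below is a simplified checker under stated assumptions: the function name is hypothetical, every non-empty element is expected to have an explicit closing tag, and only br, hr and img are treated as tags without a closing counterpart.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
EMPTY_ELEMENTS = {"br", "hr", "img"}  # assumed set of tags that are never closed

def is_correctly_nested(markup):
    """Return True if every opening tag is closed in the reverse order."""
    stack = []
    for match in TAG.finditer(markup):
        closing, name = match.group(1), match.group(2).lower()
        if name in EMPTY_ELEMENTS:
            continue
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # closing tag does not match the innermost open tag
    return not stack  # anything left on the stack was never closed

print(is_correctly_nested("<b><cite>hello</b></cite>"))
print(is_correctly_nested("<b><cite>hello</cite></b>"))
```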

If you wish to modify the formatting for embedded HTML, you need to change the entry in html-trans that specifies "_Literal".

Other paragraph styles

rtftohtml understands a few other paragraph styles by default. These are (among others):

address
Will be converted to HTML's <address>-environment.
blockquote
Will be converted to HTML's <blockquote>-environment.
pre
Will be converted to HTML's <pre>-environment. This is useful when spacing is important in a paragraph.
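Conceptually, the default behaviour amounts to a lookup table from paragraph style names to HTML tags. The Python sketch below only pictures that idea; it does not reproduce the syntax of html-trans, the names are hypothetical, and it contains just the styles named in this chapter whose tags are stated or obvious.

```python
# Conceptual style-to-tag table (not the html-trans file format).
DEFAULT_STYLE_TAGS = {
    "heading 1": "h1",
    "heading 2": "h2",
    "address": "address",
    "blockquote": "blockquote",
    "pre": "pre",
    "hr": "hr",
}

def tag_for_style(style):
    """Return the HTML tag for a known paragraph style, or None if unknown."""
    return DEFAULT_STYLE_TAGS.get(style)

print(tag_for_style("blockquote"))
print(tag_for_style("My Own Style"))
```

A style that is missing from the table has no mapping, which is the situation that customizing html-trans is meant to resolve.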

A .dot file for WinWord

While using rtftohtml myself I created a style file for Microsoft Word 6.0 called rtftoweb.dot (I also have a less sophisticated dot file for Word 2.0 lying around somewhere; mail me (zzhibol@rrzn-user.uni-hannover.de) if you are interested). Using this file as the document template for your documents makes it really easy to create RTF documents which rtftohtml can translate without any problems. You make your documents use rtftoweb.dot by following the usual procedure for assigning dot files. In German Word 6.0 this is (sorry, currently there are no English instructions available):

rtftoweb.dot adds all the paragraph styles which are understood by rtftohtml (without modifying html-trans) to your document. Additionally some keyboard shortcuts are now defined (or possibly redefined...):

Ctrl-Shift-1 ... Ctrl-Shift-6
Selects paragraph style "heading 1" ... "heading 6".
Ctrl-Shift-p
Selects paragraph style "pre".
Ctrl-Shift-b
Selects paragraph style "bullet list".
Ctrl-Shift-n
Selects paragraph style "numbered list".
Ctrl-Shift-g
Selects paragraph style "glossary".
Ctrl-Shift-h
Formats the selected text for plain HTML.
Ctrl-Shift-r
Formats the selected text with the color red (for cross references).
Ctrl-Shift-i
Formats the selected text to be the destination of a cross reference.
Ctrl-Shift-u
Formats the selected text to be the "hot text" of a cross reference.
Ctrl-Shift-a
Formats the selected text to be the anchor of a cross reference.
Ctrl-Shift-c
Formats the selected text to use font "Courier New".
Ctrl-Shift-t
Formats the selected text to use font "Times New Roman".