The Ten Commandments For Comparison
Of A Legal Document in MS Word
1. Thou Shalt follow "The Seven Laws Of Outline Numbering".
2. Thou Shalt Not Boast a Bad Binary File Header, or Use "Fast
Saves".
3. Thou Shalt Thoroughly Cleanse Thy Scanned Documents.
4. Thou Shalt Properly Paginate Any Document To Be Compared.
5. Thou Shalt Unlink All Field Codes, Turning Them to Text –
and While You’re At It: Remove Any {TA}/{TOA} Field Codes
from Thy Content.
6. Thou Shalt Practice Safe Pasting From Other Windows Applications.
7. Thou Shalt Not Permit Large Populations of Directly-Formatted
Paragraphs, Empty Paragraph Markers or "Creepy" Section Breaks
to Remain in Thy Documents.
8. Thou Shalt Be Respectful When Comparing Large Tables and Linked
or Embedded "Stuff".
9. Thou Shalt Be Mindful That Special Characters be Consistently
Created, Especially in Footnotes.
10. Thou Shalt Unburden All Documents of Active Or Dormant Revision
Tracking.
Punishment For Ignoring The Ten Commandments For Comparison Of A
Legal Document In Word 97:
Abiding by "The Ten Commandments" is clearly a choice,
not a mandate. If ignored, at a minimum one might suffer output
which is undesirable, requires further editing, or does not log
all changes; at worst, an illegal operation (e.g., crash) of the
application. When obeyed, however, the payback yields more than
a desirable document comparison moment. Subscription to "The
Ten Commandments" reinforces document production skills which
bring us – and our documents – out of this
century, and, gratefully, into the next…
Definition Of Terms
*Binary File Header
– What’s left of the document when you remove its content.
In tangible terms, this refers to the style list, properties of
the document’s first section, numbering defined in the document
(e.g., the "ListTemplate" collection), or information
stuffed into the File/Properties fields – in short: the stuff
held in the all-powerful final paragraph marker of a Word document.
*RTF File Format – This stands
for "Rich Text Format" and it has been an evolving Microsoft
standard since Word’s early days. Originally devised as an
ASCII file format which provided a mechanism for transporting files
across multiple platforms where binary file transfers were compromised
by differences in operating and file systems, RTF grew into the
standard interchange format used by application vendors to read
and write Word documents. Interestingly, it is also the ‘standard’
text format used by the Windows Clipboard. With the release of Office
97, the RTF file format specification was updated to provide for
the expanded attributes of the Word 97 binary file format.
*The ListTemplate – Word-speak
for individual numbering schemes defined within any document.
* Directly Applied Paragraph Formats –
Any paragraph format that differs from the underlying style. In the
context of this article, the troublesome attributes generally refer
to paragraph indents, line spacing choices or custom tabstops applied
directly to the majority of paragraphs in the document. In this
case, directly-applied formats would be more efficiently inherited
by text in the document as properties of a paragraph style. The
command, [Shift-F1], reveals the formatting attributes of any text
on which you click. To strip away directly-applied paragraph formatting
use [Ctrl-Q] ([Ctrl-SpaceBar] clears character formats), or simply
modify the underlying style to include the attributes that are directly
applied. Either technique will more efficiently format the document.
PLEASE NOTE: This in no way indicates that directly-applied attributes
are not allowed in a document, but rather suggests
that the use of styles offers a significant gain in productivity.
For example, a document where double-spacing has been directly-applied
to all paragraphs would behave more efficiently in Word if this
attribute were added to the underlying paragraph style.
* Unlink Field Codes – Word’s
field codes – those commands encased in the { and } curly
braces – are displayed with [Alt-F9], updated when selected
with the [F9] key, and unlinked when selected using [Ctrl-Shift-F9].
When a field code is unlinked, it becomes text – the dynamic
nature of the field code, then, is removed.
1. Thou Shalt follow "The Seven Laws Of Outline Numbering".
Without argument, the power of outline numbering introduced in Word
97 has provided compelling reasons for our nation’s law firms
to migrate away from their previous legacy of a code-based word
processor to the styles-based document production environment afforded
by Microsoft Word. Sadly, our users – not to mention their
supporting cast – are not adequately prepared to use or troubleshoot
this powerful "new" feature. Microsoft uses a term, the
"Futz Factor", to describe the inappropriate, unanticipated,
and largely unrepeatable application of a new software feature.
Indeed, in this context, and in our community, Word 97’s numbering
features have invited a fair amount of "Futz"!
As usability issues associated with Outline
Numbering began to overwhelm the community late last year, Microsystems
released a document titled "The Seven Laws of Word 97’s
Outline Numbering Linked to Styles". Its intent was not only
to explain the feature’s use, but to promote "Best Practices"
for training, supporting, troubleshooting and applying Outline Numbering.
While disappearing numbers, unlinked styles
and other nuisances occur in a document where the uninformed user
has attempted to apply Outline Numbering, the more debilitating
consequence of the feature’s misuse turns up
when executing Lexis-Nexis’ CompareRite product. Those who
are less document-inclined might be quick to point to CompareRite
as the guilty application here, but a saucer-deep search through
the document will, more often than not, reveal that one or more of
"The Seven Laws" has been violated.
Most common outcomes of numbering gone
awry in a document are:
* Crashing of CompareRite.
* Textual outline numbers in the resulting document, but inaccurate
{LISTNUM} values.
* Mixture of directly applied numbering and textual numbering in the
resulting document, thus producing inaccurate numbering results.
* Bulleted lists producing odd font formatting, and/or inappropriately-mapped
bullet characters.
A study of "The Seven Laws" might bring some insight to
your numbering practices, and indeed could point out some of the
compromises encountered when the document travels through the CompareRite
process.
2. Thou Shalt Not Boast a Bad
Binary File Header.
Community-wide concerns have been expressed speaking to the lack
of reliability of CompareRite in a Word 97 environment – most
especially in comparison to (pun intended!) the outstanding reliability
we saw in the WordPerfect 5.1-era. To shed some perspective here,
the statistics say it all: the ‘weight’ of a binary
file created in WordPerfect 5.1 was just 314 bytes, but in Word
97, it is nearly 20,000 bytes – almost double its own preceding
version (Word 6/7) which weighed in at 11,000 bytes. The complexities
associated with the document itself – before content has been
added – have increased more than 63 to 1! This articulates, too,
the increased opportunities for the comparison – and a myriad
of other conversion moments – to become ‘less reliable’
in the eyes of our users, and expertly defines what has happened
to document production requirements since the fondly recalled days
of the ‘blue screen’. The intent of this commandment,
then, is to bring awareness to identifying a troubled document –
one whose binary file header is of suspect origins – which,
when compared, will undoubtedly undermine the application’s
attempts to produce an output document.
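For those who like to check the arithmetic, here is a minimal Python sketch of the ratios quoted above. The byte counts are the approximate figures from the text; the names HEADER_BYTES and growth_ratio are ours, for illustration only.

```python
# Approximate sizes, in bytes, of an empty document in each format,
# as quoted in the text above (rounded figures, not measurements).
HEADER_BYTES = {
    "WordPerfect 5.1": 314,
    "Word 6/7": 11_000,
    "Word 97": 20_000,
}


def growth_ratio(newer: str, older: str) -> float:
    """Return how many times larger the newer format's empty file is."""
    return HEADER_BYTES[newer] / HEADER_BYTES[older]


if __name__ == "__main__":
    wp = growth_ratio("Word 97", "WordPerfect 5.1")
    w6 = growth_ratio("Word 97", "Word 6/7")
    print(f"Word 97 vs. WordPerfect 5.1: {wp:.1f} to 1")
    print(f"Word 97 vs. Word 6/7: {w6:.2f} to 1")
```

With these rounded inputs the first ratio comes out a little under 64 to 1, and the second a little under 2 to 1, which matches "almost double".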
To ‘do its thing’, CompareRite
must spawn a conversion of your Word 97 document. To do so, it uses
an updated RTF format released by Microsoft in conjunction with
the update to Word 97. As such, the integrity of the binary file
header of the document is of paramount importance. To the layman,
a binary file header means:
*The document’s style
list. This cannot include previously-converted styles which
appear ‘suspect’. Most often, these come from WordPerfect
documents, or the abused, round-tripped document. Suspect style
names might include odd symbols, truncated names ("Default
Para" rather than "Default Paragraph Font"),
or no name at all. You might also verify that non-standard collections
of styles do not appear in the document – most often generated
by a scanner session gone weird, or by a further conversion
of a scanner session gone awry ("OmniPage#1", "OmniPage#2",
… or "P1", "P2", …).
* The document’s first section. The properties of this
section – or of any other section, for that matter –
must not contain poorly-converted headers or footers filled
with ill-placed Text Frames or a significant number of superfluous
or aborted field codes (refer to the Fifth Commandment).
* Numbering. If numbering is inconsistently-applied or incompletely-defined,
the RTF filter can be thwarted at converting the document. Remember:
in Word 97 numbering, whether it be bullets, numbers, or outline
schemes, is a property of the document, not of the paragraph.
Any incompleteness here will surface when the document is prepared
for comparison.
* The bookmark list. A bookmark list piled full of entries (we’ve
seen as many as 996!) does not make for an efficient –
or successful – CompareRite moment.
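The style-list and bookmark checks above can be roughed out as a small Python sketch. It assumes you have already gathered the style names and bookmark names into plain lists by some other means; nothing here talks to Word. The function names, the pattern list and the MAX_BOOKMARKS threshold are all hypothetical choices of ours, not part of any product.

```python
import re

# Scanner-generated names of the kind quoted above ("P1", "OmniPage#1").
# Illustrative only; a real audit would need a longer list.
SCANNER_STYLE = re.compile(r"^(P\d+|OmniPage#\d+)$")

# Built-in names whose truncated forms we want to catch.
# Assumption: only the one example given in the text.
KNOWN_BUILTINS = {"Default Paragraph Font"}

# Assumption: an arbitrary alarm level, well below the 996 seen in the wild.
MAX_BOOKMARKS = 100


def suspect_styles(style_names):
    """Return the names that look blank, scanner-made, truncated or symbol-laden."""
    flagged = []
    for name in style_names:
        stripped = name.strip()
        if not stripped:
            flagged.append(name)  # no name at all
        elif SCANNER_STYLE.match(stripped):
            flagged.append(name)  # left behind by a scanner session
        elif any(b != stripped and b.startswith(stripped) for b in KNOWN_BUILTINS):
            flagged.append(name)  # truncated built-in name
        elif re.search(r"[^\w\s#,./()&\-]", stripped):
            flagged.append(name)  # odd symbols in the name
    return flagged


def too_many_bookmarks(bookmark_names):
    """Return True when the bookmark list exceeds our arbitrary threshold."""
    return len(bookmark_names) > MAX_BOOKMARKS
```

For example, suspect_styles(["Normal", "P1", "Default Para"]) flags the last two names and leaves "Normal" alone. A flagged name is only a prompt to look closer, not proof that the file is damaged.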
One way to test whether the binary file
header (i.e., all parts of the file, irrespective of its content)
is the culprit would be to delete all but a few words of the content
of the original and revised files, then SaveAs to new names. If
these files crash CompareRite, you know the file itself, not its
content, is the real problem. The most popular reasons for this
event are documents whose binary file header was manufactured by
an application other than Word 97, or – and we’ve seen
a fair number of these – documents in which numbering has
been incorrectly or incompletely defined.
One additional side note about Word’s
own Compare feature: The feature was introduced way back in the
Word 1.0/2.0 era. Since that time few enhancements have been made
to the utility, while many enhancements have been made to Word
itself. Many folks report that the feature will often miss whole
hunks of changes, or may even hang on large comparisons. Once again,
the increased complexities of the document are at work here: Word
2.0 files were indeed simpler to deal with.
3. Thou Shalt Thoroughly Cleanse
Thy Scanned Documents.
With Versions 6 and 7 of Microsoft Word, we saw scanning utilities
which created files containing unique style names applied to every
paragraph: "P1", "P2", "P3", or "OmniPage#1",
"OmniPage#2", etc. What we’re seeing is that when
these documents are brought forward to Word 97 – which means
they’ve undergone, yet again, an internal binary file conversion
– the result is a document which can behave oddly. These documents
exhibit such symptoms as a Normal style that will not stick – changing
without reason – and numbering definitions which are troublesome
to maintain. Indeed when these documents are compared (that is,
converted to compare), the CompareRite application will crash.
A hybrid mutant file type is now evolving
where newer scanning software produces either a binary Word 97 output
format or an accelerated RTF file. In these documents, the software
has shifted to using bookmarks in lieu of style placeholders. Bookmarks,
in and of themselves, are no problem – nor, for that matter
are the series of style definitions produced by these applications.
What is the problem, however, is that should CompareRite have to
sift through all this ‘stuff’ in order to reach the
real heart of a meaty legal document, this level of compromise increases
the application’s difficulties at getting there! We’ve
seen CompareRite crash on bad style names (e.g., "no name"),
and we’ve seen it crash on a list of bookmarks that were 996
entries long. In both of these situations, the user was not at fault:
some other application had manufactured the file – and the
user simply was not aware that this ‘stuff’ needed to
be removed.
It goes without saying that scanned document
content should be cleaned up – the more important consideration,
however, would be to clean up beyond the content. SPECIAL ALERT:
It is best to utilize scanned documents using the following procedure:
FileNew, letting Word create the binary file header, then InsertFile
to retrieve only the content of the scanned result. FileOpen invites
too many opportunities for failure. Even this method does not preclude
the need to perform thorough "document forensics" before
cloning the result and editing it!
4. Thou Shalt Properly Paginate
Any Document To Be Compared, OR Thou Shalt Make Sure the Document
is Completely Saved Or Thou Shalt Make Sure ‘Keep With Next’
Isn’t Assigned on More Paragraphs Than It Should Be.
These situations run hand in hand. We’ve heard lots of folks
advising comparisons of Word 97 documents, but only after the document
has been saved back to a Word 6/95 format. While indeed, many situations
can be alleviated by doing so, please know: upon performing such
a conversion, the resulting file is not completely ‘reflowed’
– what we as users know of as ‘paginated’ –
and CompareRite needs page numbers in order to do its stuff.
The other problem is that, often, our
documents are not ‘completely’ saved. This means that
the binary file which CompareRite is trying to operate on is primarily
strewn with document content, followed by a series of pointers and
complete user edits. Can you imagine how unwieldy – not to
mention huge – a document can get? This saving technique is
Word’s "Fast Save" technology, introduced way back
in the days of saving documents to floppy disks. While most installations
have this setting turned off upon migration to Word 97, and most
of us know it is a ‘bad’ feature to implement, we don’t
account for the fact that we inherit documents from clients, or
from our home PCs, where this setting is turned on as an installation
default (Tools/Options/Save)! And once a document has been Fast
Saved, it requires intervention to force a complete save, meshing
all the user edits with the remainder of the file.
Finally, any FileOpened converted document,
by default, is filled with ‘bloat’ – occurring
because the converter needs to reserve work space in which to perform
its revisions to the file format and content. Typically, a user
does not force a save on this document with their cursor positioned
at the end of the file – they continue to work on the document,
saving only at the points at which they’ve made edits. The
file remains bloated, then, for all of eternity.
To force a complete save, any of the following
methods would work:
* Move to the end of the document, enter
a change on the last line, delete it, and save at the final paragraph
marker. This forces the document to ‘reflow’, issues
a pagination pass, and performs a complete save.
* If you have saved back to Word 6/7, open the result in that application,
allowing it to completely flow (paginate) the document,
move your cursor to the last paragraph marker, make an edit, delete
it, then save.
5. Thou Shalt Unlink All Field Codes, Turning Them to Text
– and While You’re At It: Remove Any {TA}/{TOA} Field
Codes from Thy Document Content.
The suggestion for unlinking of field codes is made for two primary
reasons: a) to reduce the possibility that field codes in the document
have been incorrectly created by the user: we’ve seen field
codes embedded within field codes, and we’ve also seen the
user erroneously edit them so that they are open-ended; and b) to
alleviate issues associated with Word 97’s {LISTNUM} field
code.
CompareRite is most sensitive to erroneous
field codes because it uses the RTF file format as both its source
and its destination formats. The RTF file format, largely an ASCII
file format, uses the "{" and "}" characters
to introduce such features as bold, or to define or apply styles.
If a field code in the original document is un-paired, it can then
cause the RTF conversion itself to go awry, since terminating
"}" characters have become out-of-step. Most often what
occurs in this situation is that you’ll end up with chunks
of text missing from your result.
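To make the "out-of-step" brace problem concrete, here is a minimal Python sketch that counts group depth in RTF text. It assumes the file has been saved as RTF and read in as a string; the function name rtf_brace_balance is ours. It honors the backslash escapes (\{, \} and \\) but does not cope with raw binary data embedded in the file, so treat it as a rough check rather than an RTF parser.

```python
def rtf_brace_balance(rtf_text: str) -> int:
    """Return the group depth left open at the end of the text (0 = balanced).

    Raises ValueError if a "}" appears before its matching "{".
    """
    depth = 0
    i = 0
    while i < len(rtf_text):
        ch = rtf_text[i]
        if ch == "\\":
            # Skip the backslash and the character after it, so that the
            # escaped literals \{ and \} are never counted as group braces.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched '}}' at offset {i}")
        i += 1
    return depth
```

A result other than zero means some group was never closed, which is the situation in which chunks of text tend to go missing from a converted result.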
In the case of Word’s {LISTNUM}
codes, what happens is that the RTF result delivers this display
field, but, alas, the definition of the scheme (the very list the
{LISTNUM} code is trying to display) is missing. What’s more,
it’s incorrect because the outline numbering linked to styles,
within which it has been sequenced, has all been turned to text! Its
levels, then, become one continuous list, and the result is even
more incorrect.
To unlink field codes in a document, work
on SaveAs versions of the originals, [Ctrl-A] to SelectAll, then
[Ctrl-Shift-F9], turning all field codes to text in the main body
of the document. Use these versions for your comparison, leaving
the original documents to maintain their dynamic formatting.
{TOA}/{TA} field codes in a document can
cause the document to report incorrect page numbers in Full Authority,
and can cause CompareRite to work harder, as well. It would be wise
to delete all {TA} field codes (if you’re using Full Authority,
they’re needless!), and to unlink, or turn to text, the results
of any generated TOA you want to preserve. To do so:
Making sure that field code results are
displayed, locate the generated TOA, and select it. Press [Ctrl-Shift-F9].
This unlinks the dynamic field code and turns it to formatted text.
Now show field codes with [Alt-F9].
Edit/Find, looking for {^d TA} (some converted codes may or may
not have the space, so you may need to perform two searches).
Delete and/or replace all {TA} fields with nothing.
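The find-and-delete step can be sketched outside Word as well. The Python below assumes a plain-text rendering of the document in which field codes appear between literal "{" and "}" characters, which is only an approximation of what [Alt-F9] shows on screen; the name strip_ta_fields is ours. One pattern covers both the spaced and unspaced forms, so the two searches collapse into one.

```python
import re

# Matches a {TA ...} field with or without a space after the opening brace.
# "TA\b" does not match {TOA ...}, which is to be unlinked, not deleted.
# Limitation: a TA field containing a nested field is not matched.
TA_FIELD = re.compile(r"\{\s*TA\b[^{}]*\}")


def strip_ta_fields(text: str) -> str:
    """Return the text with every simple {TA ...} field removed."""
    return TA_FIELD.sub("", text)
```

Run it against a copy of the text, never the original, in keeping with the SaveAs advice above.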
6. Thou Shalt Practice Safe Pasting From Other Documents And Windows
Applications.
If you’re pasting text from word processing applications which
perform formatting (WordPerfect for Windows, or previous versions
of Word, for example), be mindful that the codes you carry into
the target document can indeed impact it in a negative way.
Why does this happen? What many people
do not realize is that the Clipboard – the location where
copied or cut text is stored – actually performs a conversion
on the data you’re pasting. To check it out: copy or cut something
in Word 97, then do Edit/PasteSpecial. You should see that, by default,
it navigates to "Formatted (RTF)". Edit/PasteSpecial/Formatted
Text and a regular Edit/Paste, then, are one and the same thing!
If you’re pasting from applications
which contain formatting, consider using Edit/PasteSpecial as "Unformatted
Text" – if you’re copying from a tortured Word
97 file, in fact, consider using Edit/PasteSpecial as Unformatted
text. At a minimum, be aware that the codes and/or formatting you’re
moving into your documents may not be desired and should be averted
or cleaned out.
7. Thou Shalt Not Permit Large
Populations of Directly-Formatted Paragraph Attributes, Empty Paragraph
Markers, or "Creepy" Section Breaks to Remain in Thy Documents.
Why? Word is a Styles-based word processor. It is designed to work
efficiently by managing the pagination and scrolling routines based
on the properties of the underlying style. If paragraph and font
formats are applied primarily using direct formatting, Word has
to work harder at managing these tasks. These sloppy habits, then,
can add unnecessary overhead to working in Word 97. What’s
more, this unnecessary formatting ‘bloats’ the saved
version of the document – exactly the file CompareRite must
convert and compare. Use of Styles, cleanly-formatted section breaks
– and using only those section breaks required in the document
– make the document more efficient, thus easier to convert,
compare and edit.
8. Thou Shalt Be Respectful When
Comparing Large Tables and Linked or Embedded "Stuff".
Indeed, CompareRite doesn’t compare large tabled documents
well in Word 97 – for that matter, it never did in WordPerfect
5.1, either. You might consider comparisons on these types of documents
with the tables turned to text, or with table text marked to be
‘skipped’.
While our testing isn’t complete,
we’re somewhat suspicious, too, of lots of ‘linked’
stuff: hyperlinks (which create hidden bookmarks), links to other
file types – particularly graphics – or links to other
documents. If your document contains these complex features, the
RTF file may contain more binary data than ASCII, thus can be more
problematic. We’ll update you as we conclude our testing.
9. Thou Shalt Be Mindful That
Special Characters be Consistently Created, Especially in Footnotes.
Word 97 introduced an exciting new way of handling the myriad of
special symbols and accented language characters required in word
processing documents: Unicode. This emerging standard creates a
grid of 65,536 characters, each with its own, "unique code".
This enables applications to cease ‘sharing’ decimal
values as we have in the past, in order to produce the different
symbols we need to use. If you think about it, our keyboards can’t
possibly contain them all, so they have to be handled somehow!
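A few lines of Python illustrate the "unique code" idea. The sketch shows the size of the 16-bit grid described above and the codes of the symbols named later in this section; the helper name code_point is ours.

```python
# The 16-bit grid described in the text: 2**16 distinct code values.
GRID_SIZE = 2 ** 16


def code_point(ch: str) -> str:
    """Format a single character's code in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


if __name__ == "__main__":
    print(GRID_SIZE)
    # Copyright, Trademark, and open and closed quotes.
    for symbol in ("\u00a9", "\u2122", "\u201c", "\u201d"):
        print(symbol, code_point(symbol))
```

Each symbol has exactly one code no matter which font displays it, which is the property that the older font-dependent decimal values lacked.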
The problem is that we bring text into
our documents from applications which pre-date the Unicode grid,
and one of the formats that’s had a hard time keeping up is
RTF itself. Remember that RTF is used by the Clipboard, thus pasting
special symbols into a document may or may not produce the appropriate
character in the document. (There’s also a whole lot of incorrectly
mapped WordPerfect symbols lying around in our documents, too…)
This problem will generally exhibit itself by displaying and/or
printing a small, square graphic box: "□", or by displaying
the ‘correct’ symbol, yet when the character formatting
is stripped off ([Ctrl-SpaceBar]), this special character becomes
a square graphic box: "□".
We’ve seen a few CompareRite results
which, when run against documents where these special symbols were
never correct to begin with, produced pages of square graphic boxes
where alphabetic text should have been! What we can do to minimize
this kind of problem is to make sure that all characters in a document
are true Unicode symbols, and not allow converted symbols to remain
incorrect in our Word 97 documents. That, and remaining patient
until all the kinks associated with this significant change –
and this is indeed a significant change – are worked out.
Already Lexis and Microsoft have resolved many symbol issues: early
ones included the Copyright, Trademark, and open and closed quotes.
10. Thou Shalt Unburden All Documents
of Active or Dormant Revision Tracking.
We can’t tell you how many Word documents – some of
them coming all the way back from Word 6/7 lineage – cart
around ‘hidden’ native Word revisions. What generally
happens is that someone using the document turned on Revision Tracking,
then turned off Revision tracking without accepting or rejecting
the revisions they had made. What’s more, they’ve gone
so far as suppressing the changes from view! In the life cycle of
the document, no one notices they are there, and they’re carted
around in the file – and all of its derivatives – for
life. Make sure all Tracked Revisions are "Accepted".
Not only does this help avoid embarrassing moments with the client who
discovers such buried treasure (!), but it helps CompareRite by
eliminating a whole lot of superfluous junk it has to plow through
in order to compare the text of the document.
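One rough way to spot buried revisions is to look at an RTF copy of the document. The Python sketch below assumes the file has been saved as RTF and read in as a string, and relies on the RTF control words \revised (inserted text) and \deleted (deleted text). The function name count_revision_marks is ours, and the check is a hint only: it does not parse RTF, so literal text that happens to contain these sequences would be counted too.

```python
import re

# \revised marks text inserted under revision marking; \deleted marks text
# deleted under it. The lookahead keeps longer control words such as
# \revisions from being counted by mistake.
REVISION_MARK = re.compile(r"\\(revised|deleted)(?![a-z])")


def count_revision_marks(rtf_text: str) -> dict:
    """Count \\revised and \\deleted control words in a string of RTF."""
    counts = {"revised": 0, "deleted": 0}
    for match in REVISION_MARK.finditer(rtf_text):
        counts[match.group(1)] += 1
    return counts
```

If either count is above zero, accept the tracked revisions in Word and save again before running the comparison.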
Acknowledgements
Dedicated to:
Those Who Have Painfully Ferreted Out These Answers,
And To All Those Who Must Troubleshoot, Defend, Minister,
Uphold, or Recover From Violation of The Ten Commandments
Reported By:
Sherry Kappel
Microsystems Engineering Company
sherryk@microsystems.com
February 12, 1999
With Affirmations From:
Amy Karnham, Lexis-Nexis (OH)
Doris Hibner, Rogers & Wells (NY)
Marc Harris, Winthrop Stimson (NY)
Bill Robertson, Softwise Consulting (NY)
Rolanda Carter, O’Melveny and Meyers (LA)
Fred Feldt, Haythe Curley (NY)
Judy Caputo, Kelley Drye (NY)
Suellen Miller, Sonnenschein (Chicago)
Ellen Rosenstiel, Kutak Rock (Omaha)
and, last but not least:
Each of the 50 Million DocXchanged
Documents Created To Date
Microsoft, Word, Word97 and Windows95
are trademarks or registered trademarks of Microsoft Corporation.
WordPerfect is a trademark or registered trademark of Corel Corporation.
CompareRite is a trademark or registered trademark of Lexis-Nexis.