
Pasting from Microsoft Office Applications

When you paste formatted content from Microsoft Office applications such as Word, Excel, or Outlook, CKEditor filters it in much the same way as the Office 2000 HTML Filter does: it automatically removes Office-specific markup to reduce the size of the final content. In addition, it applies transformations that produce more semantically correct HTML/XHTML markup (for example, proper list structures) while preserving as much of the original formatting as possible.
Which plugins are required depends on how you want to paste:

  1. The pastefromword plugin must be present to provide this functionality. With this plugin alone, click the Paste From Word toolbar button to tell the editor that the clipboard content comes from an MS Office application.
  2. To detect automatically whether the content comes from an Office application, the clipboard plugin must also be configured. In that case, just press Ctrl-V or click the Paste toolbar button.
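
As a rough illustration of the two requirements above, the following sketch builds an editor configuration that loads both plugins. It assumes the CKEditor 3.x/4.x configuration options `extraPlugins` and `pasteFromWordPromptCleanup`; the element id `editor1` is a hypothetical placeholder, and the exact options available may differ between editor versions.

```javascript
// Sketch of an editor configuration (assumed option names from the
// CKEditor 3.x/4.x config API; check them against your editor version).
const editorConfig = {
  // Load the Paste From Word filter and the clipboard integration.
  extraPlugins: 'pastefromword,clipboard',
  // Ask the user before cleaning up content detected as coming from Word.
  pasteFromWordPromptCleanup: true
};

// CKEDITOR only exists in a page that has loaded ckeditor.js.
// 'editor1' is a hypothetical textarea id used for illustration.
if (typeof CKEDITOR !== 'undefined') {
  CKEDITOR.replace('editor1', editorConfig);
}
```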

Browser Native Filtering

Each web browser has its own paste handling, so the result of pasting content from Word varies between browsers. In general, certain styles, and sometimes even text, are lost during pasting. The following table lists what is known to be filtered out:

Browsers                  | IE | Firefox | Safari/Chrome | Opera
Document Structure Styles | NO | Yes     | NO            | NO

CKEditor Filtering

The editor then processes the browser's filtered result further. The following table summarizes the rules that affect the end result:

Each entry below gives the format removed, with an example, followed by the result affected.

Downlevel conditional comments: content within <!--[ and ]-->

Example

<!--[if gte mso 9]>...<![endif]-->
WordArt cannot be edited, only the resulting static image is left. These comments make some HTML markup invisible to browsers earlier than Microsoft Internet Explorer 5.

For example, Office inserts XML blocks containing WordArt document properties inside these comments so that the contents of these XML elements do not show up as text in browsers earlier than Internet Explorer 5.
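
A minimal sketch of this rule, assuming a simple regular-expression approach rather than the editor's actual implementation: the hypothetical helper `stripDownlevelComments` removes each downlevel conditional comment together with everything inside it.

```javascript
// Hypothetical helper, not part of CKEditor: removes downlevel conditional
// comments such as <!--[if gte mso 9]>...<![endif]--> including their content.
function stripDownlevelComments(html) {
  // [\s\S]*? matches lazily across line breaks up to the first closing marker.
  return html.replace(/<!--\[if[^\]]*\]>[\s\S]*?<!\[endif\]-->/gi, '');
}
```

For example, `stripDownlevelComments('<p>a</p><!--[if gte mso 9]><xml></xml><![endif]--><p>b</p>')` leaves only the two paragraphs.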

Uplevel conditional comments within <![ and ]>

Example

 
<![if !vml]>
These comments make some HTML markup visible in browsers earlier than Internet Explorer 5 but invisible in Internet Explorer 5 or later. When the comments are removed, the markup indicating that static images should not be loaded in Internet Explorer 5 or later is lost.

For example, WordArt is saved as HTML in two parts. One part is an XML block that describes the image. The other part is an actual image that makes the picture visible in older browsers that don't interpret XML. The static image is put inside uplevel comments to hide it from Internet Explorer 5 or later.
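
The following sketch illustrates this rule under the assumption that only the comment markers are dropped while the markup between them (such as the static image) is kept. The helper name `stripUplevelMarkers` is hypothetical and the regular expression is a simplification.

```javascript
// Hypothetical helper: removes uplevel markers like <![if !vml]> and
// <![endif]> but keeps whatever markup sits between them.
function stripUplevelMarkers(html) {
  // The trailing "]>" keeps this from matching the "<![endif]-->" tail
  // of a downlevel comment, which ends in "]-->" instead.
  return html.replace(/<!\[(?:if[^\]]*|endif)\]>/gi, '');
}
```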

XML tags in the "o", "v", "w", "x", and "p" namespaces

Example

 
<o:p></o:p>
Paragraph mark formatting (if different from the paragraph) is lost. The <o:p></o:p> tags represent the character that Word treats as the paragraph mark.
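
As a sketch of how tags in these namespaces could be dropped, the hypothetical helper below removes opening, closing, and self-closing tags with an "o", "v", "w", "x", or "p" prefix. It assumes that any text between the tags should be kept, which may not match the editor's behavior for every element.

```javascript
// Hypothetical helper: strips tags such as <o:p>, </o:p> or <v:shape ...>
// while leaving ordinary tags like <p> and any enclosed text untouched.
function stripOfficeNamespaceTags(html) {
  return html.replace(/<\/?(?:o|v|w|x|p):[\w-]+(?:\s[^>]*)?\/?>/gi, '');
}
```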
@-rule definitions

Example

 
@page Section1 {size: 8.5in 11in }
Page settings, such as page dimensions and orientation, are lost:
  • @page contains document page setup information
  • @font-face contains document font definitions
  • @list contains Office-specific bulleted and numbered list style definitions

To keep the standard @-rule definitions, @page and @font-face, use the -a switch at the command prompt.
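
A minimal sketch of removing the three @-rules listed above from a block of CSS text. The helper name `stripOfficeAtRules` is hypothetical, and the pattern assumes flat rule bodies without nested braces, which is how Word usually writes them.

```javascript
// Hypothetical helper: removes @page, @font-face and @list rules from CSS.
// Assumes each rule body contains no nested { } blocks.
function stripOfficeAtRules(css) {
  return css.replace(/@(?:page|font-face|list)\b[^{}]*\{[^{}]*\}/gi, '');
}
```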

CSS comments containing /* and */

Example

 
/* List Definitions */
Minimal impact on HTML document.
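
Removing CSS comments can be sketched with a single lazy pattern. The helper name `stripCssComments` is hypothetical, and the sketch does not try to protect comment-like sequences inside quoted strings.

```javascript
// Hypothetical helper: removes every /* ... */ comment from CSS text.
function stripCssComments(css) {
  return css.replace(/\/\*[\s\S]*?\*\//g, '');
}
```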
VML attributes, or any attribute with a colon ( : ) in the attribute name

Example

 
v:shapes="_x000_i1025"
WordArt, clip art, and AutoShapes cannot be edited; only the resulting static image is left.
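
The sketch below shows one way to drop attributes whose name contains a colon. It walks the attributes of each tag in order, so that colons inside quoted values (for example in a style attribute) are not mistaken for attribute names. `stripColonAttributes` is a hypothetical helper, not the editor's own code.

```javascript
// Hypothetical helper: removes attributes like v:shapes="..." from every tag.
function stripColonAttributes(html) {
  // Handle each tag separately so text content is never touched.
  return html.replace(/<[a-z][^>]*>/gi, (tag) =>
    // Match one attribute at a time, consuming its (possibly quoted) value.
    tag.replace(
      /\s+([^\s=>\/]+)(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?/g,
      (attribute, name) => (name.includes(':') ? '' : attribute)
    )
  );
}
```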
ProgID <meta> tags

Example

 
<meta name=ProgID content=Word.Document>
Minimal impact on HTML document. ProgID identifies the application the file was created in.

You can also remove GENERATOR and ORIGINATOR META tags, which contain the information about the HTML document's originating program (for example, Word or Excel) and the latest generating program (Office HTML Filter). To remove the GENERATOR and ORIGINATOR META tags, use the -m switch at the command prompt.

Link elements with the rel attribute set to any of the following:
  • "file-list"
  • "edit-time-data"
  • "ole-object-data"
  • "original-file"
  • "preview"

Example

<link rel=File-List href="./mydoc_files/filelist.xml">
The association with all the special extra files that contain Office-specific data, such as OLE object binaries, is lost.
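
As an illustration of this rule, the hypothetical helper below removes link elements whose rel value is one of the five listed above and leaves other links, such as stylesheets, in place. It assumes a case-insensitive comparison of the rel value.

```javascript
// rel values taken from the list above.
const OFFICE_LINK_RELS = [
  'file-list', 'edit-time-data', 'ole-object-data', 'original-file', 'preview'
];

// Hypothetical helper: drops <link> elements that point at Office-specific files.
function stripOfficeLinks(html) {
  return html.replace(/<link\b[^>]*>/gi, (tag) => {
    // Read the rel value whether it is double-quoted, single-quoted or bare.
    const m = /\brel\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i.exec(tag);
    const rel = m ? (m[1] || m[2] || m[3] || '').toLowerCase() : '';
    return OFFICE_LINK_RELS.includes(rel) ? '' : tag;
  });
}
```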
The following XML namespace declarations - that is, the xmlns attribute setting:
  • "o"
  • "w"
  • "x"
  • "p"
  • "v"

Example

 
xmlns:v="urn:schemas-microsoft-com:vml"
The ability to render WordArt and clip art as vector images in the browser is lost. Instead, they become static images.

To keep VML in the file, use the -v switch at the command prompt.

If either -o or -v is used at the command prompt, the XML namespace declarations remain in the file.

Empty style attributes, especially when they become empty as a result of processing their values

Example

 
style=""
Minimal impact on HTML document.
"mso-" prefix properties

Example

 
mso-margin-top-alt: 12pt;
These properties hold Office-specific formatting and document settings that are used when the HTML document is opened in Office. Some features, such as footnotes and customized bullets and numbering, are lost. Word legacy frames become tables, and some edit-time language and font-formatting information is lost.

To keep mso- prefix properties and other Office-specific properties, use the -o switch at the command prompt.
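
The following sketch combines this rule with the empty style attribute rule: it drops declarations whose property name starts with "mso-" and removes the style attribute altogether when nothing is left. Both helper names are hypothetical, and the sketch assumes double-quoted style attributes and declaration values that contain no semicolons.

```javascript
// Hypothetical helper: filters "mso-" declarations out of one style value.
function stripMsoProperties(styleValue) {
  return styleValue
    .split(';')
    .map((declaration) => declaration.trim())
    .filter((declaration) => declaration !== '' && !/^mso-/i.test(declaration))
    .join('; ');
}

// Hypothetical helper: cleans every double-quoted style attribute and
// removes the attribute when it becomes empty.
function cleanStyleAttributes(html) {
  return html.replace(/\sstyle\s*=\s*"([^"]*)"/gi, (match, value) => {
    const cleaned = stripMsoProperties(value);
    return cleaned === '' ? '' : ` style="${cleaned}"`;
  });
}
```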

Other non-standard properties such as:
  • "tab-stops"
  • "tab-interval"
  • "language"
  • "text-underline"
  • "text-effect"
  • "text-line-through"
  • "font-color"
  • "horiz-align"
  • "list-image-1"
  • "list-image-2"
  • "list-image-3"
  • "separator-image"
  • "table-border-color-dark"
  • "table-border-color-light"
  • "vert-align"
  • "vnd.ms-excel.numberformat"

Example

 
tab-interval: .5in;
Tab settings are lost. All text underline styles become single underline. All underline colors become black. Engraved text and embossed text are lost.
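
A small sketch of how the listed properties could be filtered from a style value by exact name. The helper name `dropNonStandardDeclarations` is hypothetical, and the sketch assumes declaration values that contain no semicolons.

```javascript
// Property names taken from the list above.
const NON_STANDARD_PROPERTIES = new Set([
  'tab-stops', 'tab-interval', 'language', 'text-underline', 'text-effect',
  'text-line-through', 'font-color', 'horiz-align', 'list-image-1',
  'list-image-2', 'list-image-3', 'separator-image',
  'table-border-color-dark', 'table-border-color-light', 'vert-align',
  'vnd.ms-excel.numberformat'
]);

// Hypothetical helper: keeps only declarations whose property is not listed.
function dropNonStandardDeclarations(styleValue) {
  return styleValue
    .split(';')
    .map((declaration) => declaration.trim())
    .filter((declaration) => {
      if (declaration === '') return false;
      const colon = declaration.indexOf(':');
      if (colon === -1) return true; // not a name:value pair, leave it alone
      const name = declaration.slice(0, colon).trim().toLowerCase();
      return !NON_STANDARD_PROPERTIES.has(name);
    })
    .join('; ');
}
```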
Empty inline HTML elements: FONT, EM, STRONG, SAMP, ACRONYM, CITE, CODE, DFN, KBD, TT, B, I, U, S, SUB, SUP, INS, DEL, VAR, SPAN. An element is considered empty if it contains no displayable contents.

Example

 
<FONT COLOR=blue><B></B></FONT>
No impact on the display of the HTML document.
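
Removing empty inline elements can be sketched as a repeated pass, since removing an inner empty element may leave its parent empty as well. The helper below is hypothetical and treats an element as empty only when nothing at all sits between its tags, which is a narrower assumption than "no displayable contents".

```javascript
// Element names taken from the list above.
const EMPTY_INLINE_NAMES =
  'font|em|strong|samp|acronym|cite|code|dfn|kbd|tt|b|i|u|s|sub|sup|ins|del|var|span';

// Matches an opening tag immediately followed by its own closing tag.
const EMPTY_INLINE_PATTERN =
  new RegExp(`<(${EMPTY_INLINE_NAMES})\\b[^>]*></\\1\\s*>`, 'gi');

// Hypothetical helper: repeats the removal until the markup stops changing,
// so nested empty elements such as <FONT><B></B></FONT> disappear entirely.
function removeEmptyInlineElements(html) {
  let previous;
  do {
    previous = html;
    html = html.replace(EMPTY_INLINE_PATTERN, '');
  } while (html !== previous);
  return html;
}
```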
Last modified on Nov 27, 2009, 3:28:12 PM