Opened 8 years ago

Closed 7 years ago

#4395 closed New Feature (fixed)

Use htmldataprocessor to refactor pasting processor

Reported by: garry.yao Owned by: garry.yao
Priority: Normal Milestone: CKEditor 3.1
Component: General Version:
Keywords: Paste Confirmed Cc:

Description (last modified by garry.yao)

We should start using htmldataprocessor to process pasted input, instead of the current implementation, which is based exclusively on regexps. Such an infrastructure would bring several benefits:

  1. It allows structural transformations of the source rather than simple cleanup, e.g. MS Word middot bullets -> HTML unordered lists;
  2. It leverages all the existing rules we currently have for output, e.g. flash objects, namespaced tags;
  3. It will be much easier for developers to extend/customize by adding/altering the rules.

Change History (5)

comment:1 Changed 8 years ago by garry.yao

  • Description modified (diff)
  • Status changed from new to assigned
  • Summary changed from "Use htmldataprocessor to refactor pasting clean up" to "Use htmldataprocessor to refactor pasting processor"

comment:2 Changed 8 years ago by garry.yao

  • Keywords Paste added

Changes committed with [4207] in pasting branch.

comment:3 Changed 8 years ago by garry.yao

Migrated all the regexp-based rules in the 'cleanWord' function to filter rules with [4208].

comment:4 Changed 8 years ago by garry.yao

There is one significant impedance mismatch between the old regexp-based approach and the current filter-based one:
The old approach is linear, multiple-pass parsing, while our html filter is a top-down, one-pass procedure, which makes some of the rules difficult to migrate.

Consider the following example, which should be cleaned up to a single &nbsp;.

	<span lang=EN-GB style='font-family:Calibri'>
		<o:p> &nbsp;</o:p>
	</span>

The old rules related to this were:

html = html.replace(/<o:p>\s*<\/o:p>/g, '') ;
html = html.replace(/<o:p>[\s\S]*?<\/o:p>/g, '&nbsp;') ;
html = html.replace( /<SPAN\s*[^>]*>\s*&nbsp;\s*<\/SPAN>/gi, '&nbsp;' ) ;
html = html.replace( /<SPAN\s*[^>]*><\/SPAN>/gi, '' ) ;
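To make the multiple-pass behaviour concrete, here is a minimal sketch that runs the four old rules, in order, over the example above. The function name cleanWithRegexps and the closing </span> in the sample string are assumptions added for illustration; only the four replace calls come from the old code.

```javascript
// Hypothetical wrapper around the four old regexp rules, applied in order.
function cleanWithRegexps(html) {
  // Pass 1: remove <o:p> elements that contain only whitespace.
  html = html.replace(/<o:p>\s*<\/o:p>/g, '');
  // Pass 2: collapse any remaining <o:p> element into a single &nbsp;.
  html = html.replace(/<o:p>[\s\S]*?<\/o:p>/g, '&nbsp;');
  // Pass 3: a span wrapping only &nbsp; becomes a bare &nbsp;.
  html = html.replace(/<SPAN\s*[^>]*>\s*&nbsp;\s*<\/SPAN>/gi, '&nbsp;');
  // Pass 4: remove spans that are completely empty.
  html = html.replace(/<SPAN\s*[^>]*><\/SPAN>/gi, '');
  return html;
}

// Sample input; the closing </span> is assumed.
var sample = "<span lang=EN-GB style='font-family:Calibri'>\n\t<o:p> &nbsp;</o:p>\n</span>";
var cleaned = cleanWithRegexps(sample);
console.log(JSON.stringify(cleaned));
```

Pass 3 only matches because pass 2 has already rewritten the inner <o:p>. That ordering dependency is what a single top-down pass cannot reproduce.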

Ideally the new rules would be the following, but this is actually wrong because the 'span' rule will always be executed first (as determined by tree order):

elements :
{
	$ : function( element )
	{
		var tagName = element.name;

		if ( tagName == 'span' )
		{
			var child;
			if ( ( child = onlyChildOf( element ) )
				 && /^(?:\s|&nbsp;)+$/.exec( child.value ) )
			{
				// ...Drop this element, preserve children...
			}
		}
		else if ( tagName == 'o:p' )
		{
			// ...Drop this element, preserve children...
		}
	}
}

In such a case, the filter needs a mechanism to perform the filtering from bottom to top (allowing children to be filtered before the element itself); in this concrete example it would execute the <o:p> rule first, then the <span> rule.

I'm adding a function, CKEDITOR.htmlParser.element::filterChildren, to allow this when necessary, as shown below. The changes were checked in on the pasting branch with [4218].

if ( tagName == 'span' )
{
	// Filter the children first.
	element.filterChildren();

	var child;
	if ( ( child = onlyChildOf( element ) )
		 && /^(?:\s|&nbsp;)+$/.exec( child.value ) )
	{
		// ...Drop this element, preserve children...
	}
}
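The difference between the two traversal orders can be sketched with a toy tree model. This is not CKEditor's parser or filter API: the node shapes, text, el, applyRule, filterNodes and serialize are all hypothetical helpers written for this illustration, and the childrenFirst flag only plays the role that filterChildren plays in the real filter.

```javascript
// Toy nodes (hypothetical): text nodes are { value }, elements are { name, children }.
function text(value) { return { value: value }; }
function el(name, children) { return { name: name, children: children }; }

function onlyChildOf(element) {
  return element.children.length === 1 ? element.children[0] : null;
}

// Returns true when the element's tags should be dropped and its children kept.
function applyRule(element) {
  if (element.name === 'o:p') return true;
  if (element.name === 'span') {
    var child = onlyChildOf(element);
    return !!child && child.name === undefined &&
      /^(?:\s|&nbsp;)+$/.test(child.value);
  }
  return false;
}

// childrenFirst = false: rule runs before the children are filtered (top-down).
// childrenFirst = true: children are filtered before the rule runs (bottom-up).
function filterNodes(nodes, childrenFirst) {
  var out = [];
  nodes.forEach(function (node) {
    if (node.name === undefined) { out.push(node); return; }
    var current = childrenFirst
      ? el(node.name, filterNodes(node.children, true))
      : node;
    var drop = applyRule(current);
    var kids = childrenFirst ? current.children : filterNodes(node.children, false);
    if (drop) out.push.apply(out, kids);
    else out.push(el(node.name, kids));
  });
  return out;
}

function serialize(nodes) {
  return nodes.map(function (n) {
    return n.name === undefined
      ? n.value
      : '<' + n.name + '>' + serialize(n.children) + '</' + n.name + '>';
  }).join('');
}

var tree = [el('span', [el('o:p', [text(' &nbsp;')])])];
var topDown = serialize(filterNodes(tree, false));
var bottomUp = serialize(filterNodes(tree, true));
console.log(JSON.stringify(topDown), JSON.stringify(bottomUp));
```

In the top-down run the span rule sees an <o:p> element as its only child, so the span survives. In the bottom-up run the <o:p> has already been unwrapped, so the span rule sees the blank text and drops the span too.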

comment:5 Changed 7 years ago by garry.yao

  • Resolution set to fixed
  • Status changed from assigned to closed