Opened 11 years ago

Closed 7 years ago

#9991 closed Task (fixed)

Paste from word should only normalize input data

Reported by: Piotrek Koszuliński Owned by: Tade0
Priority: Normal Milestone: CKEditor 4.6.0
Component: Plugin : Paste from Word Version:
Keywords: Drupal Cc: wim.leers@…

Description (last modified by Jakub Ś)

Transformations and clean up are done by #9829 and #9989. Thus, PFW has to be rewritten so it won't double these core features.

Spec: https://gist.github.com/Reinmar/cc012b20bf37be84a523 (WIP)

Issues fixed by new implementation

#14660

Attachments (3)

whiteBg.png (105.0 KB) - added by Marek Lewandowski 8 years ago.
broken-list.docx (14.1 KB) - added by Marek Lewandowski 8 years ago.
List causing an exception
fixture-utility-manual.md (621 bytes) - added by Tade0 8 years ago.
Manual for the client side of the PFW fixture generator.

Download all attachments as: .zip

Change History (47)

comment:1 Changed 11 years ago by Piotrek Koszuliński

Status: newconfirmed

comment:2 Changed 11 years ago by Piotrek Koszuliński

Type: BugTask

comment:3 Changed 11 years ago by Piotrek Koszuliński

Description: modified (diff)

comment:4 Changed 11 years ago by Anna Tomanek

Adding #10267 as a test case for this issue.

comment:5 Changed 11 years ago by Piotrek Koszuliński

Milestone: CKEditor 4.2

This ticket is about rewriting PFW from scratch and this is immense amount of work, so it will not be included in 4.2.

comment:6 Changed 11 years ago by Piotrek Koszuliński

Description: modified (diff)

comment:7 Changed 11 years ago by Piotrek Koszuliński

#10327 should go before this ticket.

comment:8 Changed 11 years ago by Piotrek Koszuliński

<td style="background:Yellow">...</td>
or:
<span style="background:Yellow">...</span>

We need transformation that will extract background-color which we understand, from background which we remove. We should add these transformations for more CSS properties. I'm not only sure how can we automatically create them from e.g. config.colorButton_backStyle. We may need to add new settings for styles which would guide the filter.

Duplicate: #10364.

Last edited 11 years ago by Piotrek Koszuliński (previous) (diff)

comment:9 Changed 11 years ago by Piotrek Koszuliński

Another rooms for improvements can be find in the way we fixed #10348:

  • transformations could accept elements list instead of one element only,
  • transformations for align attribute on blocks is needed.

comment:10 Changed 11 years ago by Jakub Ś

#10558 was marked as duplicate.

comment:11 Changed 10 years ago by Piotrek Koszuliński

Milestone: CKEditor 4.4

comment:12 Changed 10 years ago by Frederico Caldeira Knabben

Component: Core : PastingPlugin : Paste from Word

comment:13 Changed 10 years ago by Frederico Caldeira Knabben

We’re not supposed to drastically change the way things are done right now. Most of the changes are to happen in the cleanup logic.

The following are some notes about how things (will) work.

  • Basic workflow:
    1. Listens for the “paste” with priority “3”
    2. Sniffs the paste date for word stuff.
    3. If from word, calls CKEDITOR.cleanWord.
    4. Updates the paste event data with the cleaned data.
  • It allows to have custom "cleanWord” implementations through configuration and on-demand file loading. I’m unsure about the point about this as one could eventually simply write a plugin that would override the default plugin behavior. Anyway, we don’t have to change this right now and it can be a v5 thing.
  • The default cleanWord implementation will not do the deep cleanup it does today. It’ll instead “normalize” the HTML in a way that it has proper semantics, using proper HTML features. The cleanup part will be left to ACF. This is all about this ticket, in fact.
  • The default cleanWord implementation is based on data processing, mainly.
  • What to do with the current configurations:
    • pasteFromWordPromptCleanup: I don’t feel that this is necessary any more… actually it may still exist but by default it may depend on ACF, if it’s enabled.
    • pasteFromWordRemoveFontStyles: we can keep this
    • pasteFromWordNumberedHeadingToList: ???
    • pasteFromWordRemoveStyles: I have the impression that this is not needed any more. Only if we want to keep it so one could decide to have cleaner paste from Word, even if the editor support style features.

So the current idea is developing tests and write the cleanup logic from scratch, reusing what’s is reusable from the current implementation.

comment:14 Changed 10 years ago by Wim Leers

Any idea on the ETA?

This is one of the things we need to ship Drupal 8. There's no beta yet, so there's still some time, but we definitely need to close https://drupal.org/node/1955300 before release.

comment:15 Changed 10 years ago by Piotrek Koszuliński

Milestone: CKEditor 4.4CKEditor 4.5

The ETA is 4.5 (so July or August - we may still change ETA for 4.5). More delay is very unlikely because of our plans regarding 4.5 and 5.

comment:16 Changed 10 years ago by Piotrek Koszuliński

Milestone: CKEditor 4.5.0CKEditor 4.6.0

We're sad when we have to do that, but we weren't able to find resources for Paste From Word for the last couple of months. We have to release the drag and drop and file upload support which is ready for in ~75% as soon as possible, so I'm postponing PFW to the next major release.

comment:17 Changed 9 years ago by Piotrek Koszuliński

comment:18 Changed 9 years ago by Piotrek Koszuliński

Description: modified (diff)

I wrote a first draft of the spec (or rather guidelines): https://gist.github.com/Reinmar/cc012b20bf37be84a523

comment:19 Changed 9 years ago by Tade0

Owner: set to Tade0
Status: confirmedassigned

comment:20 Changed 9 years ago by Wim Leers

Awesome!

comment:21 Changed 9 years ago by Tade0

The status of the work is as follows:

A set of virtual machines has been deployed, containing the most crucial combinations of MS Word/IE versions. An application is available on each of the VMs that accepts *.docx files as an input and outputs the HTML that would normally be generated when the contents of a Word file are pasted into the editor. An additional command-line utility enables remote access to the applications.

The point of this exercise is to automatically generate fixtures for unit tests. A growing number of example *.docx files is now available for processing.

Next step: Creating the first batch of tests using the generated fixtures.

comment:22 Changed 8 years ago by Tade0

Status: assignedreview

Status update:

Most of the core features of the plugin have been rewritten, as confirmed by 17 test cases(each one checked with Word 2007, 2013 and browsers: Chrome, Firefox, IE8, IE11).

comment:23 Changed 8 years ago by Piotrek Koszuliński

Status: reviewreview_failed

I made a quick review of the code that you pushed to branch:t/9991. Here's the result:

  1. The fixtures MS Word files are missing in the repo.
  2. We need instructions on how to generate HTML fixtures from the MS Word files (not public, because it requires our infrastructure).
  3. The branch must be rebased before putting it on review.
  4. Tests should not be inside tests/tickets/9991. The proper place is inside PFW plugin's dir. Note that they should be separeted from the tests for the legacy filter.
  5. Tests for the previous filter should be tagged and configured to use the previous filter (they are failing right now).
  6. Remember about strict mode.
  7. Don't put a blank line at the beginning of a function or block (default.js L14, L336).
  8. Separate blocks with a blank line (applies to blocks like elements, elementNames, all element names in default.js).
  9. setSymbol() (switch block) – what if a list start with item "3" so no b., no II. etc. Using regexp than case item statements will be more flexible.
  10. L138 children[0] => children[ 0 ] (no JSCS rule for that because the existing one is broken).
  11. L159-177 - use one nope function.

However, more important is what has to be done with the task in general:

  1. We need to make sure that the new filter is at least as good as the old one. AFAICS the existing tests are oriented around the simple cases. We need to create tests for at least some of the fixed bugs in the old PFW filter. So the first thing that we need to do is to stabilise the new filter.
  2. The second thing is to actually implement the spec that I wrote (comment:18), because currently the new filter works in nearly the same manner as the old one, so it normalize the wordy-HTML to standard-HTML and then also it does the transformation and filtering which it should not do. So we need to take the current filter and move some of its code to ACF's transformations and filtering. This task can be done in a separate ticket, because this will be mostly transparent for the users. It will however make the PFW feature more configurable (through the ACF's settings).

comment:24 in reply to:  23 Changed 8 years ago by Tade0

Replying to Reinmar:

I made a quick review of the code that you pushed to branch:t/9991. Here's the result:

  1. The fixtures MS Word files are missing in the repo.
  2. We need instructions on how to generate HTML fixtures from the MS Word files (not public, because it requires our infrastructure).
  3. The branch must be rebased before putting it on review.
  4. Tests should not be inside tests/tickets/9991. The proper place is inside PFW plugin's dir. Note that they should be separeted from the tests for the legacy filter.
  5. Tests for the previous filter should be tagged and configured to use the previous filter (they are failing right now).
  6. Remember about strict mode.
  7. Don't put a blank line at the beginning of a function or block (default.js L14, L336).

We need a JSCS rule for that then, spanning the whole project.

  1. Separate blocks with a blank line (applies to blocks like elements, elementNames, all element names in default.js).

Same as above.

  1. setSymbol() (switch block) – what if a list start with item "3" so no b., no II. etc. Using regexp than case item statements will be more flexible.

Simply using regexps doesn't solve this. What if a list starts with an "i."? No way of telling if it's a roman numeral or a letter. And this is only the tip of the iceberg. This function was designed for the limited scope that this iteration of PFW has. I'll correct it as soon as we have the proper test cases.

comment:25 Changed 8 years ago by Piotrek Koszuliński

Note that introducing some JSCS rules may require fixing 1000 files. Although, simple stuff like this may, most likely, be fixed with automatic JSCS fixer.

comment:26 Changed 8 years ago by Piotrek Koszuliński

And feel free of course to propose such changes :)

comment:27 Changed 8 years ago by Piotrek Koszuliński

Simply using regexps doesn't solve this. What if a list starts with an "i."? No way of telling if it's a roman numeral or a letter.

It's an edge case. Check first for roman literars and later for alpha numeric case and you'll match well in satisfying number of cases.

comment:28 Changed 8 years ago by Tade0

Status: review_failedreview

Rebased and fixed. Changes pushed to branch:t/9991a.

comment:29 Changed 8 years ago by Piotrek Koszuliński

Status: reviewreview_failed

I've just checked the case from #7262 out of curiosity and the result is really bad in Chrome:

<pre style="text-align:justify">
<!--[if--><span style="font-family:symbol"><span style="font-size:12.0pt">&middot;      </span></span><!--[endif]--><span style="font-family:tahoma"><span style="font-size:12.0pt">Text1</span></span><!--![endif]--><!--![if--></pre>

<p><!--[if--><!--[endif]--></p>

<pre style="text-align:justify">
<!--[if--><span style="font-family:symbol"><span style="font-size:12.0pt">&middot;      </span></span><!--[endif]--><span style="font-family:tahoma"><span style="font-size:12.0pt">Text2</span></span><!--![endif]--><!--![if--></pre>

<p><!--[if--><!--[endif]--></p>

<pre style="text-align:justify">
<!--[if--><span style="font-family:symbol"><span style="font-size:12.0pt">&middot;      </span></span><!--[endif]--><span style="font-family:tahoma"><span style="font-size:12.0pt">Text3</span></span><!--![endif]--><!--![if--></pre>

<p><!--[if--><!--[endif]--><!--![endif]--><!--![if--><!--![endif]--><!--![if--><!--![endif]--><!--![if--></p>

As I mentioned – we need A LOT more tests before we're able to accept this new filter.

I remember we were discussing generating fixtures for all the docx files that we have. How's that work going?

comment:30 Changed 8 years ago by Marek Lewandowski

Problem with leaking comments seems to be fixed. Please cleanup unused branches on remote, rename t/9991b to t/9991 and remove all other branches from the remote.

Did some reviewing last couple of days and here are the notes so far:

problems:

  • failing tests,
  • font plugin:
    • ACF selectors must be as narrow as possible - this one essentially allows to put any mess inside span[style]. Same thing for other plugins.
    • add new transformation to existing font plugin features instead defining new entries
    • transformations should ensure that target transformation can be applied. Assune a situation when ACF is set in a way that it allows font[http://docs.ckeditor.com/#!/api/CKEDITOR.filter-method-addTransformations face] but does not allow span{font-family} with current implementation it would be converted to span (no matter what) and then... removed by ACF. You can do this with [check property].

  • plugin.js it should assign element.styles[ 'text-align' ],
  • default.js use hasClass instead,
  • default.js - I believe returning false does filter out the element, so you don't need to rely on internals (so that removing element name deletes the elem),

style:

Version 0, edited 8 years ago by Marek Lewandowski (next)

comment:31 Changed 8 years ago by Tade0

Status: review_failedreview

Fixed, changes pushed to branch:t/9991.

One thing that hasn't been changed was:

hasClass wouldn't work here, because the aim here is not to do an exact match, but to check whether there's a class that contains the string MsoListParagraph.

comment:32 Changed 8 years ago by Marek Lewandowski

  • you shouldn't create new CKEDITOR_MOCK global variable. Instead create CKEDITOR.plugins.pastefromword static namespace and put methods over there. However mark members that you don't want to be used outside of pfw as "@private".
    • Now within this namespace, create members like CKEDITOR.plugins.pastefromword.lists, where all list-related members should go.
    • createLists is too long and does too many things. You should extract multiple parts:
  • we can't be so eager to drop some attributes from resetStyles
    • white background - you could easily produce nested content that will be broken by this rule.

See whiteBg.png

  • this switch can be simplifed down to if. I can see that idea behind it might be for simplicity to extend down the road, but we can keep it simple here, refactorization will happen as needed.
  • let's simplify it down to normalizedStyles( element ) || false;
  • there area 2 "13590CkEditor-NumberList" directories in Tickets directory with a different casing. It looks like redundancy.
Last edited 8 years ago by Marek Lewandowski (previous) (diff)

Changed 8 years ago by Marek Lewandowski

Attachment: whiteBg.png added

comment:33 Changed 8 years ago by Tade0

Applied the changes from comment:32, also implemented the rest of the features that were required to cover all the closed tickets concerning PFW.

Some cases have inputs(chiefly from IE11) that still create conflicts. These tickets are: 7131, 7209, 7872, 7581("stupid list" case).

Most probably there is no real way to achieve the desired output in all these cases, so eventually we will have to decide which are more important and cover them.

Changes pushed to: branch:t/9991.

EDIT: rebased with major branch.

Last edited 8 years ago by Tade0 (previous) (diff)

comment:34 Changed 8 years ago by Marek Lewandowski

Note: It make sense to check if TC from #14544 is still valid here.

comment:35 Changed 8 years ago by Jakub Ś

Description: modified (diff)

Changed 8 years ago by Marek Lewandowski

Attachment: broken-list.docx added

List causing an exception

comment:36 Changed 8 years ago by Marek Lewandowski

I've been reviewing PFW again for a while now, and I found a case where pasting content would cause an exception. It's easy to fix for me, however I need to prepare proper tests to prevent us from regressions.

@Tade0 could you please provide some sort of instructions on generating a proper Word test?

I'm adding the docx fixture as an attachement ("broken-list.docx").

Changed 8 years ago by Tade0

Attachment: fixture-utility-manual.md added

Manual for the client side of the PFW fixture generator.

comment:37 Changed 8 years ago by Tade0

I've included an attachment that should serve as a guide how to generate tests.

comment:38 Changed 8 years ago by Marek Lewandowski

Test Custom_list_markers

I've fixed a bug where this TC would cause an exception. With that I discovered couple of things:

  • List should be converted to unordered list (it's identified as such in MS Word). Instead it's converted to two ordered lists, and list gets start value 19. That's actually a regression to a current implementation.
  • Since we're not supporting custom markers yet, we should filter out custom markers. Those could be recognized by:
    • Checking if src starts with "file://" protocol and have an alt equal to "*".

comment:39 Changed 8 years ago by Tade0

Added this TC and made changes so that there is an unordered list there.

Changes pushed to branch:t/9991.

As for the src starting with file:// - regular images are also pasted like that, so it can't be counted as a distinguishing feature. That alt is uncommon enough though.

comment:40 in reply to:  39 ; Changed 8 years ago by Marek Lewandowski

Replying to Tade0:

As for the src starting with file:// - regular images are also pasted like that, so it can't be counted as a distinguishing feature. That alt is uncommon enough though.

Yes, this is why both conditions must be met.

comment:41 in reply to:  40 Changed 8 years ago by Tade0

Replying to m.lewandowski:

Replying to Tade0:

As for the src starting with file:// - regular images are also pasted like that, so it can't be counted as a distinguishing feature. That alt is uncommon enough though.

Yes, this is why both conditions must be met.

Thing is, it's extremely hard not to meet that first condition, therefore it doesn't really check anything(it would still be possible to create a test case which would break here).

Alternatively, as an additional condition it could be checked whether the image is wrapped in a <span> with style="mso-list:Ignore" - that's a telltale sign that the image is used as a symbol.

comment:42 Changed 8 years ago by Tade0

Implemented the solution described in the previous comment.

Changes pushed to branch:t/9991.

comment:43 Changed 8 years ago by Tade0

Since the tests are prone to breaking due to the sheer amount of them I'll cut them into manageable pieces.

comment:44 Changed 7 years ago by Marek Lewandowski

Resolution: fixed
Status: reviewclosed
Note: See TracTickets for help on using tickets.
© 2003 – 2022, CKSource sp. z o.o. sp.k. All rights reserved. | Terms of use | Privacy policy