Opened 4 years ago

Closed 5 months ago

#9991 closed Task (fixed)

Paste from word should only normalize input data

Reported by: Reinmar Owned by: Tade0
Priority: Normal Milestone: CKEditor 4.6.0
Component: Plugin : Paste from Word Version:
Keywords: Drupal Cc: wim.leers@…

Description (last modified by j.swiderski)

Transformations and clean up are done by #9829 and #9989. Thus, PFW has to be rewritten so it won't double these core features.

Spec: https://gist.github.com/Reinmar/cc012b20bf37be84a523 (WIP)

Issues fixed by new implementation

#14660

Attachments (3)

whiteBg.png (105.0 KB) - added by m.lewandowski 13 months ago.
broken-list.docx (14.1 KB) - added by m.lewandowski 9 months ago.
List causing an exception
fixture-utility-manual.md (621 bytes) - added by Tade0 9 months ago.
Manual for the client side of the PFW fixture generator.

Download all attachments as: .zip

Change History (47)

comment:1 Changed 4 years ago by Reinmar

  • Status changed from new to confirmed

comment:2 Changed 4 years ago by Reinmar

  • Type changed from Bug to Task

comment:3 Changed 4 years ago by Reinmar

  • Description modified (diff)

comment:4 Changed 4 years ago by Anna

Adding #10267 as a test case for this issue.

comment:5 Changed 4 years ago by Reinmar

  • Milestone CKEditor 4.2 deleted

This ticket is about rewriting PFW from scratch and this is immense amount of work, so it will not be included in 4.2.

comment:6 Changed 4 years ago by Reinmar

  • Description modified (diff)

comment:7 Changed 4 years ago by Reinmar

#10327 should go before this ticket.

comment:8 Changed 4 years ago by Reinmar

<td style="background:Yellow">...</td>
or:
<span style="background:Yellow">...</span>

We need transformation that will extract background-color which we understand, from background which we remove. We should add these transformations for more CSS properties. I'm not only sure how can we automatically create them from e.g. config.colorButton_backStyle. We may need to add new settings for styles which would guide the filter.

Duplicate: #10364.

Last edited 4 years ago by Reinmar (previous) (diff)

comment:9 Changed 4 years ago by Reinmar

Another rooms for improvements can be find in the way we fixed #10348:

  • transformations could accept elements list instead of one element only,
  • transformations for align attribute on blocks is needed.

comment:10 Changed 4 years ago by j.swiderski

#10558 was marked as duplicate.

comment:11 Changed 3 years ago by Reinmar

  • Milestone set to CKEditor 4.4

comment:12 Changed 3 years ago by fredck

  • Component changed from Core : Pasting to Plugin : Paste from Word

comment:13 Changed 3 years ago by fredck

We’re not supposed to drastically change the way things are done right now. Most of the changes are to happen in the cleanup logic.

The following are some notes about how things (will) work.

  • Basic workflow:
    1. Listens for the “paste” with priority “3”
    2. Sniffs the paste date for word stuff.
    3. If from word, calls CKEDITOR.cleanWord.
    4. Updates the paste event data with the cleaned data.
  • It allows to have custom "cleanWord” implementations through configuration and on-demand file loading. I’m unsure about the point about this as one could eventually simply write a plugin that would override the default plugin behavior. Anyway, we don’t have to change this right now and it can be a v5 thing.
  • The default cleanWord implementation will not do the deep cleanup it does today. It’ll instead “normalize” the HTML in a way that it has proper semantics, using proper HTML features. The cleanup part will be left to ACF. This is all about this ticket, in fact.
  • The default cleanWord implementation is based on data processing, mainly.
  • What to do with the current configurations:
    • pasteFromWordPromptCleanup: I don’t feel that this is necessary any more… actually it may still exist but by default it may depend on ACF, if it’s enabled.
    • pasteFromWordRemoveFontStyles: we can keep this
    • pasteFromWordNumberedHeadingToList: ???
    • pasteFromWordRemoveStyles: I have the impression that this is not needed any more. Only if we want to keep it so one could decide to have cleaner paste from Word, even if the editor support style features.

So the current idea is developing tests and write the cleanup logic from scratch, reusing what’s is reusable from the current implementation.

comment:14 Changed 3 years ago by Wim Leers

Any idea on the ETA?

This is one of the things we need to ship Drupal 8. There's no beta yet, so there's still some time, but we definitely need to close https://drupal.org/node/1955300 before release.

comment:15 Changed 3 years ago by Reinmar

  • Milestone changed from CKEditor 4.4 to CKEditor 4.5

The ETA is 4.5 (so July or August - we may still change ETA for 4.5). More delay is very unlikely because of our plans regarding 4.5 and 5.

comment:16 Changed 2 years ago by Reinmar

  • Milestone changed from CKEditor 4.5.0 to CKEditor 4.6.0

We're sad when we have to do that, but we weren't able to find resources for Paste From Word for the last couple of months. We have to release the drag and drop and file upload support which is ready for in ~75% as soon as possible, so I'm postponing PFW to the next major release.

comment:18 Changed 19 months ago by Reinmar

  • Description modified (diff)

I wrote a first draft of the spec (or rather guidelines): https://gist.github.com/Reinmar/cc012b20bf37be84a523

comment:19 Changed 19 months ago by Tade0

  • Owner set to Tade0
  • Status changed from confirmed to assigned

comment:20 Changed 19 months ago by Wim Leers

Awesome!

comment:21 Changed 18 months ago by Tade0

The status of the work is as follows:

A set of virtual machines has been deployed, containing the most crucial combinations of MS Word/IE versions. An application is available on each of the VMs that accepts *.docx files as an input and outputs the HTML that would normally be generated when the contents of a Word file are pasted into the editor. An additional command-line utility enables remote access to the applications.

The point of this exercise is to automatically generate fixtures for unit tests. A growing number of example *.docx files is now available for processing.

Next step: Creating the first batch of tests using the generated fixtures.

comment:22 Changed 17 months ago by Tade0

  • Status changed from assigned to review

Status update:

Most of the core features of the plugin have been rewritten, as confirmed by 17 test cases(each one checked with Word 2007, 2013 and browsers: Chrome, Firefox, IE8, IE11).

comment:23 follow-up: Changed 17 months ago by Reinmar

  • Status changed from review to review_failed

I made a quick review of the code that you pushed to branch:t/9991. Here's the result:

  1. The fixtures MS Word files are missing in the repo.
  2. We need instructions on how to generate HTML fixtures from the MS Word files (not public, because it requires our infrastructure).
  3. The branch must be rebased before putting it on review.
  4. Tests should not be inside tests/tickets/9991. The proper place is inside PFW plugin's dir. Note that they should be separeted from the tests for the legacy filter.
  5. Tests for the previous filter should be tagged and configured to use the previous filter (they are failing right now).
  6. Remember about strict mode.
  7. Don't put a blank line at the beginning of a function or block (default.js L14, L336).
  8. Separate blocks with a blank line (applies to blocks like elements, elementNames, all element names in default.js).
  9. setSymbol() (switch block) – what if a list start with item "3" so no b., no II. etc. Using regexp than case item statements will be more flexible.
  10. L138 children[0] => children[ 0 ] (no JSCS rule for that because the existing one is broken).
  11. L159-177 - use one nope function.

However, more important is what has to be done with the task in general:

  1. We need to make sure that the new filter is at least as good as the old one. AFAICS the existing tests are oriented around the simple cases. We need to create tests for at least some of the fixed bugs in the old PFW filter. So the first thing that we need to do is to stabilise the new filter.
  2. The second thing is to actually implement the spec that I wrote (comment:18), because currently the new filter works in nearly the same manner as the old one, so it normalize the wordy-HTML to standard-HTML and then also it does the transformation and filtering which it should not do. So we need to take the current filter and move some of its code to ACF's transformations and filtering. This task can be done in a separate ticket, because this will be mostly transparent for the users. It will however make the PFW feature more configurable (through the ACF's settings).

comment:24 in reply to: ↑ 23 Changed 17 months ago by Tade0

Replying to Reinmar:

I made a quick review of the code that you pushed to branch:t/9991. Here's the result:

  1. The fixtures MS Word files are missing in the repo.
  2. We need instructions on how to generate HTML fixtures from the MS Word files (not public, because it requires our infrastructure).
  3. The branch must be rebased before putting it on review.
  4. Tests should not be inside tests/tickets/9991. The proper place is inside PFW plugin's dir. Note that they should be separeted from the tests for the legacy filter.
  5. Tests for the previous filter should be tagged and configured to use the previous filter (they are failing right now).
  6. Remember about strict mode.
  7. Don't put a blank line at the beginning of a function or block (default.js L14, L336).

We need a JSCS rule for that then, spanning the whole project.

  1. Separate blocks with a blank line (applies to blocks like elements, elementNames, all element names in default.js).

Same as above.

  1. setSymbol() (switch block) – what if a list start with item "3" so no b., no II. etc. Using regexp than case item statements will be more flexible.

Simply using regexps doesn't solve this. What if a list starts with an "i."? No way of telling if it's a roman numeral or a letter. And this is only the tip of the iceberg. This function was designed for the limited scope that this iteration of PFW has. I'll correct it as soon as we have the proper test cases.

comment:25 Changed 17 months ago by Reinmar

Note that introducing some JSCS rules may require fixing 1000 files. Although, simple stuff like this may, most likely, be fixed with automatic JSCS fixer.

comment:26 Changed 17 months ago by Reinmar

And feel free of course to propose such changes :)

comment:27 Changed 16 months ago by Reinmar

Simply using regexps doesn't solve this. What if a list starts with an "i."? No way of telling if it's a roman numeral or a letter.

It's an edge case. Check first for roman literars and later for alpha numeric case and you'll match well in satisfying number of cases.

comment:28 Changed 16 months ago by Tade0

  • Status changed from review_failed to review

Rebased and fixed. Changes pushed to branch:t/9991a.

comment:29 Changed 16 months ago by Reinmar

  • Status changed from review to review_failed

I've just checked the case from #7262 out of curiosity and the result is really bad in Chrome:

<pre style="text-align:justify">
<!--[if--><span style="font-family:symbol"><span style="font-size:12.0pt">&middot;      </span></span><!--[endif]--><span style="font-family:tahoma"><span style="font-size:12.0pt">Text1</span></span><!--![endif]--><!--![if--></pre>

<p><!--[if--><!--[endif]--></p>

<pre style="text-align:justify">
<!--[if--><span style="font-family:symbol"><span style="font-size:12.0pt">&middot;      </span></span><!--[endif]--><span style="font-family:tahoma"><span style="font-size:12.0pt">Text2</span></span><!--![endif]--><!--![if--></pre>

<p><!--[if--><!--[endif]--></p>

<pre style="text-align:justify">
<!--[if--><span style="font-family:symbol"><span style="font-size:12.0pt">&middot;      </span></span><!--[endif]--><span style="font-family:tahoma"><span style="font-size:12.0pt">Text3</span></span><!--![endif]--><!--![if--></pre>

<p><!--[if--><!--[endif]--><!--![endif]--><!--![if--><!--![endif]--><!--![if--><!--![endif]--><!--![if--></p>

As I mentioned – we need A LOT more tests before we're able to accept this new filter.

I remember we were discussing generating fixtures for all the docx files that we have. How's that work going?

comment:30 Changed 15 months ago by m.lewandowski

Problem with leaking comments seems to be fixed. Please cleanup unused branches on remote, rename t/9991b to t/9991 and remove all other branches from the remote.

Did some reviewing last couple of days and here are the notes so far:

problems:

  • failing tests,
  • font plugin:
    • ACF selectors must be as narrow as possible - this one essentially allows to put any mess inside span[style]. Same thing for other plugins.
    • add new transformation to existing font plugin features instead defining new entries
    • transformations should ensure that target transformation can be applied. Assune a situation when ACF is set in a way that it allows font[http://docs.ckeditor.com/#!/api/CKEDITOR.filter-method-addTransformations face] but does not allow span{font-family} with current implementation it would be converted to span (no matter what) and then... removed by ACF. You can do this with [check property].

  • plugin.js it should assign element.styles[ 'text-align' ],
  • default.js use hasClass instead,
  • default.js - I believe returning false does filter out the element, so you don't need to rely on internals (so that removing element name deletes the elem),

style:

  • empty comment in default.js,
  • instead disabling default.js jshint error use element.attributes[ 'class' ] for reserved keywords,
  • default.js always use braces + convertToFakeListItem should go to next line,
  • simply use ifs instead of doing one-liner expression like in plugin.js
Last edited 15 months ago by m.lewandowski (previous) (diff)

comment:31 Changed 15 months ago by Tade0

  • Status changed from review_failed to review

Fixed, changes pushed to branch:t/9991.

One thing that hasn't been changed was:

hasClass wouldn't work here, because the aim here is not to do an exact match, but to check whether there's a class that contains the string MsoListParagraph.

comment:32 Changed 13 months ago by m.lewandowski

  • you shouldn't create new CKEDITOR_MOCK global variable. Instead create CKEDITOR.plugins.pastefromword static namespace and put methods over there. However mark members that you don't want to be used outside of pfw as "@private".
    • Now within this namespace, create members like CKEDITOR.plugins.pastefromword.lists, where all list-related members should go.
    • createLists is too long and does too many things. You should extract multiple parts:
  • we can't be so eager to drop some attributes from resetStyles
    • white background - you could easily produce nested content that will be broken by this rule.

See whiteBg.png

  • this switch can be simplifed down to if. I can see that idea behind it might be for simplicity to extend down the road, but we can keep it simple here, refactorization will happen as needed.
  • let's simplify it down to normalizedStyles( element ) || false;
  • there area 2 "13590CkEditor-NumberList" directories in Tickets directory with a different casing. It looks like redundancy.
Last edited 13 months ago by m.lewandowski (previous) (diff)

Changed 13 months ago by m.lewandowski

comment:33 Changed 13 months ago by Tade0

Applied the changes from comment:32, also implemented the rest of the features that were required to cover all the closed tickets concerning PFW.

Some cases have inputs(chiefly from IE11) that still create conflicts. These tickets are: 7131, 7209, 7872, 7581("stupid list" case).

Most probably there is no real way to achieve the desired output in all these cases, so eventually we will have to decide which are more important and cover them.

Changes pushed to: branch:t/9991.

EDIT: rebased with major branch.

Last edited 13 months ago by Tade0 (previous) (diff)

comment:34 Changed 12 months ago by m.lewandowski

Note: It make sense to check if TC from #14544 is still valid here.

comment:35 Changed 9 months ago by j.swiderski

  • Description modified (diff)

Changed 9 months ago by m.lewandowski

List causing an exception

comment:36 Changed 9 months ago by m.lewandowski

I've been reviewing PFW again for a while now, and I found a case where pasting content would cause an exception. It's easy to fix for me, however I need to prepare proper tests to prevent us from regressions.

@Tade0 could you please provide some sort of instructions on generating a proper Word test?

I'm adding the docx fixture as an attachement ("broken-list.docx").

Changed 9 months ago by Tade0

Manual for the client side of the PFW fixture generator.

comment:37 Changed 9 months ago by Tade0

I've included an attachment that should serve as a guide how to generate tests.

comment:38 Changed 9 months ago by m.lewandowski

Test Custom_list_markers

I've fixed a bug where this TC would cause an exception. With that I discovered couple of things:

  • List should be converted to unordered list (it's identified as such in MS Word). Instead it's converted to two ordered lists, and list gets start value 19. That's actually a regression to a current implementation.
  • Since we're not supporting custom markers yet, we should filter out custom markers. Those could be recognized by:
    • Checking if src starts with "file://" protocol and have an alt equal to "*".

comment:39 follow-up: Changed 9 months ago by Tade0

Added this TC and made changes so that there is an unordered list there.

Changes pushed to branch:t/9991.

As for the src starting with file:// - regular images are also pasted like that, so it can't be counted as a distinguishing feature. That alt is uncommon enough though.

comment:40 in reply to: ↑ 39 ; follow-up: Changed 9 months ago by m.lewandowski

Replying to Tade0:

As for the src starting with file:// - regular images are also pasted like that, so it can't be counted as a distinguishing feature. That alt is uncommon enough though.

Yes, this is why both conditions must be met.

comment:41 in reply to: ↑ 40 Changed 9 months ago by Tade0

Replying to m.lewandowski:

Replying to Tade0:

As for the src starting with file:// - regular images are also pasted like that, so it can't be counted as a distinguishing feature. That alt is uncommon enough though.

Yes, this is why both conditions must be met.

Thing is, it's extremely hard not to meet that first condition, therefore it doesn't really check anything(it would still be possible to create a test case which would break here).

Alternatively, as an additional condition it could be checked whether the image is wrapped in a <span> with style="mso-list:Ignore" - that's a telltale sign that the image is used as a symbol.

comment:42 Changed 9 months ago by Tade0

Implemented the solution described in the previous comment.

Changes pushed to branch:t/9991.

comment:43 Changed 9 months ago by Tade0

Since the tests are prone to breaking due to the sheer amount of them I'll cut them into manageable pieces.

comment:44 Changed 5 months ago by m.lewandowski

  • Resolution set to fixed
  • Status changed from review to closed
Note: See TracTickets for help on using tickets.
© 2003 – 2016 CKSource – Frederico Knabben. All rights reserved. | Terms of use | Privacy policy