Opened 12 years ago
Closed 8 years ago
#9991 closed Task (fixed)
Paste from word should only normalize input data
Reported by: | Piotrek Koszuliński | Owned by: | Tade0 |
---|---|---|---|
Priority: | Normal | Milestone: | CKEditor 4.6.0 |
Component: | Plugin : Paste from Word | Version: | |
Keywords: | Drupal | Cc: | wim.leers@… |
Description (last modified by )
Transformations and clean up are done by #9829 and #9989. Thus, PFW has to be rewritten so it won't double these core features.
Spec: https://gist.github.com/Reinmar/cc012b20bf37be84a523 (WIP)
Issues fixed by new implementation
Attachments (3)
Change History (47)
comment:1 Changed 12 years ago by
Status: | new → confirmed |
---|
comment:2 Changed 12 years ago by
Type: | Bug → Task |
---|
comment:3 Changed 12 years ago by
Description: | modified (diff) |
---|
comment:4 Changed 12 years ago by
comment:5 Changed 12 years ago by
Milestone: | CKEditor 4.2 |
---|
This ticket is about rewriting PFW from scratch and this is immense amount of work, so it will not be included in 4.2.
comment:6 Changed 12 years ago by
Description: | modified (diff) |
---|
comment:8 Changed 12 years ago by
<td style="background:Yellow">...</td> or: <span style="background:Yellow">...</span>
We need transformation that will extract background-color which we understand, from background which we remove. We should add these transformations for more CSS properties. I'm not only sure how can we automatically create them from e.g. config.colorButton_backStyle
. We may need to add new settings for styles which would guide the filter.
Duplicate: #10364.
comment:9 Changed 12 years ago by
Another rooms for improvements can be find in the way we fixed #10348:
- transformations could accept elements list instead of one element only,
- transformations for align attribute on blocks is needed.
comment:11 Changed 11 years ago by
Milestone: | → CKEditor 4.4 |
---|
comment:12 Changed 11 years ago by
Component: | Core : Pasting → Plugin : Paste from Word |
---|
comment:13 Changed 11 years ago by
We’re not supposed to drastically change the way things are done right now. Most of the changes are to happen in the cleanup logic.
The following are some notes about how things (will) work.
- Basic workflow:
- Listens for the “paste” with priority “3”
- Sniffs the paste date for word stuff.
- If from word, calls CKEDITOR.cleanWord.
- Updates the paste event data with the cleaned data.
- It allows to have custom "cleanWord” implementations through configuration and on-demand file loading. I’m unsure about the point about this as one could eventually simply write a plugin that would override the default plugin behavior. Anyway, we don’t have to change this right now and it can be a v5 thing.
- The default cleanWord implementation will not do the deep cleanup it does today. It’ll instead “normalize” the HTML in a way that it has proper semantics, using proper HTML features. The cleanup part will be left to ACF. This is all about this ticket, in fact.
- The default cleanWord implementation is based on data processing, mainly.
- What to do with the current configurations:
- pasteFromWordPromptCleanup: I don’t feel that this is necessary any more… actually it may still exist but by default it may depend on ACF, if it’s enabled.
- pasteFromWordRemoveFontStyles: we can keep this
- pasteFromWordNumberedHeadingToList: ???
- pasteFromWordRemoveStyles: I have the impression that this is not needed any more. Only if we want to keep it so one could decide to have cleaner paste from Word, even if the editor support style features.
So the current idea is developing tests and write the cleanup logic from scratch, reusing what’s is reusable from the current implementation.
comment:14 Changed 11 years ago by
Any idea on the ETA?
This is one of the things we need to ship Drupal 8. There's no beta yet, so there's still some time, but we definitely need to close https://drupal.org/node/1955300 before release.
comment:15 Changed 11 years ago by
Milestone: | CKEditor 4.4 → CKEditor 4.5 |
---|
The ETA is 4.5 (so July or August - we may still change ETA for 4.5). More delay is very unlikely because of our plans regarding 4.5 and 5.
comment:16 Changed 10 years ago by
Milestone: | CKEditor 4.5.0 → CKEditor 4.6.0 |
---|
We're sad when we have to do that, but we weren't able to find resources for Paste From Word for the last couple of months. We have to release the drag and drop and file upload support which is ready for in ~75% as soon as possible, so I'm postponing PFW to the next major release.
comment:17 Changed 9 years ago by
This may be still a useful resource https://msdn.microsoft.com/en-us/library/Aa155477?f=255&MSPPError=-2147217396
comment:18 Changed 9 years ago by
Description: | modified (diff) |
---|
I wrote a first draft of the spec (or rather guidelines): https://gist.github.com/Reinmar/cc012b20bf37be84a523
comment:19 Changed 9 years ago by
Owner: | set to Tade0 |
---|---|
Status: | confirmed → assigned |
comment:21 Changed 9 years ago by
The status of the work is as follows:
A set of virtual machines has been deployed, containing the most crucial combinations of MS Word/IE versions. An application is available on each of the VMs that accepts *.docx files as an input and outputs the HTML that would normally be generated when the contents of a Word file are pasted into the editor. An additional command-line utility enables remote access to the applications.
The point of this exercise is to automatically generate fixtures for unit tests. A growing number of example *.docx files is now available for processing.
Next step: Creating the first batch of tests using the generated fixtures.
comment:22 Changed 9 years ago by
Status: | assigned → review |
---|
Status update:
Most of the core features of the plugin have been rewritten, as confirmed by 17 test cases(each one checked with Word 2007, 2013 and browsers: Chrome, Firefox, IE8, IE11).
comment:23 follow-up: 24 Changed 9 years ago by
Status: | review → review_failed |
---|
I made a quick review of the code that you pushed to branch:t/9991. Here's the result:
- The fixtures MS Word files are missing in the repo.
- We need instructions on how to generate HTML fixtures from the MS Word files (not public, because it requires our infrastructure).
- The branch must be rebased before putting it on review.
- Tests should not be inside
tests/tickets/9991
. The proper place is inside PFW plugin's dir. Note that they should be separeted from the tests for the legacy filter. - Tests for the previous filter should be tagged and configured to use the previous filter (they are failing right now).
- Remember about strict mode.
- Don't put a blank line at the beginning of a function or block (default.js L14, L336).
- Separate blocks with a blank line (applies to blocks like
elements
,elementNames
, all element names in default.js). setSymbol()
(switch block) – what if a list start with item "3" so nob.
, noII.
etc. Using regexp thancase item
statements will be more flexible.- L138
children[0]
=>children[ 0 ]
(no JSCS rule for that because the existing one is broken). - L159-177 - use one
nope
function.
However, more important is what has to be done with the task in general:
- We need to make sure that the new filter is at least as good as the old one. AFAICS the existing tests are oriented around the simple cases. We need to create tests for at least some of the fixed bugs in the old PFW filter. So the first thing that we need to do is to stabilise the new filter.
- The second thing is to actually implement the spec that I wrote (comment:18), because currently the new filter works in nearly the same manner as the old one, so it normalize the wordy-HTML to standard-HTML and then also it does the transformation and filtering which it should not do. So we need to take the current filter and move some of its code to ACF's transformations and filtering. This task can be done in a separate ticket, because this will be mostly transparent for the users. It will however make the PFW feature more configurable (through the ACF's settings).
comment:24 Changed 9 years ago by
Replying to Reinmar:
I made a quick review of the code that you pushed to branch:t/9991. Here's the result:
- The fixtures MS Word files are missing in the repo.
- We need instructions on how to generate HTML fixtures from the MS Word files (not public, because it requires our infrastructure).
- The branch must be rebased before putting it on review.
- Tests should not be inside
tests/tickets/9991
. The proper place is inside PFW plugin's dir. Note that they should be separeted from the tests for the legacy filter.- Tests for the previous filter should be tagged and configured to use the previous filter (they are failing right now).
- Remember about strict mode.
- Don't put a blank line at the beginning of a function or block (default.js L14, L336).
We need a JSCS rule for that then, spanning the whole project.
- Separate blocks with a blank line (applies to blocks like
elements
,elementNames
, all element names in default.js).
Same as above.
setSymbol()
(switch block) – what if a list start with item "3" so nob.
, noII.
etc. Using regexp thancase item
statements will be more flexible.
Simply using regexps doesn't solve this. What if a list starts with an "i."? No way of telling if it's a roman numeral or a letter. And this is only the tip of the iceberg. This function was designed for the limited scope that this iteration of PFW has. I'll correct it as soon as we have the proper test cases.
comment:25 Changed 9 years ago by
Note that introducing some JSCS rules may require fixing 1000 files. Although, simple stuff like this may, most likely, be fixed with automatic JSCS fixer.
comment:27 Changed 9 years ago by
Simply using regexps doesn't solve this. What if a list starts with an "i."? No way of telling if it's a roman numeral or a letter.
It's an edge case. Check first for roman literars and later for alpha numeric case and you'll match well in satisfying number of cases.
comment:28 Changed 9 years ago by
Status: | review_failed → review |
---|
Rebased and fixed. Changes pushed to branch:t/9991a.
comment:29 Changed 9 years ago by
Status: | review → review_failed |
---|
I've just checked the case from #7262 out of curiosity and the result is really bad in Chrome:
<pre style="text-align:justify"> <!--[if--><span style="font-family:symbol"><span style="font-size:12.0pt">· </span></span><!--[endif]--><span style="font-family:tahoma"><span style="font-size:12.0pt">Text1</span></span><!--![endif]--><!--![if--></pre> <p><!--[if--><!--[endif]--></p> <pre style="text-align:justify"> <!--[if--><span style="font-family:symbol"><span style="font-size:12.0pt">· </span></span><!--[endif]--><span style="font-family:tahoma"><span style="font-size:12.0pt">Text2</span></span><!--![endif]--><!--![if--></pre> <p><!--[if--><!--[endif]--></p> <pre style="text-align:justify"> <!--[if--><span style="font-family:symbol"><span style="font-size:12.0pt">· </span></span><!--[endif]--><span style="font-family:tahoma"><span style="font-size:12.0pt">Text3</span></span><!--![endif]--><!--![if--></pre> <p><!--[if--><!--[endif]--><!--![endif]--><!--![if--><!--![endif]--><!--![if--><!--![endif]--><!--![if--></p>
As I mentioned – we need A LOT more tests before we're able to accept this new filter.
I remember we were discussing generating fixtures for all the docx files that we have. How's that work going?
comment:30 Changed 9 years ago by
Problem with leaking comments seems to be fixed. Please cleanup unused branches on remote, rename t/9991b
to t/9991
and remove all other branches from the remote.
Did some reviewing last couple of days and here are the notes so far:
problems:
- failing tests,
- font plugin:
- ACF selectors must be as narrow as possible - this one essentially allows to put any mess inside span[style]. Same thing for other plugins.
- add new transformation to existing font plugin features instead defining new entries
- transformations should ensure that target transformation can be applied. Assune a situation when ACF is set in a way that it allows
font[http://docs.ckeditor.com/#!/api/CKEDITOR.filter-method-addTransformations face]
but does not allowspan{font-family}
with current implementation it would be converted to span (no matter what) and then... removed by ACF. You can do this with [check
property].
- plugin.js it should assign
element.styles[ 'text-align' ]
, - default.js use hasClass instead,
- default.js - I believe returning false does filter out the element, so you don't need to rely on internals (so that removing element name deletes the elem),
style:
- empty comment in default.js,
- instead disabling default.js jshint error use
element.attributes[ 'class' ]
for reserved keywords, - default.js always use braces +
convertToFakeListItem
should go to next line, - simply use ifs instead of doing one-liner expression like in plugin.js
comment:31 Changed 9 years ago by
Status: | review_failed → review |
---|
Fixed, changes pushed to branch:t/9991.
One thing that hasn't been changed was:
- default.js use hasClass instead,
hasClass
wouldn't work here, because the aim here is not to do an exact match, but to check whether there's a class that contains the string MsoListParagraph
.
comment:32 Changed 9 years ago by
- you shouldn't create new
CKEDITOR_MOCK
global variable. Instead createCKEDITOR.plugins.pastefromword
static namespace and put methods over there. However mark members that you don't want to be used outside of pfw as "@private".- Now within this namespace, create members like
CKEDITOR.plugins.pastefromword.lists
, where all list-related members should go. createLists
is too long and does too many things. You should extract multiple parts:- fake list item transfomration part should be extracted to
retainListItems
where it would returnlistElements
arary. - list grouping code should also be extracted into a separate member.
- fake list item transfomration part should be extracted to
- Now within this namespace, create members like
- we can't be so eager to drop some attributes from
resetStyles
- white background - you could easily produce nested content that will be broken by this rule.
See whiteBg.png
- same thing goes for font color
- this switch can be simplifed down to
if
. I can see that idea behind it might be for simplicity to extend down the road, but we can keep it simple here, refactorization will happen as needed. - let's simplify it down to
normalizedStyles( element ) || false;
- there area 2 "13590CkEditor-NumberList" directories in Tickets directory with a different casing. It looks like redundancy.
Changed 9 years ago by
Attachment: | whiteBg.png added |
---|
comment:33 Changed 9 years ago by
Applied the changes from comment:32, also implemented the rest of the features that were required to cover all the closed tickets concerning PFW.
Some cases have inputs(chiefly from IE11) that still create conflicts. These tickets are: 7131
, 7209
, 7872
, 7581
("stupid list" case).
Most probably there is no real way to achieve the desired output in all these cases, so eventually we will have to decide which are more important and cover them.
Changes pushed to: branch:t/9991.
EDIT: rebased with major
branch.
comment:34 Changed 9 years ago by
Note: It make sense to check if TC from #14544 is still valid here.
comment:35 Changed 8 years ago by
Description: | modified (diff) |
---|
comment:36 Changed 8 years ago by
I've been reviewing PFW again for a while now, and I found a case where pasting content would cause an exception. It's easy to fix for me, however I need to prepare proper tests to prevent us from regressions.
@Tade0 could you please provide some sort of instructions on generating a proper Word test?
I'm adding the docx fixture as an attachement ("broken-list.docx").
Changed 8 years ago by
Attachment: | fixture-utility-manual.md added |
---|
Manual for the client side of the PFW fixture generator.
comment:37 Changed 8 years ago by
I've included an attachment that should serve as a guide how to generate tests.
comment:38 Changed 8 years ago by
Test Custom_list_markers
I've fixed a bug where this TC would cause an exception. With that I discovered couple of things:
- List should be converted to unordered list (it's identified as such in MS Word). Instead it's converted to two ordered lists, and list gets start value
19
. That's actually a regression to a current implementation. - Since we're not supporting custom markers yet, we should filter out custom markers. Those could be recognized by:
- Checking if src starts with "file://" protocol and have an alt equal to "
*
".
- Checking if src starts with "file://" protocol and have an alt equal to "
comment:39 follow-up: 40 Changed 8 years ago by
Added this TC and made changes so that there is an unordered list there.
Changes pushed to branch:t/9991.
As for the src
starting with file://
- regular images are also pasted like that, so it can't be counted as a distinguishing feature. That alt
is uncommon enough though.
comment:40 follow-up: 41 Changed 8 years ago by
Replying to Tade0:
As for the
src
starting withfile://
- regular images are also pasted like that, so it can't be counted as a distinguishing feature. Thatalt
is uncommon enough though.
Yes, this is why both conditions must be met.
comment:41 Changed 8 years ago by
Replying to m.lewandowski:
Replying to Tade0:
As for the
src
starting withfile://
- regular images are also pasted like that, so it can't be counted as a distinguishing feature. Thatalt
is uncommon enough though.Yes, this is why both conditions must be met.
Thing is, it's extremely hard not to meet that first condition, therefore it doesn't really check anything(it would still be possible to create a test case which would break here).
Alternatively, as an additional condition it could be checked whether the image is wrapped in a <span>
with style="mso-list:Ignore"
- that's a telltale sign that the image is used as a symbol.
comment:42 Changed 8 years ago by
Implemented the solution described in the previous comment.
Changes pushed to branch:t/9991.
comment:43 Changed 8 years ago by
Since the tests are prone to breaking due to the sheer amount of them I'll cut them into manageable pieces.
comment:44 Changed 8 years ago by
Resolution: | → fixed |
---|---|
Status: | review → closed |
Fixed with git:f2af0353f8fb20231c7ca6de080c6cad273195eb merged to major.
Adding #10267 as a test case for this issue.