Opened 9 years ago

Closed 7 years ago

Last modified 7 years ago

#13174 closed Bug (fixed)

Data loss after pasting from MS Word (with FF/Chrome under Windows only)

Reported by: emanuel Owned by:
Priority: Normal Milestone: CKEditor 4.6.0
Component: Plugin : Paste from Word Version: 4.0.1
Keywords: Cc: satya_minnekanti@…

Description

There is a serious bug if you copy from MS Word and paste the content in latest stable CKEditor (4.4.7) with Firefox or Chrome under Windows!

You can easily reproduce it:

  1. Open CKEditor in Firefox or Chrome on Windows.
  1. Copy all content from the attached docx test document opened with MS Word (I tested with Word 2010). My example (the opening quotation mark is important!):
    Test A
    „Test B“
    TestC
    
  1. Paste into CKEditor (via CTRL+V or "from word" or "as text")

=> 2nd line was dropped after pasting!

See the HTML source code view:

<p>Test A</p>
<ul>
    <li>&nbsp;</li>
</ul>
<p>Test C</p>

The problem occurs with Firefox (37.0.x) and Chrome (41.0.x) under Windows. IE11 works fine without problems. There are also no problems (with Firefox and Chrome) if you copy the content from LibreOffice instead of from MS Word.

This problem is very serious because you assume that the content was pasted!

Attachments (3)

Testdocument.docx (12.9 KB) - added by emanuel 9 years ago.
Testdocument2.docx (15.3 KB) - added by Piotr Jasiun 9 years ago.
Testdocument3.docx (15.6 KB) - added by Piotr Jasiun 9 years ago.


Change History (18)

Changed 9 years ago by emanuel

Attachment: Testdocument.docx added

comment:1 Changed 9 years ago by emanuel

Looks similar to issue #10784.

comment:2 Changed 9 years ago by Satya Minnekanti

Cc: satya_minnekanti@… added

comment:3 Changed 9 years ago by Jakub Ś

Status: new → confirmed
Version: 4.4.7 → 4.0.1

The problem can be reproduced starting from CKEditor 4.0.1 in all browsers except IE11.

comment:4 Changed 9 years ago by emanuel

Any idea for a hot fix? Any milestone scheduled?

I think this issue is critical. It makes CKEditor unusable for users who have to copy/paste from MS Word into CKEditor.

comment:5 Changed 9 years ago by Piotrek Koszuliński

Milestone: CKEditor 4.4.8

We'll try to fix this issue in 4.4.8.

comment:6 Changed 9 years ago by Piotrek Koszuliński

Owner: set to Piotrek Koszuliński
Status: confirmed → assigned

comment:7 Changed 9 years ago by Piotrek Koszuliński

Status: assigned → review

This of course proved to be a very difficult ticket. Not because finding the bug in the code was hard, but because what MS Word (or the browsers - dunno) does is so ridiculous and illogical that it's hard to do anything about it. For example, the HTML that we get when pasting the content of the attached file is marked with MsoListParagraph classes, which are used to mark paragraphs that are lists in MS Word. So #6662 introduced a heuristic which looks for a textual representation of a list bullet in such paragraphs, but there are so many different list types that writing a reliable heuristic for guessing whether some characters are bullets is impossible. Reasoning from the structure of these paragraphs is also very hard because I saw countless variations, depending on block styling, inline styling, list structure, version of MS Word, browser and phase of the moon. Madness.

As always in such scenarios, minimal changes are best, so I decided to scan the part of the paragraph that follows something that we think is a bullet (in this specific case it was a quote character, but it could be pretty much anything). If there's no content, this isn't a list item. This patch will obviously fail when:

  • MSWord decides to complicate the structure of paragraphs even more, so that there is some content after the container of what we think is a list bullet.
  • There is an empty list item. This sounds serious, but from what I saw it's not a problem, because there's always something in an empty list item, e.g.:
<!--[endif]--><span lang="DE" style="font-size:12.0pt;
line-height:115%;font-family:Arial"><o:p>&nbsp;</o:p>
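
The check described in this comment can be sketched roughly as follows. This is only an illustration, not the code from branch:t/13174: the `PNode` model, the `txt`/`el` helpers, and the names `isContent` and `looksLikeListItem` are hypothetical, and the rule that any element or non-empty text node counts as content is an assumption based on the empty-item example above.

```typescript
// Hypothetical, simplified node model (not CKEditor's htmlParser classes).
type PNode =
  | { kind: "text"; value: string }
  | { kind: "element"; name: string; children: PNode[] };

const txt = (value: string): PNode => ({ kind: "text", value });
const el = (name: string, children: PNode[]): PNode => ({ kind: "element", name, children });

// Assumption: any element, or any non-empty text node, counts as content.
function isContent(node: PNode): boolean {
  return node.kind === "element" || node.value.length > 0;
}

// The paragraph is treated as a list item only if something follows the
// child that is suspected to hold the bullet.
function looksLikeListItem(children: PNode[], bulletIndex: number): boolean {
  return children.slice(bulletIndex + 1).some(isContent);
}

// The quoted line from the report: all of its text sits in the suspected bullet.
const quotedLine = [el("span", [txt("„Test B“")])];
// A genuine item: a bullet container followed by the item text.
const realItem = [el("span", [txt("·")]), el("span", [txt("Item")])];
// An empty item still carries a wrapper holding a non-breaking space.
const emptyItem = [el("span", [txt("·")]), el("span", [el("o:p", [txt("\u00a0")])])];

const quotedIsItem = looksLikeListItem(quotedLine, 0); // false
const realIsItem = looksLikeListItem(realItem, 0);     // true
const emptyIsItem = looksLikeListItem(emptyItem, 0);   // true
```

In this sketch the quoted line is no longer converted to a list item, while the empty item still is, because its wrapper element follows the bullet container.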

Pushed branch:t/13174.

comment:8 Changed 9 years ago by Piotr Jasiun

Status: review → review_failed

Wow, that's crazy. In fact, the data we get is incorrect from the very beginning. I believe that it is Word that gives us wrong data, but it could be the browser too. These paragraphs are marked as list items (they have MsoListParagraph* classes), so it is very hard to tell that they are not list items.

Anyway, the fix is not good enough. It is enough to put anything after the text node (an image or a shape) and it stops working. Another example is to have bold text at the beginning (a structure like this: <b>Test</b> C): the paragraph is also replaced with a list item ("Test" became a bullet, because it was not the last element).

The real issue here is losing data. We take the parent of the text node that should be a bullet and remove it, because we assume that it is a <span> with the bullet character, but in these cases it could be a whole bolded text or a whole paragraph. Because of the custom bullet types we are not able to recognize the bullet, but we can assume that it is no longer than 5 characters. With this assumption, even if we incorrectly recognize something as a bullet, we will at most break a very small part of the document.
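
A rough sketch of the proposed length cap, under stated assumptions: the `PNode` model, the helpers, and `removeBulletCandidate` are hypothetical names for illustration only, and the limit of 5 characters is taken from the paragraph above. The point it shows is that a misdetected candidate costs at most a few characters, while a long candidate is left alone.

```typescript
// Hypothetical, simplified node model (not CKEditor's htmlParser classes).
type PNode =
  | { kind: "text"; value: string }
  | { kind: "element"; name: string; children: PNode[] };

const txt = (value: string): PNode => ({ kind: "text", value });
const el = (name: string, children: PNode[]): PNode => ({ kind: "element", name, children });

// Concatenate all text found under a node.
function textOf(node: PNode): string {
  return node.kind === "text" ? node.value : node.children.map(textOf).join("");
}

const MAX_BULLET_LENGTH = 5; // the limit suggested in the comment above

// Returns the children with the bullet candidate removed, or the untouched
// array when the candidate is too long to be a plausible bullet.
function removeBulletCandidate(children: PNode[], index: number): PNode[] {
  const candidate = children[index];
  if (candidate === undefined || textOf(candidate).length > MAX_BULLET_LENGTH) {
    return children;
  }
  return children.filter((_, i) => i !== index);
}

// "<b>Test</b> C": "Test" is short, so it is still (wrongly) removed,
// but only four characters are lost.
const boldStart = [el("b", [txt("Test")]), txt(" C")];
const afterBoldStart = removeBulletCandidate(boldStart, 0);

// A whole bolded paragraph is too long to be a bullet, so nothing is removed.
const wholeBold = [el("b", [txt("A whole bolded paragraph")]), txt(" ")];
const afterWholeBold = removeBulletCandidate(wholeBold, 0);
```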

Last edited 9 years ago by Piotrek Koszuliński

Changed 9 years ago by Piotr Jasiun

Attachment: Testdocument2.docx added

comment:9 Changed 9 years ago by Piotrek Koszuliński

Status: review_failed → review

I pushed improvements to branch:t/13174. Now we check that the bullet text is shorter than 4 characters. Together with the check I added previously, I think that most cases are handled. If there's a real mess in the pasted HTML that causes short text nodes to appear, then they will still be lost, but we cannot do more.

PS. We'll work on a new MSWord filter soon and all this code will be gone. It is very likely that we will be able to find a better heuristic once we have a clear picture.

comment:10 Changed 9 years ago by Piotr Jasiun

Status: review → review_failed

It still does not work well enough. You check the length of the bulletText, but it is the bullet that is replaced. So it is still possible to have a big paragraph removed on paste; see attachment Testdocument3.docx.
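
The mismatch can be illustrated with a small sketch. The names `bulletText` and `bullet` follow the wording of this comment, but the node model, the two functions, and the sample markup are hypothetical; the actual structure in Testdocument3.docx may differ. The first check measures only the matched text node, the second measures everything inside the element that would actually be removed.

```typescript
// Hypothetical, simplified node model (not CKEditor's htmlParser classes).
type TextNode = { kind: "text"; value: string };
type ElementNode = { kind: "element"; name: string; children: PNode[] };
type PNode = TextNode | ElementNode;

// Concatenate all text found under a node.
function textOf(node: PNode): string {
  return node.kind === "text" ? node.value : node.children.map(textOf).join("");
}

const LIMIT = 4; // "shorter than 4 characters", per comment:9

// Mirrors the reported flaw: only the matched text node is measured,
// although its whole parent element is what gets removed.
function wouldRemoveByTextNode(text: TextNode): boolean {
  return text.value.length < LIMIT;
}

// Alternative: measure all text inside the element that would be removed.
function wouldRemoveByElement(element: ElementNode): boolean {
  return textOf(element).length < LIMIT;
}

// Assumed shape: a short text node sharing its parent with a long run of text.
const bulletText: TextNode = { kind: "text", value: "1." };
const bullet: ElementNode = {
  kind: "element",
  name: "span",
  children: [bulletText, { kind: "text", value: " A long paragraph that must survive." }],
};

const flawed = wouldRemoveByTextNode(bulletText); // true: the whole span would be dropped
const stricter = wouldRemoveByElement(bullet);    // false: the span is kept
```

The stricter variant still removes any element whose total text is shorter than the limit, so short legitimate content can be lost either way.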

Last edited 9 years ago by Piotr Jasiun

Changed 9 years ago by Piotr Jasiun

Attachment: Testdocument3.docx added

comment:11 Changed 9 years ago by Piotrek Koszuliński

Milestone: CKEditor 4.4.8 deleted
Owner: Piotrek Koszuliński deleted
Status: review_failed → confirmed

Right... ok, this is enough. We can of course try to count the length of all text nodes inside the bullet, but then someone will report that the word "foo" is removed because we check the length. And so on and so on. I'm removing the milestone.

comment:12 Changed 9 years ago by emanuel

Any idea how to resolve this nasty bug soon? (which milestone?) Maybe I can help testing... Thx.

comment:13 Changed 9 years ago by Piotrek Koszuliński

No, we don't have an idea how to fix it without breaking other things. The HTML that we get from MS Word is too broken in this case. You can try the hack from comment:7 (see branch:t/13174); maybe it will work for your case.

comment:14 Changed 7 years ago by Tade0

Resolution: fixed
Status: confirmed → closed

Fixed with new Paste From Word plugin in 4.6.0.

comment:15 Changed 7 years ago by Anna Tomanek

Milestone: CKEditor 4.6.0