#13174 closed Bug (fixed)
Data loss after pasting from MS Word (with FF/Chrome under Windows only)
Reported by: | emanuel | Owned by: | |
---|---|---|---|
Priority: | Normal | Milestone: | CKEditor 4.6.0 |
Component: | Plugin : Paste from Word | Version: | 4.0.1 |
Keywords: | Cc: | satya_minnekanti@… |
Description
There is a serious bug if you copy from MS Word and paste the content into the latest stable CKEditor (4.4.7) with Firefox or Chrome under Windows!
You can easily reproduce it:
- Open CKEditor in Firefox or Chrome on Windows.
- Copy all content from the attached docx test document opened with MS Word (I tested with Word 2010). My example (the leading quotation mark is important!):
Test A
„Test B“
Test C
- Paste into CKEditor (via CTRL+V or "from word" or "as text")
=> 2nd line was dropped after pasting!
See the HTML source code view:
<p>Test A</p> <ul> <li> </li> </ul> <p>Test C</p>
The problem occurs with Firefox (37.0.x) and Chrome (41.0.x) under Windows. IE11 works fine without problems. There are also no problems (with Firefox and Chrome) if you copy the content from LibreOffice instead of MS Word.
This problem is very serious because you assume that the content was pasted!
Attachments (3)
Change History (18)
Changed 10 years ago by
Attachment: | Testdocument.docx added |
---|
comment:1 Changed 10 years ago by
comment:2 Changed 10 years ago by
Cc: | satya_minnekanti@… added |
---|
comment:3 Changed 10 years ago by
Status: | new → confirmed |
---|---|
Version: | 4.4.7 → 4.0.1 |
Problem can be reproduced from CKEditor 4.0.1 in all browsers except IE11.
comment:4 Changed 10 years ago by
Any idea for a hot fix? Any milestone scheduled?
I think this issue is critical. It makes CKEditor unusable for users who have to copy/paste from MS Word into CKEditor.
comment:6 Changed 10 years ago by
Owner: | set to Piotrek Koszuliński |
---|---|
Status: | confirmed → assigned |
comment:7 Changed 10 years ago by
Status: | assigned → review |
---|
This of course proved to be a very difficult ticket. Not because finding the bug in the code was hard, but because what MS Word (or browsers - dunno) do is so ridiculous and illogical that it's hard to do anything with that. For example - the HTML that we get when pasting the content of the attached file is marked with MsoListParagraph classes, which are used to mark paragraphs that are lists in MSWord. So #6662 introduced a heuristic which looks for a textual representation of a list bullet in such paragraphs, but there are so many different list types that writing a reliable heuristic for guessing whether some characters are bullets is impossible. Reasoning from the structure of these paragraphs is also very hard because I saw countless variations, depending on block styling, inline styling, list structure, version of MS Word, browser and phase of the moon. Madness.
As always in such scenarios, minimal changes are best, so I decided to scan the part of the paragraph that follows something that we think is a bullet (in this specific case it was a quote character, but it could be pretty much anything). If there's no content, this isn't a list item. This patch will obviously fail when:
- MSWord decides to complicate the structure of paragraphs even more, so that there is some content after the container of what we think is a list bullet.
- There is an empty list item. This sounds serious, but from what I saw it's not a problem, because there's always something in an empty list item - e.g.:
<!--[endif]--><span lang="DE" style="font-size:12.0pt; line-height:115%;font-family:Arial"><o:p> </o:p>
Pushed branch:t/13174.
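The check described in this comment can be sketched roughly as follows. This is not the code from branch:t/13174. It is an illustration in Python, and the helper names (has_content_after, looks_like_list_item), the sample markup, and the choice of treating the first child element as the suspected bullet container are all assumptions made for the example. "Content" is assumed here to mean either non-whitespace text directly after the candidate or any following sibling element.

```python
import xml.etree.ElementTree as ET


def has_content_after(paragraph, candidate):
    """Return True if anything follows `candidate` inside `paragraph`.

    Assumption: "content" is non-whitespace text right after the
    candidate, or any following sibling element.
    """
    children = list(paragraph)
    index = children.index(candidate)
    if (candidate.tail or "").strip():
        return True
    return index + 1 < len(children)


def looks_like_list_item(paragraph):
    """Treat the first child element as the suspected bullet container."""
    children = list(paragraph)
    if not children:
        return False
    return has_content_after(paragraph, children[0])


# Hypothetical markup: a plain quoted paragraph that Word marked as a list.
quoted = ET.fromstring('<p class="MsoListParagraph"><span>„Test B“</span></p>')
# Hypothetical markup: a bullet container followed by the item text.
bulleted = ET.fromstring(
    '<p class="MsoListParagraph"><span>·</span><span>Item text</span></p>'
)

print(looks_like_list_item(quoted))    # False: nothing follows the "bullet"
print(looks_like_list_item(bulleted))  # True: item text follows the bullet
```

Under these assumptions the quoted paragraph is no longer turned into an empty list item, because nothing follows the element that was mistaken for a bullet.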
comment:8 Changed 10 years ago by
Status: | review → review_failed |
---|
Wow, that's crazy. In fact, the data we get is incorrect from the very beginning. I believe that it is Word that gives us wrong data, but it could be a browser too. These paragraphs are marked as list items (they have MsoListParagraph* classes), so it is very hard to realize that they are not list items.
Anyway, the fix is not good enough. It is enough to put anything after the text node (an image or a shape) and it stops working. Another example is to have bold text at the beginning (a structure like this: <b>Test</b> C) - and the paragraph is also replaced with a list item ("Test" became a bullet, because it was not the last element).
The real issue here is losing data. We take the parent of the text node which should be a bullet and remove it, because we assume that it is a <span> with the bullet character, but in these cases it could be a whole bolded text or a whole paragraph. Because of the custom bullet types we are not able to recognize the bullet, but we can assume that it is no longer than 5 characters. With this assumption, even if we incorrectly recognize something as a bullet, we will at most break a very small part of the document.
Changed 10 years ago by
Attachment: | Testdocument2.docx added |
---|
comment:9 Changed 10 years ago by
Status: | review_failed → review |
---|
I pushed improvements to branch:t/13174. Now we check that the bullet text is shorter than 4 characters. Together with the check I added previously, I think that most cases are handled. If there's a real mess in the pasted HTML that causes short text nodes to appear, then they will still be lost, but we cannot do more.
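A rough sketch of the two checks combined, under stated assumptions: it is not the code from the branch, the names (MAX_BULLET_LENGTH, has_content_after, looks_like_list_item) and the sample markup are hypothetical, and the bullet text is assumed to be the text directly inside the first child element.

```python
import xml.etree.ElementTree as ET

# The bullet text must be shorter than this many characters.
MAX_BULLET_LENGTH = 4


def has_content_after(paragraph, candidate):
    """True if text or a sibling element follows `candidate` (assumed rule)."""
    children = list(paragraph)
    index = children.index(candidate)
    return bool((candidate.tail or "").strip()) or index + 1 < len(children)


def looks_like_list_item(paragraph):
    """Apply both checks to the first child element of the paragraph."""
    children = list(paragraph)
    if not children:
        return False
    candidate = children[0]
    bullet_text = (candidate.text or "").strip()
    return (
        len(bullet_text) < MAX_BULLET_LENGTH
        and has_content_after(paragraph, candidate)
    )


# Hypothetical paragraphs, all marked by Word as list paragraphs.
bold_start = ET.fromstring('<p class="MsoListParagraph"><b>Test</b> C</p>')
real_item = ET.fromstring(
    '<p class="MsoListParagraph"><span>1.</span><span>Item</span></p>'
)
short_word = ET.fromstring('<p class="MsoListParagraph"><b>foo</b> bar</p>')

print(looks_like_list_item(bold_start))  # False: "Test" has 4 characters
print(looks_like_list_item(real_item))   # True: short bullet, text follows
print(looks_like_list_item(short_word))  # True: a short word is still misread
```

The last case illustrates the remaining limitation: a short leading text node is still taken for a bullet and would be dropped.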
PS. We'll work on a new MSWord filter soon and all this code will be gone. It is very likely that we will be able to find a better heuristic once we have a clear picture.
comment:10 Changed 10 years ago by
Status: | review → review_failed |
---|
It still does not work well enough. You check the length of the bulletText, but it is the bullet that is replaced. So it is still possible to have a big paragraph removed on paste, see attachment Testdocument3.docx.
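The gap between what is measured and what is removed can be illustrated with a small sketch. The markup below is hypothetical (it is not taken from Testdocument3.docx), and checked_text and removed_text are names invented for the example: the first stands for the short text node whose length is checked, the second for everything inside the element that actually gets removed.

```python
import xml.etree.ElementTree as ET


def checked_text(bullet):
    """Text node directly inside the bullet element (what the check measures)."""
    return bullet.text or ""


def removed_text(bullet):
    """All text inside the bullet element (what is lost when it is removed)."""
    return "".join(bullet.itertext())


# Hypothetical paragraph: a one-character text node, then a long nested run.
paragraph = ET.fromstring(
    '<p class="MsoListParagraph">'
    '<span>A<b> long sentence that is not a bullet at all</b></span>'
    '<span>tail</span>'
    '</p>'
)
bullet = list(paragraph)[0]

print(len(checked_text(bullet)))  # 1, so a "shorter than 4" check passes
print(removed_text(bullet))       # A long sentence that is not a bullet at all
```

Measuring the text of the whole element instead would close this particular gap, though short words would still be misread as bullets.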
Changed 10 years ago by
Attachment: | Testdocument3.docx added |
---|
comment:11 Changed 10 years ago by
Milestone: | CKEditor 4.4.8 |
---|---|
Owner: | Piotrek Koszuliński deleted |
Status: | review_failed → confirmed |
Right... ok, this is enough. We can of course try to count the length of all text nodes inside the bullet, but then someone will report that the word "foo" is removed, because we check the length. And so on and so on. I'm removing the milestone.
comment:12 Changed 9 years ago by
Any idea how to resolve this nasty bug soon? (which milestone?) Maybe I can help testing... Thx.
comment:13 Changed 9 years ago by
No, we don't have an idea how to fix it without breaking other things. HTML that we get from MS Word is too bad in this case. You can try the hack from comment:7 (see branch:t/13174), maybe it will work for your case.
comment:14 Changed 8 years ago by
Resolution: | → fixed |
---|---|
Status: | confirmed → closed |
Fixed with new Paste From Word plugin in 4.6.0.
comment:15 Changed 8 years ago by
Milestone: | → CKEditor 4.6.0 |
---|
Looks similar to issue #10784.