Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#13105 closed Bug (fixed)

Rich attributes + DX WTF: CKEDITOR.tools.html(Encode|Decode)()

Reported by: Wim Leers Owned by: Piotrek Koszuliński
Priority: Normal Milestone: CKEditor 4.5.0
Component: General Version:
Keywords: Drupal Cc: wim.leers@…

Description

Rich attributes

This problem was discovered by using "rich attributes" (IIRC that's how @Reinmar described it): CKEditor Widgets whose data is stored in data- attributes.

Drupal 8's drupalimagecaption plugin extends CKEditor 4's image2 plugin, and instead of storing the caption in a <figcaption> 'tag', it stores it in a data-caption 'attribute'. This is where things go wrong.

Whenever the user writes plain HTML tags in the caption (i.e. in the widget's "caption" box enter something like <a href="http://drupal.org">test</a>), the data stored in the data-caption attribute is only HTML-encoded once, instead of twice, which upon upcasting then doesn't result in the entered HTML showing up, but in the 'rendered' HTML showing up.

See https://www.drupal.org/node/2460145.

I was convinced this was actually a bug on the Drupal side. So I started using CKEDITOR.tools.html(En|De)code(Attr)(), but… doing so uncovered a whole new problem…

DX WTF: CKEDITOR.tools.html(En|De)code(Attr)()

Copy/pasted from the Chrome developer console, to demonstrate a bug.

var e = CKEDITOR.tools.htmlEncode
undefined

var d = CKEDITOR.tools.htmlDecode
undefined

e('<tag>')
"&lt;tag&gt;"

e(e('<tag>'))
"&amp;lt;tag&amp;gt;"

d(e('<tag>'))
"<tag>"

d(e(e('<tag>')))
"<tag>"

d(e(e(e(e('<tag>')))))
"&amp;amp;lt;tag&amp;amp;gt;"

d(e(e(e('<tag>'))))
"&amp;lt;tag&amp;gt;"

d(e(e('<tag>')))
"<tag>"

If this is by design, the documentation should warn developers. And it should be explained why.

Change History (13)

comment:1 Changed 9 years ago by Piotrek Koszuliński

I pushed branch:t/13105 with tests how a complicated attribute:

'<p x="1&quot;2">&lt;a y="3"&gt;&amp;&lt;/a&gt;</p>'

set inside data processor, while data is in pseudo-DOM format, is preserved when loading data retrieved from such editor. This test fails baaaadly.

More ideas for tests:

  • set attributes in the DOM and do the same - get and set data.
  • test wysiwygarea (it's different way of setting editable's content).

comment:2 Changed 9 years ago by Piotrek Koszuliński

I pushed more commits to the branch:t/13105. I fixed the issue with decoding and pushed more tests for decodeAttr which apparently works ok, but tests in http://tests.ckeditor.dev:1030/tests/tickets/13105/1 still fail.

comment:3 Changed 9 years ago by Jakub Ś

Status: newconfirmed

comment:4 Changed 9 years ago by Piotrek Koszuliński

Owner: set to Piotrek Koszuliński
Status: confirmedassigned

comment:6 Changed 9 years ago by Piotrek Koszuliński

I pushed more commits to t/13105 which fix first two test files.

However, the problem on IE is that an attribute '&#40;' is encoded to '&amp;40;' and this isn't correct. I have no idea though, where it happens in the code...

comment:7 Changed 9 years ago by Piotrek Koszuliński

On IE:

CKEDITOR.instances.editor1.dataProcessor.toHtml( '<p><a href="a\nb">x</a></p>' );
<p><a data-cke-saved-href="a&amp;#10;b" href="a&amp;#10;b">x</a></p>

And it actually makes a perfect sense. When parsing the tree we find an attribute 'a&#10;b' (it's already changed, because the input HTML is passed through a DOM element) and we try to decode it. However, we can decode only some entities (see tools.htmlDecodeAttr()), so the string isn't changed.

Then, it lands in htmlParser.element.attributes.href with the initial value = 'a&#10;b'.

Finally, when writing the tree down to an HTML string all attributes are encoded... hence 'a&amp;#10;b'.

So it seems that we must improve the decode function to decode numeric entities. This should not be a big problem, because we can use String.fromCodePoint().

comment:8 Changed 9 years ago by Piotrek Koszuliński

I've made a small research about what characters are being encoded and how when put into a text node or attribute. Results:

Observations for HTML text nodes:

  • We can ignore what happens with the null character (0).
  • The change from \r (13) to \n (10) is fortunately safe.
  • Then go 3 entities which we do decode in HTML – &lt;, &gt; and &amp;.
  • Then goes &nbsp; which we do not decode but which we should.
  • That would be all for normal browsers, but then go IEs.
  • Disappearance of few characters from range 10-13 and 32 can be ignored – some lost white spaces. Well... unless they matter for the final DOM structure and that happens (x<b> </b>y) ;|.
  • Disappearance of char 60 (<) could be understood (it causes HTML parse error), but that of char 38 (&) is weird... Good that both are encoded by our methods. Apparently that preserves them.
  • And then goes &shy; – definitely worth decoding.
  • And finally few lost chars from the end of the range can be ignored.

Observations for attributes:

  • Many similarities to text nodes of course.
  • &nbsp; and &shy; must be decoded.
  • Chars from range 1-31 are encoded by IEs using the decimal entity. All of those can be easily decoded using String.fromCharCode().

comment:9 Changed 9 years ago by Piotrek Koszuliński

I realised one additional thing. The role of htmlDecode() and htmlDecodeAttr() is to decode everything that could be encoded by the browsers. But nothing limits these functions from decoding more and this means that htmlDecode() can also decode &quot; leaving htmlDecodeAttr() unnecessary. And this is what I did in the final commit.

What's more, I also decided to go with a single regexp pass. Not for performance reasons (although I tested if there's no surprise waiting), but to avoid issues like Wim initially reported – that multiple separate passes will double decode something. And the risk of double decoding returned when I added numeric entities decoding in the second to last commit, because it's impossible to say which entities should be decoded first.

One more thing, if that's not clear:

  • htmlEncode/htmlEncodeAttr's role is to encode characters that may be unsafe for HTML or which may be lost (as it happens on IE8 - see comment:8). This is the smallest possible subset. There's no sense in adding more, because people may not want other entities to be encoded. This should be a configurable behaviour of some additional plugins like the entities plugin.
  • htmlDecode/htmlDecodeAttr are more free in this case, because they simply need to decode everything that a browser may encode. Hence, we needed the research made in comment:8 and some extensions that this ticket adds.

comment:10 Changed 9 years ago by Piotrek Koszuliński

Milestone: CKEditor 4.4.8
Status: assignedreview
Version: 4.4.7

Pushed branch:t/13105. Have fun :)!

comment:11 Changed 9 years ago by Olek Nowodziński

Status: reviewreview_passed

comment:12 Changed 9 years ago by Olek Nowodziński

Milestone: CKEditor 4.4.8CKEditor 4.5.0
Resolution: fixed
Status: review_passedclosed

Closed with git:2c45c7636660f3 in major. I decided to merge it into 4.5.0 rather than 4.4.8 because the changes may have some impact on 3rd party plugins and integrations.

comment:13 Changed 9 years ago by Wim Leers

Late note: I was able to close https://www.drupal.org/node/2460145 thanks to this — thanks again! :)

Note: See TracTickets for help on using tickets.
© 2003 – 2022, CKSource sp. z o.o. sp.k. All rights reserved. | Terms of use | Privacy policy