#13105 closed Bug (fixed)
Rich attributes + DX WTF: CKEDITOR.tools.html(Encode|Decode)()
Reported by: | Wim Leers | Owned by: | Piotrek Koszuliński |
---|---|---|---|
Priority: | Normal | Milestone: | CKEditor 4.5.0 |
Component: | General | Version: | |
Keywords: | Drupal | Cc: | wim.leers@… |
Description
Rich attributes
This problem was discovered by using "rich attributes" (IIRC that's how @Reinmar described it): CKEditor Widgets whose data is stored in data-
attributes.
Drupal 8's drupalimagecaption
plugin extends CKEditor 4's image2
plugin, and instead of storing the caption in a <figcaption>
'tag', it stores it in a data-caption
'attribute'. This is where things go wrong.
Whenever the user writes plain HTML tags in the caption (i.e. in the widget's "caption" box enter something like <a href="http://drupal.org">test</a>
), the data stored in the data-caption
attribute is only HTML-encoded once, instead of twice, which upon upcasting then doesn't result in the entered HTML showing up, but in the 'rendered' HTML showing up.
See https://www.drupal.org/node/2460145.
I was convinced this was actually a bug on the Drupal side. So I started using CKEDITOR.tools.html(En|De)code(Attr)()
, but… doing so uncovered a whole new problem…
DX WTF: CKEDITOR.tools.html(En|De)code(Attr)()
Copy/pasted from the Chrome developer console, to demonstrate a bug.
var e = CKEDITOR.tools.htmlEncode undefined var d = CKEDITOR.tools.htmlDecode undefined e('<tag>') "<tag>" e(e('<tag>')) "&lt;tag&gt;" d(e('<tag>')) "<tag>" d(e(e('<tag>'))) "<tag>" d(e(e(e(e('<tag>'))))) "&amp;lt;tag&amp;gt;" d(e(e(e('<tag>')))) "&lt;tag&gt;" d(e(e('<tag>'))) "<tag>"
If this is by design, the documentation should warn developers. And it should be explained why.
Change History (13)
comment:1 Changed 10 years ago by
comment:2 Changed 10 years ago by
I pushed more commits to the branch:t/13105. I fixed the issue with decoding and pushed more tests for decodeAttr which apparently works ok, but tests in http://tests.ckeditor.dev:1030/tests/tickets/13105/1 still fail.
comment:3 Changed 10 years ago by
Status: | new → confirmed |
---|
comment:4 Changed 10 years ago by
Owner: | set to Piotrek Koszuliński |
---|---|
Status: | confirmed → assigned |
comment:5 Changed 10 years ago by
Failing on Chrome and Firefox:
- http://tests.ckeditor.dev:1030/tests/plugins/bbcode/bbcode
- http://tests.ckeditor.dev:1030/tests/plugins/link/mail_link
Additional test failing on IE11:
comment:6 Changed 10 years ago by
I pushed more commits to t/13105 which fix first two test files.
However, the problem on IE is that an attribute '('
is encoded to '&40;'
and this isn't correct. I have no idea though, where it happens in the code...
comment:7 Changed 10 years ago by
On IE:
CKEDITOR.instances.editor1.dataProcessor.toHtml( '<p><a href="a\nb">x</a></p>' ); <p><a data-cke-saved-href="a&#10;b" href="a&#10;b">x</a></p>
And it actually makes a perfect sense. When parsing the tree we find an attribute 'a b'
(it's already changed, because the input HTML is passed through a DOM element) and we try to decode it. However, we can decode only some entities (see tools.htmlDecodeAttr()
), so the string isn't changed.
Then, it lands in htmlParser.element.attributes.href
with the initial value = 'a b'
.
Finally, when writing the tree down to an HTML string all attributes are encoded... hence 'a&#10;b'
.
So it seems that we must improve the decode function to decode numeric entities. This should not be a big problem, because we can use String.fromCodePoint()
.
comment:8 Changed 9 years ago by
I've made a small research about what characters are being encoded and how when put into a text node or attribute. Results:
- tests – https://gist.github.com/Reinmar/e3a947705ff8a3697cca
- results – https://gist.github.com/Reinmar/3e73992ceca034d94a7a
Observations for HTML text nodes:
- We can ignore what happens with the null character (0).
- The change from
\r
(13) to\n
(10) is fortunately safe. - Then go 3 entities which we do decode in HTML –
<
,>
and&
. - Then goes
which we do not decode but which we should. - That would be all for normal browsers, but then go IEs.
- Disappearance of few characters from range 10-13 and 32 can be ignored – some lost white spaces. Well... unless they matter for the final DOM structure and that happens (
x<b> </b>y
) ;|. - Disappearance of char 60 (<) could be understood (it causes HTML parse error), but that of char 38 (&) is weird... Good that both are encoded by our methods. Apparently that preserves them.
- And then goes
­
– definitely worth decoding. - And finally few lost chars from the end of the range can be ignored.
Observations for attributes:
- Many similarities to text nodes of course.
and­
must be decoded.- Chars from range 1-31 are encoded by IEs using the decimal entity. All of those can be easily decoded using
String.fromCharCode()
.
comment:9 Changed 9 years ago by
I realised one additional thing. The role of htmlDecode()
and htmlDecodeAttr()
is to decode everything that could be encoded by the browsers. But nothing limits these functions from decoding more and this means that htmlDecode()
can also decode "
leaving htmlDecodeAttr()
unnecessary. And this is what I did in the final commit.
What's more, I also decided to go with a single regexp pass. Not for performance reasons (although I tested if there's no surprise waiting), but to avoid issues like Wim initially reported – that multiple separate passes will double decode something. And the risk of double decoding returned when I added numeric entities decoding in the second to last commit, because it's impossible to say which entities should be decoded first.
One more thing, if that's not clear:
- htmlEncode/htmlEncodeAttr's role is to encode characters that may be unsafe for HTML or which may be lost (as it happens on IE8 - see comment:8). This is the smallest possible subset. There's no sense in adding more, because people may not want other entities to be encoded. This should be a configurable behaviour of some additional plugins like the entities plugin.
- htmlDecode/htmlDecodeAttr are more free in this case, because they simply need to decode everything that a browser may encode. Hence, we needed the research made in comment:8 and some extensions that this ticket adds.
comment:10 Changed 9 years ago by
Milestone: | → CKEditor 4.4.8 |
---|---|
Status: | assigned → review |
Version: | 4.4.7 |
Pushed branch:t/13105. Have fun :)!
comment:11 Changed 9 years ago by
Status: | review → review_passed |
---|
comment:12 Changed 9 years ago by
Milestone: | CKEditor 4.4.8 → CKEditor 4.5.0 |
---|---|
Resolution: | → fixed |
Status: | review_passed → closed |
Closed with git:2c45c7636660f3 in major. I decided to merge it into 4.5.0 rather than 4.4.8 because the changes may have some impact on 3rd party plugins and integrations.
comment:13 Changed 9 years ago by
Late note: I was able to close https://www.drupal.org/node/2460145 thanks to this — thanks again! :)
I pushed branch:t/13105 with tests how a complicated attribute:
set inside data processor, while data is in pseudo-DOM format, is preserved when loading data retrieved from such editor. This test fails baaaadly.
More ideas for tests: