@jtojnar jtojnar commented May 27, 2025

Fix character decoding regression when title precedes meta[charset]

Because of a PHP 8.2 deprecation, we stopped converting non-ASCII characters to HTML entities in f14428e. Instead, we started explicitly inserting a `meta[charset]` tag at the start of the document.

Later, we discovered that this was breaking `html[lang]`, so in efbbc86 we made the insertion smarter. One of the improvements was that it would no longer insert the `meta[charset]` tag when one was already present.

That, however, broke websites that had a `title` tag before `meta[charset]`. On those, libxml2 would decode the `title` contents as ISO-8859-1.
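To make the symptom concrete, here is a minimal sketch in Python (not the library's PHP code, and it does not call libxml2) of what happens when UTF-8 bytes are read as ISO-8859-1. The sample title is made up for illustration.

```python
# Hypothetical sample title containing one non-ASCII character.
title = "Café"

# The page is stored as UTF-8, so "é" becomes the two bytes 0xC3 0xA9.
utf8_bytes = title.encode("utf-8")

# Reading those bytes as ISO-8859-1 maps each byte to its own character,
# which is the kind of mis-decoding described above.
misdecoded = utf8_bytes.decode("iso-8859-1")

print(misdecoded)  # CafÃ©
```

The stray `Ã` is the same character that appears in the title of the linked wallabag issue.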

We could improve the logic (e.g. check that there is no text content before `meta[charset]`) or insert the tag unconditionally, but it will probably be simplest to go back to converting the non-ASCII characters to entities, just using a non-deprecated variant of the function.
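As a rough illustration of why entity conversion sidesteps the problem, the following Python sketch replaces every non-ASCII character with a numeric character reference. It is not the PHP code from this pull request; the function name `encode_non_ascii_as_entities` and the sample markup are assumptions made for the example.

```python
import html


def encode_non_ascii_as_entities(markup: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return markup.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Hypothetical page where the title comes before the charset declaration.
page = '<title>Café</title><meta charset="utf-8">'
encoded = encode_non_ascii_as_entities(page)
print(encoded)  # <title>Caf&#233;</title><meta charset="utf-8">

# The result is pure ASCII, so reading its bytes as ISO-8859-1 or as UTF-8
# yields the same text regardless of where the charset declaration sits.
raw = encoded.encode("ascii")
assert raw.decode("iso-8859-1") == raw.decode("utf-8")

# Resolving the references gives back the original characters.
assert html.unescape(encoded) == page
```

Because the document no longer contains any bytes above 0x7F, the parser's guess about the encoding stops mattering for the text content.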

Fixes: wallabag/wallabag#8158

jtojnar added 2 commits May 28, 2025 00:51
The fix introduced in efbbc86 alongside this test also manipulates `meta[charset]`, but we were not checking that it does not break encoding.
Because of a PHP 8.2 deprecation, we stopped converting non-ASCII characters to HTML entities in f14428e. Instead, we started explicitly inserting a `meta[charset]` tag at the start of the document.

Later, we discovered that this was breaking `html[lang]`, so in efbbc86 we made the insertion smarter. One of the improvements was that it would no longer insert the `meta[charset]` tag when one was already present.

That, however, broke websites that had a `title` tag before `meta[charset]`. On those, libxml2 would decode the `title` contents as ISO-8859-1.

We could improve the logic (e.g. check that there is no text content before `meta[charset]`) or insert the tag unconditionally, but it will probably be simplest to go back to converting the non-ASCII characters to entities, just using a non-deprecated variant of the function.
@j0k3r j0k3r merged commit 3042990 into j0k3r:master Jun 3, 2025
10 checks passed
@jtojnar jtojnar deleted the encode branch June 3, 2025 07:29
Development

Successfully merging this pull request may close these issues.

[2.6.12 regression] Latin1 instead of UTF-8 used for entries (Ã)
