Fix character decoding regression when `title` precedes `meta[charset]` #106

jtojnar · 2025-05-27T23:12:03Z

Fix character decoding regression when title precedes meta[charset]

Because of PHP 8.2 deprecation, in f14428e, we stopped converting non-ASCII characters to HTML entities. Instead, we started to explicitly insert meta[charset] tag at the start of the document.

Later, we discovered that was breaking html[lang] so, in efbbc86, we made the insertion smarter. One of the improvements was that it would not insert the meta[charset] tag when it was already present.

That, however, broke websites that had title tag before meta[charset]. On those, libxml2 would decode the title contents as ISO-8859-1.

We could improve the logic (e.g. check that there is not text content before meta[charset]) or insert the tag unconditionally but it will probably be simplest to just go back to converting the non-ASCII characters to entities, just using non-deprecated function variant.

Fixes: wallabag/wallabag#8158

The fix introduced in efbbc86 alongside this test also manipulates `meta[charset]` but we were not checking if it does not break encoding.

Because of PHP 8.2 deprecation, in f14428e, we stopped converting non-ASCII characters to HTML entities. Instead, we started to explicitly insert `meta[charset]` tag at the start of the document. Later, we discovered that was breaking `html[lang]` so, in efbbc86, we made the insertion smarter. One of the improvements was that it would not insert the `meta[charset]` tag when it was already present. That, however, broke websites that had `title` tag before `meta[charset]`. On those, libxml2 would decode the `title` contents as ISO-8859-1. We could improve the logic (e.g. check that there is not text content before `meta[charset]`) or insert the tag unconditionally but it will probably be simplest to just go back to converting the non-ASCII characters to entities, just using non-deprecated function variant.

jtojnar added 2 commits May 28, 2025 00:51

tests: Check encoding was preserved in testHtmlLang

3e9b15d

The fix introduced in efbbc86 alongside this test also manipulates `meta[charset]` but we were not checking if it does not break encoding.

j0k3r approved these changes May 28, 2025

View reviewed changes

j0k3r merged commit 3042990 into j0k3r:master Jun 3, 2025
10 checks passed

j0k3r mentioned this pull request Jun 3, 2025

[1.x] Backport character decoding regression #107

Merged

jtojnar deleted the encode branch June 3, 2025 07:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix character decoding regression when `title` precedes `meta[charset]` #106

Fix character decoding regression when `title` precedes `meta[charset]` #106

Uh oh!

jtojnar commented May 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Fix character decoding regression when title precedes meta[charset] #106

Fix character decoding regression when title precedes meta[charset] #106

Uh oh!

Conversation

jtojnar commented May 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix character decoding regression when `title` precedes `meta[charset]` #106

Fix character decoding regression when `title` precedes `meta[charset]` #106