One possible content filter would be the Readability API. It does a very good job of stripping a web page down to just the clean article text:
https://www.readability.com/developers/api
Another option would be to just include wallabag's cleanup algorithm:
https://github.com/wallabag/wallabag/tree/master/inc/poche
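To make the idea of a cleanup filter a bit more concrete, here is a rough sketch in Python using only the standard library. It is not Readability's or wallabag's algorithm, just a naive stand-in that drops markup and obvious page chrome; the names `SKIPPED_TAGS`, `TextExtractor` and `clean_html` are made up for this example.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome rather than article text.
# This list is an illustrative assumption, not taken from Readability or wallabag.
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def clean_html(html):
    """Return the visible text of `html`, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "\n".join(parser.chunks)
```

Calling `clean_html(page_source)` on a fetched page returns only the text outside the skipped elements. A real cleanup algorithm also scores blocks to find the main article body, which this sketch does not attempt.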