Some abbreviations are written with inconsistent whitespace, for example "e. g." spelled with a space. The tokenizer should have a way to remove the spaces in such abbreviations, driven by a list kept in a file, and possibly produce an annotation that records the original spelling (maybe sic+hi@rend="x-space"):
e.g.
Or adding an attribute with the original spelling (could do , though that is not really TEI)
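
The normalization step could be sketched roughly as follows. This is only an illustration under assumptions, not the tokenizer's actual code: the names `ABBREVIATIONS`, `build_pattern` and `normalize` are hypothetical, the list is inlined here instead of being read from a file, and how the recorded original spelling is serialized (as an annotation or as an attribute) is deliberately left open.

```python
import re

# Hypothetical abbreviation list; in practice this would be read from a file,
# one canonical (space-free) form per line.
ABBREVIATIONS = ["e.g.", "i.e.", "z.B."]


def build_pattern(abbreviations):
    """Build one regex that matches each abbreviation, allowing optional
    whitespace after every period except the final one."""
    alternatives = []
    for abbr in abbreviations:
        parts = [p for p in abbr.split(".") if p]
        alternatives.append(r"\.\s*".join(re.escape(p) for p in parts) + r"\.")
    # The lookbehind keeps the match from starting in the middle of a word.
    return re.compile(r"(?<!\w)(?:" + "|".join(alternatives) + ")")


def normalize(text, abbreviations=ABBREVIATIONS):
    """Return the text with spaces removed inside known abbreviations, plus a
    list of (canonical, original) pairs for every spelling that was changed."""
    pattern = build_pattern(abbreviations)
    originals = []

    def replace(match):
        original = match.group(0)
        canonical = re.sub(r"\s+", "", original)
        if canonical != original:
            # Keep the original spelling so it can be annotated later.
            originals.append((canonical, original))
        return canonical

    return pattern.sub(replace, text), originals
```

For example, `normalize("see e. g. the list, i.e. this")` returns the text `"see e.g. the list, i.e. this"` together with `[("e.g.", "e. g.")]`; the second value is what an annotation or attribute could be generated from. The match is case-sensitive, so capitalized variants would need their own list entries.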