Large corpora such as PubMed abstracts are assumed to be tagged with entity types. Can you describe how this tagging is done? If it is done by dictionary matching, how many sentences satisfy both of the following requirements: 1) the sentence contains at least two entities, and 2) one entity is of gene type and another is of disease type?
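
To make the question concrete, the count I am asking about could be computed roughly as in the sketch below. This is only an illustration under my own assumptions, not a description of the authors' pipeline: the dictionaries `GENE_TERMS` and `DISEASE_TERMS`, the helper names, and the example sentences are all hypothetical, and the case-insensitive whole-word matching is a simplification of what a real dictionary tagger would do.

```python
import re

# Hypothetical toy dictionaries; a real setup would load gene and disease
# lexicons from curated resources.
GENE_TERMS = {"BRCA1", "TP53"}
DISEASE_TERMS = {"breast cancer", "lung cancer"}


def find_terms(sentence, terms):
    """Return the dictionary terms that occur in the sentence as whole words."""
    found = set()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        # Case-insensitive matching is a simplification; gene symbols are
        # often case-sensitive in practice.
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.add(term)
    return found


def satisfies_requirements(sentence):
    """True if the sentence has at least one gene match and one disease match.

    One gene plus one disease already implies at least two entities,
    so requirement 1 does not need a separate check here.
    """
    genes = find_terms(sentence, GENE_TERMS)
    diseases = find_terms(sentence, DISEASE_TERMS)
    return bool(genes) and bool(diseases)


def count_qualifying(sentences):
    """Count the sentences that satisfy both requirements."""
    return sum(1 for s in sentences if satisfies_requirements(s))


# Made-up example sentences, for illustration only.
sentences = [
    "Mutations in BRCA1 increase the risk of breast cancer.",
    "TP53 is a tumor suppressor gene.",
    "Lung cancer incidence has declined.",
]
print(count_qualifying(sentences))  # 1
```

Only the first example sentence contains both a gene and a disease term, so it is the only one counted. Reporting the equivalent number for the full corpus, together with the total number of sentences, would answer the question.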