I’ve not worked with the Text Extractor plugin or the other tools you mention, but looking at the text you provided as an example in that last part: could it be stripped of all white space and then used in your search?
It seems like that text just introduces a lot of spaces where you didn’t intend for them to be, but other than that the text seems to be coherent enough for searching.
You could end up with some false positives if you search for something which happens to be the end of one word and the start of another. (For example, if you searched the previous sentence for “fan”, it would produce a hit on “oFANother”.) But I’m not sure this would be a big issue in general.
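To make the idea concrete, here is a minimal Python sketch of a whitespace-insensitive search. It is only an illustration, not how the plugin works; the function names `strip_whitespace` and `contains_ignoring_whitespace` and the sample strings are made up for this example.

```python
import re


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return re.sub(r"\s+", "", text)


def contains_ignoring_whitespace(haystack: str, needle: str) -> bool:
    """Case-insensitive substring search that ignores all whitespace.

    Both the extracted text and the query are stripped, so stray spaces
    introduced by the extractor no longer break the match.
    """
    return strip_whitespace(needle).lower() in strip_whitespace(haystack).lower()


# Hypothetical extractor output with spaces in the wrong places.
garbled = "Th e ex tra cted te xt is sti ll sear ch able"
print(contains_ignoring_whitespace(garbled, "extracted text"))

# The false-positive case: "fan" spans the boundary of "of another".
sentence = "the end of one word and the start of another"
print(contains_ignoring_whitespace(sentence, "fan"))
```

Both calls report a match, and the second one is the kind of false positive to expect, since the word boundaries are gone once the spaces are removed.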
@holroy I was thinking along similar lines. If the extractor strips out all the white space, the text will be searchable, though not 100% correct, so it would only be a temporary fix. The better option is to use the embedded text properly. I don’t see any stray spaces in the embedded text.
I tried pdfminer and the result is encouraging. As all the text is embedded in the PDF, it was able to extract the text with quite good accuracy. @scambier maybe text-extractor was not extracting this embedded text but was instead using tesseract to OCR the PDF, which is not very accurate. Is it possible to use pdfminer or any better API? I am ready to spend some time and contribute back.
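A minimal sketch of reading the embedded text layer with pdfminer, assuming the `pdfminer.six` package is installed. `extract_text` is its high-level helper; the `needs_ocr` check, the 20-character threshold, and the file name are illustrative assumptions, not anything taken from text-extractor.

```python
def extract_embedded_text(pdf_path: str) -> str:
    """Return the text layer embedded in a PDF, without running OCR."""
    # Imported lazily so the rest of the module works without pdfminer.six.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Guess whether a PDF lacks a usable text layer.

    Assumption: if fewer than `min_chars` non-whitespace characters were
    extracted, the file is probably a scan and OCR is the only option.
    """
    non_whitespace = "".join(text.split())
    return len(non_whitespace) < min_chars


# Usage (the file name is hypothetical):
#
#   text = extract_embedded_text("example.pdf")
#   if needs_ocr(text):
#       ...  # fall back to OCR, e.g. tesseract
#   else:
#       ...  # index the embedded text directly
```

The point of the check is to prefer the embedded text when it exists and to fall back to OCR only for scanned documents.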