I have some PDFs in Hindi that contain extractable text. I used pdfminer.six with Python 3.6 to do the extraction. The output looks like:
As one can see, a number of characters are converted into the form "(cid:number)".
On further analysis, I found out that a PDF contains CMaps, which map character codes to glyph indices. So, as I understand it, a CID is a character identifier for the glyph it maps to inside the CMap table.
But how are these character codes related to Unicode values? Basically, how is a PDF viewer able to show the glyph using this mapping?
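To make the question concrete: if I had a table from CID numbers to Unicode characters, substituting the placeholders would be easy. Below is a minimal sketch of that step. The function name `replace_cids` and the table contents are made up for illustration (the CID numbers are not taken from any real font), and it assumes the placeholders appear exactly as "(cid:number)" in the extracted text. What I don't know is where the real table would come from.

```python
import re

# Matches placeholders of the form "(cid:123)" in the extracted text.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_cids(text, cid_to_unicode):
    """Replace each "(cid:N)" placeholder using a CID -> character table.

    Placeholders whose CID is not in the table are left unchanged.
    """
    def substitute(match):
        cid = int(match.group(1))
        return cid_to_unicode.get(cid, match.group(0))

    return CID_PATTERN.sub(substitute, text)


# Hypothetical table: these CID numbers are invented for the example.
sample_table = {1: "\u0915", 2: "\u0916"}  # Devanagari KA, KHA
sample_text = "(cid:1)(cid:2) abc (cid:99)"

result = replace_cids(sample_text, sample_table)
print(result)
```

Unknown CIDs are kept as they are so that nothing is silently dropped from the text.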
Moreover, according to a comment on this similar question, this process may not be legal. But I'm not trying to steal anyone's font; I only want the text. How could this process be illegal?
Since there are many questions like this one, I want to