I’ve been working on a way of analysing a page of text stored as an image (a PNG file) to try and break the page up into normal text, separated text, mathematical equations and things that are just diagrams. I’m posting it here in the hope of getting some feedback.
I posted this previously on Reddit but it didn’t get a lot of traction or start much of a discussion. I am hoping it might here.
I am by no means an expert in JavaScript. My language of choice has been Python, but there was no good way of getting visual feedback quickly and easily on how well my algorithm was working.
At the moment the whole thing is experimental and more of a proof of concept. In the future I am hoping it will be able to analyse uploaded PDFs and convert the paragraph text to plain text (possibly using something like Tesseract) and the mathematical equations to AsciiMath or LaTeX (using something like Mathpix). Then I would be able to pass the result into some kind of text-to-speech engine.
I am still at the preliminary stages but a working demo is on https://jsdemo.kiwiheretic.xyz/
Source code is accessible from there but is also on https://github.com/kiwiheretic/textextract
Someone, after viewing an earlier version of my project, suggested that Tesseract should already do this. However, as far as I can tell, Tesseract only gives the location of decipherable text on the page and not things like diagrams.
It was written initially in Python but I converted it to JavaScript because it was far easier to debug visually with Chrome developer tools. As a side benefit it also seems to run much faster. It’s not particularly optimised, so I am sure it could go faster still.
The process of boxing the characters and rows of text is probably grossly inefficient, but the basic idea is as follows:
Pixel data is converted into horizontal line segments by the data_to_raster function. The line segments are stored in an array of the format [y, [x1, x2], [x3, x4], [x5, x6], … ] where y is the vertical offset of that particular image pixel row and an entry like [x1, x2] says that all the pixels from x1 to x2 are coloured in.
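To make the segment format concrete, here is a minimal sketch of that kind of conversion. It is not the project's data_to_raster: the function name rowsToSegments, the flat RGBA input (as in a canvas ImageData.data array), the brightness threshold and the choice to skip blank rows are all assumptions for illustration.

```javascript
// Hypothetical sketch, not the project's data_to_raster.
// Assumes `data` is a flat RGBA array (as in canvas ImageData.data) and that
// a pixel counts as "coloured in" when the average of R, G, B is below `threshold`.
function rowsToSegments(data, width, height, threshold = 128) {
  const rows = [];
  for (let y = 0; y < height; y++) {
    const row = [y];
    let start = -1; // x where the current run of dark pixels began, or -1
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;
      const dark = (data[i] + data[i + 1] + data[i + 2]) / 3 < threshold;
      if (dark && start < 0) start = x;
      if (!dark && start >= 0) {
        row.push([start, x - 1]); // close the run at the previous pixel
        start = -1;
      }
    }
    if (start >= 0) row.push([start, width - 1]); // run reaches the right edge
    if (row.length > 1) rows.push(row); // assumption: rows with no dark pixels are skipped
  }
  return rows;
}
```

In a browser the input could come from a canvas 2D context's getImageData(0, 0, width, height).data.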
Then find_boxes attempts to reread that array and turn it into boxes containing individual characters. It returns an array of arrays where each inner array takes the form [x1, y1, x2, y2], which describes the bounding box for the character inside. I initially thought it should try and find glyphs (character shapes) by working out how the individual dots were connected to make up a character. However, this doesn’t work for the small letter “i”, for example, whose dot at the top is not connected to the rest of the letter. Therefore what I opted for in the end is that a given character is anything that overlaps with its horizontal space. This does produce the occasional false positive (like on the first line of the demo, where “Ch” is wrongly guessed as one character) but doesn’t seem to cause any real problems.
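The horizontal-overlap rule can be sketched roughly like this. This is not the project's find_boxes: the name boxesByHorizontalOverlap is made up, and it assumes the raster rows passed in already belong to a single line of text, which may not be how the real code divides the work.

```javascript
// Hypothetical sketch, not the project's find_boxes.
// Assumes `rows` holds raster rows for ONE line of text, each shaped
// [y, [x1, x2], ...]. Segments whose x-ranges overlap are merged into one box.
function boxesByHorizontalOverlap(rows) {
  // Turn every segment into a one-pixel-high box, sorted by left edge.
  const spans = [];
  for (const [y, ...segments] of rows) {
    for (const [x1, x2] of segments) spans.push([x1, y, x2, y]);
  }
  spans.sort((a, b) => a[0] - b[0]);

  const boxes = [];
  for (const [x1, y1, x2, y2] of spans) {
    const last = boxes[boxes.length - 1];
    if (last && x1 <= last[2]) {
      // Overlaps the current box horizontally: grow it.
      last[1] = Math.min(last[1], y1);
      last[2] = Math.max(last[2], x2);
      last[3] = Math.max(last[3], y2);
    } else {
      boxes.push([x1, y1, x2, y2]);
    }
  }
  return boxes;
}
```

With this rule the dot and stem of an “i” end up in one box because they share the same columns, and two neighbouring glyphs whose horizontal extents overlap also come out as one box, which is the same kind of false positive as the “Ch” case.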
The get_row_sorted function, which is rather badly named, sorts the character boxes in the array into the order in which they appear on the page, from left to right.
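One way to get boxes into reading order is to bucket them into rows by vertical overlap and then sort each row by its left edge. The sketch below does that; it is not the project's get_row_sorted, and the name sortIntoRows and the vertical-overlap rule are assumptions.

```javascript
// Hypothetical sketch, not the project's get_row_sorted.
// Assumes boxes are [x1, y1, x2, y2] and that a box belongs to a row when
// its vertical range overlaps the row's running vertical range.
function sortIntoRows(boxes) {
  const byTop = [...boxes].sort((a, b) => a[1] - b[1]);
  const rows = []; // each entry: { top, bottom, boxes }
  for (const box of byTop) {
    const row = rows.find((r) => box[1] <= r.bottom && box[3] >= r.top);
    if (row) {
      row.boxes.push(box);
      row.bottom = Math.max(row.bottom, box[3]);
    } else {
      rows.push({ top: box[1], bottom: box[3], boxes: [box] });
    }
  }
  // Rows were created in top-to-bottom order; order each one left to right.
  return rows.map((r) => r.boxes.sort((a, b) => a[0] - b[0]));
}
```

The result is a two-dimensional array: the outer index is the line and the inner index is the position of the character on that line.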
The group_into_paragraphs function attempts to divide the rows into paragraphs by computing the quartile of the row spacing and then using that to determine if the previous row belongs to the same paragraph as the next row.
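As a rough illustration of the quartile idea, the sketch below measures the gap between consecutive rows and starts a new paragraph when a gap is clearly larger than the upper-quartile gap. It is not the project's group_into_paragraphs: the { top, bottom } row shape, the choice of the upper quartile, the simple nearest-rank quantile and the 1.5 factor are all guesses.

```javascript
// Hypothetical sketch, not the project's group_into_paragraphs.
// Assumes each row is { top, bottom } in pixels, ordered top to bottom.
// Which quartile to use and the 1.5 factor are guesses for illustration.
function gapQuantile(values, q) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(q * (sorted.length - 1))];
}

function groupRowsIntoParagraphs(rows, factor = 1.5) {
  if (rows.length === 0) return [];
  const gaps = [];
  for (let i = 1; i < rows.length; i++) {
    gaps.push(rows[i].top - rows[i - 1].bottom);
  }
  const threshold = gaps.length ? gapQuantile(gaps, 0.75) * factor : Infinity;

  const paragraphs = [[rows[0]]];
  for (let i = 1; i < rows.length; i++) {
    if (gaps[i - 1] > threshold) paragraphs.push([]); // big gap: new paragraph
    paragraphs[paragraphs.length - 1].push(rows[i]);
  }
  return paragraphs;
}
```

A page where every gap is about the same size produces a single paragraph, which seems like the safest failure mode for a threshold of this kind.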
The stepBoxes.step function is then used to pull all the characters out of a two-dimensional array (called rowData), where the major index refers to which “line” the characters are on and the minor index is the position of that character on the line. Then I do some crude statistical analysis on the median and upper quartile of the spacing between characters in order to further break up a row into independent segments. (You can see this in the demo on the first line and also on many of the lines containing mathematical equations.) This method is pretty crude and there is probably a better way of doing this. The whole function runs on a timer callback so that it runs slowly enough to make it easy to see what is happening.
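The segment-splitting step might look something like the sketch below. It is not the project's stepBoxes.step: it has no timer, it uses only the upper quartile for the cut-off (how the median and upper quartile are actually combined isn't spelled out above), and the name splitRowIntoSegments and the factor of 2 are assumptions.

```javascript
// Hypothetical sketch, not the project's stepBoxes.step.
// Assumes `row` is an array of [x1, y1, x2, y2] boxes sorted left to right.
// The cut-off (factor times the upper-quartile gap) is a guess for illustration.
function upperQuartile(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(0.75 * (sorted.length - 1))];
}

function splitRowIntoSegments(row, factor = 2) {
  if (row.length === 0) return [];
  if (row.length === 1) return [[row[0]]];
  const gaps = [];
  for (let i = 1; i < row.length; i++) {
    gaps.push(row[i][0] - row[i - 1][2]); // left edge minus previous right edge
  }
  const threshold = upperQuartile(gaps) * factor;

  const segments = [[row[0]]];
  for (let i = 1; i < row.length; i++) {
    if (gaps[i - 1] > threshold) segments.push([]); // unusually wide gap: new segment
    segments[segments.length - 1].push(row[i]);
  }
  return segments;
}
```

Computing the quartile needs the gaps sorted, so the row is walked once to collect the gaps and once more to split it.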
Plans for the future include a Node.js backend and the ability to load PDFs and have it produce a new PDF with the images of text replaced by actual OCR’d text.
This code is by no means a work of art. It needs to be refactored. I am just not 100% sure yet where that refactoring should occur. Certainly it could be made more DRY. The code could use JavaScript generators instead of loops in some places. However, I think there are also places where it is unavoidable to run through an array or list twice, for instance in computing the median spacing between characters, which also requires a sort. I probably also need to do the same thing for the text rows in order to identify where the paragraphs are.
Even at this stage I am happy for anyone to point me to a library or web application that already does this. Especially if it can convert the mathematical equations into text as well. (I just don’t know of any.)
Anyway, constructive criticism and advice are welcomed.
Thanks