In this post, I’ll show you how to convert this dense research paper (“A Promising Path Towards Autoformalization and General Artificial Intelligence”) into an audiobook. Here’s what it looks like:

First, we’ll extract the text from the document using OCR. You could use lots of different types of tools to do this, like:

- Calamari, an open-source Python library.
- The (new!) Google Cloud Document AI API. This API extracts not only text but also intelligently parses tables and forms.

For this project, I used the Vision API (which is cheaper than the new Document AI API), and found the quality to be quite good. Check out Kaz’s GitHub repo to see exactly how you call the API.

When you pass a document through the Vision API, you get back both the raw text and layout information. Here’s what the response looks like:

As you can see, the API returns not just the raw text on the page, but also each character’s (x, y) position.

At this point, you could take all that raw text and dump it straight into an audiobook, if you’re a doofus. But you’re not a doofus, and you probably don’t want to do that, because then you’d be listening to all sorts of uninteresting artifacts like image captions, page numbers, document footers, and so on. So in the next step, we’ll decide which bits of raw text should be included in the audiobook.

What part of a research paper do we want to include in an audiobook? Probably the paper’s title, the author’s name, section headers, and body text, but none of these bits highlighted in red:

It turns out identifying those relevant sections is a tricky problem with lots of possible solutions. In this post, I’ll show you two approaches, one that’s quick ‘n dirty and one that’s high-quality but a bit more work.

Finding Relevant Text with Machine Learning

When you look at a research paper, it’s probably easy for you to gloss over the irrelevant bits just by noting the layout: titles are large and bolded; captions are small; body text is medium-sized and centered on the page. Using spatial information about the layout of the text on the page, we can train a machine learning model to do that, too. We show the model a bunch of examples of body text, header text, and so on, and hopefully it learns to recognize them. This is the approach that Kaz, the original author of this project, took when trying to turn textbooks into audiobooks.

Earlier in this post, I mentioned that the Google Cloud Vision API returns not just text on the page, but also its layout. It groups text into chunks (pages, blocks, paragraphs, words, and characters) and returns each chunk’s location on the page. In particular, for each word, it returns a bounding box that looks like this:
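To make the layout information concrete, here is a minimal sketch of how word-level bounding boxes could be turned into simple layout features such as height and horizontal center. It is not the code from Kaz’s repo. It assumes the Vision API response has already been converted to a plain dict in the shape of the API’s JSON `fullTextAnnotation` (nested `pages`, `blocks`, `paragraphs`, `words`, and `symbols`, with each word carrying a `boundingBox` of four `vertices`). The helper names `box_features` and `iter_words` and the sample data are made up for illustration.

```python
from typing import Dict, Iterator


def box_features(bounding_box: dict) -> Dict[str, float]:
    """Summarize a bounding box (a list of x/y vertices) as layout features."""
    vertices = bounding_box.get("vertices", [])
    if not vertices:
        return {"left": 0, "top": 0, "width": 0, "height": 0, "center_x": 0.0}
    # In the JSON form a coordinate equal to 0 may be omitted, so default to 0.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    left, top = min(xs), min(ys)
    width, height = max(xs) - left, max(ys) - top
    return {
        "left": left,
        "top": top,
        "width": width,
        "height": height,  # rough proxy for font size
        "center_x": left + width / 2,
    }


def iter_words(annotation: dict) -> Iterator[dict]:
    """Walk pages > blocks > paragraphs > words and yield text plus features."""
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word's text is the concatenation of its symbols.
                    text = "".join(
                        s.get("text", "") for s in word.get("symbols", [])
                    )
                    yield {"text": text, **box_features(word.get("boundingBox", {}))}


# Hypothetical one-word annotation, hand-written in the assumed shape.
sample = {
    "pages": [{
        "blocks": [{
            "paragraphs": [{
                "words": [{
                    "symbols": [{"text": "H"}, {"text": "i"}],
                    "boundingBox": {"vertices": [
                        {"x": 10, "y": 20}, {"x": 50, "y": 20},
                        {"x": 50, "y": 44}, {"x": 10, "y": 44},
                    ]},
                }],
            }],
        }],
    }],
}

words = list(iter_words(sample))
for w in words:
    print(w)
```

Features like these (box height as a stand-in for font size, horizontal center as a stand-in for alignment) are the kind of spatial signal a classifier could use to tell titles, captions, and body text apart. Which features actually work best is something you would have to check on your own documents.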