OCR
Extracting Text from Images
To extract text from images, a method called optical character recognition (OCR) is used. To process an image on the Diggernaut platform, the digger first needs to load it (the digger automatically converts every loaded image to an XML document with an element that holds the binary content encoded as base64), then extract the base64-encoded image content from that XML into the register. After that, you can use the ocr command.
You can use the following parameters:
Parameter | Description |
---|---|
resize | Optional. A positive integer giving the percent change in the size of the image relative to the original. Use it to improve recognition quality when the image is recognized poorly at its original size. |
do | Commands that should be run in the OCR context. |
The command expects a base64-encoded image in the register. After the command runs, the digger switches to the OCR context. In this context you can use the text command to transfer the recognized text to the register.
Let's look at an example where the page has an image embedded in the HTML:
<img src="">
The following image is encoded in this base64 string:
To extract the text, the digger code will look like this:
# FIND ELEMENT WITH EMBEDDED IMAGE
- find:
    path: img
    do:
    # PARSE ATTR `src` AND EXTRACT ONLY BASE64 ENCODED PART
    - parse:
        attr: src
        filter: data\:image\/png\;base64\,(.+)
    # REGISTER NOW HAS IMAGE ENCODED IN BASE64
    # SWITCHING TO OCR CONTEXT
    - ocr:
        do:
        # GET RECOGNIZED TEXT TO THE REGISTER
        - text
        # REGISTER VALUE: Hello world
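To make the `filter` step more concrete, here is a minimal Python sketch of what that regular expression captures from a `src` attribute. It only illustrates the pattern and is not how Diggernaut works internally; the helper name `extract_base64_payload` and the sample bytes are hypothetical placeholders.

```python
import base64
import re
from typing import Optional

# Same pattern as the `filter` option in the digger code above.
DATA_URI_FILTER = re.compile(r"data\:image\/png\;base64\,(.+)")


def extract_base64_payload(src: str) -> Optional[str]:
    """Return the base64 part of a PNG data URI, or None if it does not match."""
    match = DATA_URI_FILTER.search(src)
    return match.group(1) if match else None


# Hypothetical `src` value built from a few placeholder bytes (not a real image).
sample_bytes = b"\x89PNG\r\n\x1a\n"
sample_src = "data:image/png;base64," + base64.b64encode(sample_bytes).decode("ascii")

payload = extract_base64_payload(sample_src)  # only the base64 text remains
decoded = base64.b64decode(payload)           # back to the original bytes
```

Only the captured group ends up in the register, which is why the `data:image/png;base64,` prefix has to be stripped before the ocr command is called.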
Now let's see how to do the same when there is a link to an image file:
---
config:
    debug: 2
    agent: Firefox
do:
# LOAD IMAGE (SAME AS LOADING ANY PAGE)
- walk:
    to: https://www.diggernaut.com/sandbox/captcha_3.jpg
    do:
    # DIGGER WILL CONVERT BINARY IMAGE TO XML DOCUMENT WITH IMAGE ENCODED AS BASE64
    # IMAGE WILL BE KEPT IN THE `imgbase64` TAG
    - find:
        path: imgbase64
        do:
        - parse
        - ocr:
            resize: 40
            do:
            - text
            - variable_set: imgtext
            # AS A RESULT WE WILL GET THIS EXTRACTED TEXT: W68HP
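The conversion described in the comments above can be pictured with a short Python sketch. This is an assumption-laden illustration, not Diggernaut's actual implementation: only the `imgbase64` tag name comes from the example, while the `image` root element, the function names, and the placeholder bytes are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_image_as_xml(image_bytes: bytes) -> str:
    """Build an XML document holding the image as base64 text.

    The `imgbase64` tag name comes from the example above; the `image`
    root element is a hypothetical placeholder.
    """
    root = ET.Element("image")
    node = ET.SubElement(root, "imgbase64")
    node.text = base64.b64encode(image_bytes).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def read_register_value(xml_text: str) -> str:
    """Mimic `find` on `imgbase64` followed by `parse`: return the element text."""
    root = ET.fromstring(xml_text)
    node = root.find("imgbase64")
    if node is None or node.text is None:
        raise ValueError("no imgbase64 element found")
    return node.text


# Placeholder bytes standing in for a downloaded image file.
fake_image = b"\xff\xd8\xff\xe0 placeholder, not a real JPEG"
xml_doc = wrap_image_as_xml(fake_image)
register = read_register_value(xml_doc)  # base64 text, as the ocr command expects
restored = base64.b64decode(register)    # decodes back to the original bytes
```

The point of the sketch is that after `find` and `parse` the register holds plain base64 text, which is the input format the ocr command expects.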
Does a captcha on the source website keep you from scraping the data? Learn how to bypass it.