OCR

Extracting Text from Images

Optical character recognition (OCR) is used to extract text from images. To process an image on the Diggernaut platform, the digger must first load it (the digger automatically converts every image to an XML document with an element that holds the binary content encoded as base64), then extract the base64-encoded image content from the XML into the register. After that, you can use the ocr command.

You can use the following parameters:

Parameter Description
resize An optional parameter that takes a positive integer: the percentage by which the image size is changed relative to the original. Use it to improve recognition quality when the image is recognized poorly at its original size.
do Commands that should be run in the OCR context.

The command expects a base64-encoded image in the register. After the command is invoked, the digger switches to the OCR context, where you can use the text command to transfer the recognized text to the register.

Let's look at an example where the page contains an image embedded in the HTML:

          <img src="">
          

The base64 data in the src attribute encodes an image containing the text "Hello world".

To extract the text, the digger code looks like this:

# FIND ELEMENT WITH EMBEDDED IMAGE
- find:
    path: img
    do:
    # PARSE ATTR `src` AND EXTRACT ONLY BASE64 ENCODED PART
    - parse:
        attr: src
        filter: data\:image\/png\;base64\,(.+)

    # REGISTER NOW HAS IMAGE ENCODED IN BASE64
    # SWITCHING TO OCR CONTEXT
    - ocr:
        do:
        # GET RECOGNIZED TEXT TO THE REGISTER
        - text
        # REGISTER VALUE: Hello world
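As a rough illustration of what the filter in the parse step does, the Python sketch below applies the same regular expression to a data URI and keeps only the captured base64 part. It runs outside Diggernaut; the function name extract_base64_payload and the sample bytes are hypothetical and are not part of the platform.

```python
import base64
import re
from typing import Optional

# Same pattern as the `filter` value in the digger config above.
DATA_URI_PATTERN = re.compile(r"data\:image\/png\;base64\,(.+)")


def extract_base64_payload(src: str) -> Optional[str]:
    """Return the base64 part of a PNG data URI, or None if `src` is not one."""
    match = DATA_URI_PATTERN.search(src)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical sample: the first bytes of a PNG file, encoded as a data URI.
    raw = b"\x89PNG\r\n\x1a\n"
    src = "data:image/png;base64," + base64.b64encode(raw).decode("ascii")

    payload = extract_base64_payload(src)
    # Decoding the captured group gives back the original bytes.
    print(base64.b64decode(payload) == raw)
```

The captured group is what ends up in the register, so the ocr command receives only the encoded image and not the data: prefix.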
            

Now let's see how to handle the case where there is a link to an image file:

---
config:
    debug: 2
    agent: Firefox
do:
# LOAD THE IMAGE (THE SAME WAY WE LOAD ANY PAGE)
- walk:
    to: https://www.diggernaut.com/sandbox/captcha_3.jpg
    do:
    # DIGGER WILL CONVERT BINARY IMAGE TO XML DOCUMENT WITH IMAGE ENCODED AS BASE64
    # IMAGE WILL BE KEPT IN THE `imgbase64` TAG
    - find:
        path: imgbase64
        do:
        - parse
        - ocr:
            resize: 40
            do:
            - text
            - variable_set: imgtext
            # AS A RESULT, WE GET THIS EXTRACTED TEXT: W68HP
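To picture the conversion step, the Python sketch below wraps binary image data in an XML document as base64 text and reads it back. Only the imgbase64 tag name comes from the example above; the root element name and the overall document layout are assumptions, and the real XML produced by the digger may differ.

```python
import base64
import xml.etree.ElementTree as ET


def image_to_xml(image_bytes: bytes) -> str:
    """Wrap binary image data in an XML document as base64 text."""
    root = ET.Element("document")  # hypothetical root element name
    node = ET.SubElement(root, "imgbase64")  # tag name used in the digger config
    node.text = base64.b64encode(image_bytes).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def xml_to_image(xml_text: str) -> bytes:
    """Find the imgbase64 element and decode its text back to bytes."""
    root = ET.fromstring(xml_text)
    node = root.find("imgbase64")
    if node is None or node.text is None:
        raise ValueError("no imgbase64 element with content found")
    return base64.b64decode(node.text)


if __name__ == "__main__":
    # Hypothetical stand-in for the bytes of a downloaded JPEG file.
    data = b"\xff\xd8\xff\xe0sample"
    xml_text = image_to_xml(data)
    print(xml_to_image(xml_text) == data)
```

This mirrors why the config uses find with path imgbase64 followed by parse: the element's text is the base64 string that the ocr command expects in the register.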
          

Does a captcha on the source website keep you from scraping the data? Learn how to bypass it.