Methods for Navigation

Navigation methods load pages, documents, and files from a website and traverse the DOM structure of the loaded document.

Walk

The walk method loads pages and other documents (JSON, JS, iCal, XML, images) from websites and other web resources. If the downloaded file is in a format other than HTML or XML, the digger automatically converts its content to XML, so you can use the same approach to extract data from heterogeneous resources.

The main points of the walk method:

The method can be called from any context.

It can work with the contents of the register and substitute the values of arguments and variables.

Execution of the block logic can be looped until a certain condition is met.

It can iterate over a link pool.

Custom request headers can be used.

The method can make GET and POST requests.

If a page or document loads successfully, the digger switches to the page context and works with the downloaded content.

Parameters that you can use in the walk method:

Parameter	Description
to	The value that defines which request the digger makes. If the value is a literal, a GET request is executed; if it is a dictionary, a POST request is made. A literal can be the URL of the resource the digger should download, and the URL may contain variables and arguments. To use a URL stored in the register, use the reserved word value. To make the digger iterate over a link pool, use the reserved word links. To make a POST request, build a dictionary with the fields described below and pass it as the value of the to parameter.
headers	A dictionary of headers to send to the server with the request. You can use any standard or non-standard header except User-Agent, which is populated with the value you define in the config section of the digger's configuration.
mode	Enables a mode in which only unique URLs (across all digger sessions) are loaded. To enable it, set this parameter to unique. In this mode, the digger caches all downloaded URLs in the database; the next time a URL is requested, it checks the database to see whether that URL has been fetched before and, if so, skips it. In some cases, this mode helps save the resources (page requests) you pay for.
pool	The name of the link pool. Used only if the reserved word links is used as the value of the to parameter. If this parameter is omitted, the digger uses the default pool.
repeat	A special flag that runs the walk block in a loop while the flag's value is "yes". In practice, a variable initially set to "yes" is used as the flag's value. During the loop, when the digger meets some condition, it changes the variable to another value, breaks out of the loop, and continues executing the code after the walk block.
repeat_in_pool	Works the same as repeat, but for a link pool.
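
As a minimal sketch of the mode parameter described above, the following configuration loads a page only if its URL has not been fetched before. The URL is a placeholder in the style of the other examples in this section, and the inner logic is only illustrative.

```yaml
---
do:
# LOAD THE PAGE ONLY IF THIS URL HAS NOT BEEN FETCHED BEFORE (IN ANY SESSION)
- walk:
    to: http://www.somesite.com/
    # `unique` IS THE VALUE THE PARAMETER TABLE GIVES FOR ENABLING THIS MODE
    mode: unique
    do:
    # ILLUSTRATIVE LOGIC: COLLECT `href` VALUES OF ALL LINKS ON THE PAGE
    - find:
        path: a
        do:
        - parse:
            attr: href
```

According to the description of mode, a URL that was already fetched in an earlier session should be skipped, so the logic inside the walk block would not run for it.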

GET

The following are examples of GET requests with some parameters:

The four examples cover, in order: URL, Links, Repeat, Headers.

---
do:
# LOADING PAGE LOCATED AT SPECIFIED URL AND RUN LOGIC INSIDE THE `walk` BLOCK FOR ITS CONTENT
- walk:
    to: http://www.somesite.com/
    do:
    # FIND ALL LINKS OF THIS PAGE
    - find:
        path: a
        do:
        # PUT VALUE OF `href` ATTRIBUTE TO THE REGISTER
        - parse:
            attr: href
        # LOAD PAGE WITH URL WE HAVE IN REGISTER
        - walk:
            to: value
            do:

---
do:
# ADD URL OF PAGES TO THE LINK POOL WITH NAME `somepool`
- link_add:
    pool: somepool
    url:
    - http://www.somesite.com/page-1/
    - http://www.somesite.com/page-2/
    - http://www.somesite.com/page-3/
# ITERATING OVER POOL (OVER URLS ONE BY ONE)
# FOR EACH URL WE RUN LOGIC INSIDE `walk` BLOCK
- walk:
    to: links
    pool: somepool
    do: 
    - find:
        path: .somepath
        do:

---
do:
# DECLARE VARIABLE `repeatable` AND SET IT TO `yes`
- variable_set:
    field: repeatable
    value: 'yes'
# LET'S IMAGINE THAT THE WEBSITE WE ARE SCRAPING IS NOT STABLE
# AND SOMETIMES DOESN'T RETURN A PROPER PAGE, OR IS JUST NOT AVAILABLE
# LET'S PUT THE `walk` COMMAND IN A LOOP USING VARIABLE `repeatable`
# COMMAND `walk` WILL BE REPEATED UNTIL SPECIFIC CSS PATH `.somepath`
# IS FOUND ON THE LOADED PAGE
- walk:
    repeat: <%repeatable%>
    to: http://www.somesite.com/
    do:
    - find:
        path: .somepath
        do:
        # CSS PATH IS FOUND, LETS CLEAR VARIABLE TO STOP LOOPING `walk` COMMAND
        - variable_clear: repeatable

---
do:
# LOAD PAGE LOCATED AT GIVEN URL WITH COMMAND `walk`
- walk:
    to: http://www.somesite.com/
    # WE ARE GOING TO SEND SOME HEADERS WITH PAGE REQUEST
    headers:
        Cookie: JSESSIONID=1234123412321; OTHERCOOKIE=<%somevar%>;
        Accept: text/xml
    do:
    - find:
        path: .somepath
        do:
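
The description of the to parameter notes that a URL may contain variables. A minimal sketch, assuming a hypothetical variable named page and the same <%name%> substitution syntax that appears in the examples above:

```yaml
---
do:
# DECLARE HYPOTHETICAL VARIABLE `page` (NAME AND VALUE ARE ASSUMPTIONS)
- variable_set:
    field: page
    value: 2
# THE VARIABLE VALUE IS SUBSTITUTED INTO THE URL
- walk:
    to: http://www.somesite.com/page-<%page%>/
    do:
    # ILLUSTRATIVE LOGIC: COLLECT `href` VALUES OF ALL LINKS ON THE PAGE
    - find:
        path: a
        do:
        - parse:
            attr: href
```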

POST

To make a POST request, pass a specially formed dictionary in the to parameter:

Parameter	Description
post	URL of the web resource to which the POST request, with data encoded as X-WWW-FORM-URLENCODED, should be sent.
json	URL of the web resource to which the POST request, with data encoded as APPLICATION/JSON, should be sent.
xml	URL of the web resource to which the POST request, with data encoded as TEXT/XML, should be sent. Data must be provided using the payload parameter only.
graphql	URL of the web resource to which the POST request, with data encoded as APPLICATION/GRAPHQL, should be sent. Data must be provided using the payload parameter only.
headers	A dictionary of headers to send to the server with the request. You can use any standard or non-standard header except User-Agent, which is populated with the value you define in the config section of the digger's configuration. Note that headers for POST requests must be placed in the to scope, not in the root walk scope as for GET requests.
data	A dictionary with all fields and values of the query to send with the request. Field names and values may use variables and arguments for substitution. The maximum nesting level of the dictionary is 2. If your JSON data needs deeper nesting, use the payload parameter.
payload	A string in JSON, XML, or GraphQL format, passed instead of the data parameter for APPLICATION/JSON, TEXT/XML, and APPLICATION/GRAPHQL requests.

A few examples of POST requests follow.

Digger configuration (X-WWW-FORM-URLENCODED)

---
config:
    debug: 2
do:
- walk:
    to:
        post: https://mockbin.org/request
        data:
            fizz: buzz
    do:

Execution log

Time	Level	Message
2017-10-23 22:02:30:452	info	Scrape is done
2017-10-23 22:02:30:436	debug	Page content: <html><head></head><body><body_safe> <bodysize>9</bodysize> <clientipaddress>1.1.1.1</clientipaddress> <cookies></cookies> <headers> <accept-encoding>gzip</accept-encoding> <cf-connecting-ip>1.1.1.1</cf-connecting-ip> <cf-visitor>{"scheme":"https"}</cf-visitor> <connect-time>2</connect-time> <connection>close</connection> <content-length>9</content-length> <content-type>application/x-www-form-urlencoded</content-type> <host>mockbin.org</host> <total-route-time>0</total-route-time> <user-agent>Surf/1.0 (Linux 3.19.0-65-generic; go1.9)</user-agent> <via>1.1 vegur</via> <x-forwarded-for>1.1.1.1, 1.1.1.1</x-forwarded-for> <x-forwarded-port>80</x-forwarded-port> <x-forwarded-proto>http</x-forwarded-proto> <x-request-start>1508785350353</x-request-start> </headers> <headerssize>556</headerssize> <httpversion>HTTP/1.1</httpversion> <method>POST</method> <postdata> <mimetype>application/x-www-form-urlencoded</mimetype> <params> <fizz>buzz</fizz> </params> <text>fizz=buzz</text> </postdata> <querystring></querystring> <starteddatetime>2017-10-23T19:02:30.355Z</starteddatetime> <url>https://mockbin.org/request</url> </body_safe></body></html>
2017-10-23 22:02:29:405	info	Retrieving page (POST): https://mockbin.org/request
2017-10-23 22:02:29:398	info	Starting scrape
2017-10-23 22:02:29:382	debug	Setting up default proxy
2017-10-23 22:02:29:367	debug	Setting up surf
2017-10-23 22:02:29:336	info	Starting digger: meta-lang-post-x-www [1862]

Note that because the mockbin.org server sends its response in JSON format, the digger converted the response to XML.
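
Because the response is converted to XML, its fields can be addressed like ordinary elements. The sketch below assumes, based only on the page content logged above, that the JSON keys appear as lowercase elements inside body_safe. The exact path is an assumption and may need adjusting for other resources.

```yaml
---
do:
- walk:
    to:
        post: https://mockbin.org/request
        data:
            fizz: buzz
    do:
    # ASSUMED PATH, DERIVED FROM THE LOGGED XML: <body_safe> ... <method>POST</method>
    - find:
        path: body_safe > method
        do:
        # LOGIC FOR THE FOUND ELEMENT GOES HERE
```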

Digger configuration (APPLICATION/JSON)

---
config:
    debug: 2
do:
# LET'S INITIALIZE A COUPLE OF VARIABLES
- variable_set:
    field: field_name
    value: age
- variable_set:
    field: field_value
    value: 25
- walk:
    to:
        json: https://mockbin.org/request
        data:
            fizz: buzz
            <%field_name%>: <%field_value%>
    do:

Execution log

Time	Level	Message
2017-10-24 01:31:08:538	info	Scrape is done
2017-10-24 01:31:08:523	debug	Page content: <html><head></head><body><body_safe> <bodysize>26</bodysize> <clientipaddress>1.1.1.1</clientipaddress> <cookies></cookies> <headers> <accept-encoding>gzip</accept-encoding> <cf-connecting-ip>1.1.1.1</cf-connecting-ip> <cf-visitor>{"scheme":"https"}</cf-visitor> <connect-time>1</connect-time> <connection>close</connection> <content-length>26</content-length> <content-type>application/json</content-type> <host>mockbin.org</host> <total-route-time>0</total-route-time> <user-agent>Surf/1.0 (Linux 3.19.0-65-generic; go1.9)</user-agent> <via>1.1 vegur</via> <x-forwarded-for>1.1.1.1, 1.1.1.1</x-forwarded-for> <x-forwarded-port>80</x-forwarded-port> <x-forwarded-proto>http</x-forwarded-proto> <x-request-start>1508797868503</x-request-start> </headers> <headerssize>539</headerssize> <httpversion>HTTP/1.1</httpversion> <method>POST</method> <postdata> <mimetype>application/json</mimetype> <params></params> <text>{"age":"25","fizz":"buzz"}</text> </postdata> <querystring></querystring> <starteddatetime>2017-10-23T22:31:08.509Z</starteddatetime> <url>https://mockbin.org/request</url> </body_safe></body></html>
2017-10-24 01:31:08:052	info	Retrieving page (POST/JSON): https://mockbin.org/request
2017-10-24 01:31:08:044	debug	Variable field_value has been set to value: 25
2017-10-24 01:31:08:035	debug	Variable field_name has been set to value: age
2017-10-24 01:31:08:028	info	Starting scrape
2017-10-24 01:31:08:015	debug	Setting up default proxy
2017-10-24 01:31:08:002	debug	Setting up surf
2017-10-24 01:31:07:971	info	Starting digger: meta-lang-post-json [1863]

Digger configuration (APPLICATION/JSON PAYLOAD)

---
config:
    debug: 2
do:
- variable_set:
    field: age
    value: 25
- walk:
    to:
        json: https://mockbin.org/request
        payload: '{"fizz":"buzz","age":"<%age%>"}'
    do:

Execution log

Time	Level	Message
2017-10-24 02:00:06:387	info	Scrape is done
2017-10-24 02:00:06:374	debug	Page content: <html><head></head><body><body_safe> <bodysize>26</bodysize> <clientipaddress>1.1.1.1</clientipaddress> <cookies></cookies> <headers> <accept-encoding>gzip</accept-encoding> <cf-connecting-ip>1.1.1.1</cf-connecting-ip> <cf-visitor>{"scheme":"https"}</cf-visitor> <connect-time>1</connect-time> <connection>close</connection> <content-length>26</content-length> <content-type>application/json</content-type> <host>mockbin.org</host> <total-route-time>0</total-route-time> <user-agent>Surf/1.0 (Linux 3.19.0-65-generic; go1.9)</user-agent> <via>1.1 vegur</via> <x-forwarded-for>1.1.1.1, 1.1.1.1</x-forwarded-for> <x-forwarded-port>80</x-forwarded-port> <x-forwarded-proto>http</x-forwarded-proto> <x-request-start>1508799606293</x-request-start> </headers> <headerssize>540</headerssize> <httpversion>HTTP/1.1</httpversion> <method>POST</method> <postdata> <mimetype>application/json</mimetype> <params></params> <text>{"fizz":"buzz","age":"25"}</text> </postdata> <querystring></querystring> <starteddatetime>2017-10-23T23:00:06.298Z</starteddatetime> <url>https://mockbin.org/request</url> </body_safe></body></html>
2017-10-24 02:00:05:098	info	Retrieving page (POST/JSON): https://mockbin.org/request
2017-10-24 02:00:05:089	debug	Variable age has been set to value: 25
2017-10-24 02:00:05:081	info	Starting scrape
2017-10-24 02:00:05:069	debug	Setting up default proxy
2017-10-24 02:00:05:062	debug	Setting up surf
2017-10-24 02:00:05:035	info	Starting digger: meta-lang-post-payload [1864]
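
The xml variant and the to-scoped headers from the parameter table are not covered by the examples above. A minimal sketch, assuming the same mockbin.org endpoint, a made-up XML body, and the Accept header borrowed from the GET example (no execution log is shown because this configuration was not taken from the original documentation):

```yaml
---
config:
    debug: 2
do:
- walk:
    to:
        xml: https://mockbin.org/request
        # FOR POST REQUESTS, HEADERS GO INSIDE THE `to` SCOPE
        headers:
            Accept: text/xml
        # XML DATA IS PASSED VIA `payload` ONLY (ILLUSTRATIVE BODY, AN ASSUMPTION)
        payload: '<request><fizz>buzz</fizz></request>'
    do:
    # LOGIC FOR THE RESPONSE GOES HERE
```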

In the next part, we'll cover the find method, which is used to navigate the DOM structure of the loaded document.