Methods for Navigation
Navigation commands load pages, documents, and files from a website, and traverse the DOM structure of the loaded document.
Walk
The walk method loads pages and other documents (JSON, JS, iCal, XML, images) from web resources or websites. If the downloaded file is in a format other than HTML or XML, the digger automatically converts its content into XML, so you can use the same approach to extract data from heterogeneous resources.
The main points of the walk method:
- The method can be called from any context
- It can use the contents of the register, as well as the values of arguments and variables, as substitution data
- The execution of the block logic can be looped until a certain condition is reached
- It can iterate over a link pool
- It can send custom request headers
- It can make GET and POST requests
- If a page or document is loaded successfully, the digger switches to the page context and works with the downloaded content
Parameters that you can use in the walk method:
Parameter | Description |
---|---|
to | The value that defines which request the digger makes. If the value is a literal, a GET request is executed; if it is a dictionary, a POST request is made. A literal can be the URL of the resource that the digger should download, and the URL may contain variables and arguments. To use a URL stored in the register, use the reserved word value. To make the digger iterate over a link pool, use the reserved word links. To make a POST request, build a dictionary with the fields described below and pass it as the value of the to parameter. |
headers | A dictionary of headers that will be sent to the server with the request. You can use any standard or non-standard header except User-Agent, which is populated with the value you define in the config section of the digger's configuration. |
mode | Enables a mode in which only unique URLs (across all digger sessions) are loaded. To enable it, set this parameter to unique. In this mode, the digger caches all downloaded URLs in the database, and the next time you try to access a URL, it checks the database to see whether that URL has been fetched before. If so, the URL is skipped. In some cases, this mode helps to save on the resources (page requests) you pay for. |
pool | The name of the link pool. Used only if the reserved word links is the value of the to parameter. If this parameter is omitted, the digger uses the default pool. |
repeat | A special flag that runs the walk block in a loop for as long as its value is "yes". In practice, a variable is used as the value of this flag and is initially set to "yes". When the digger meets some condition during the loop, it changes the variable to a different value, breaks out of the loop, and continues with the code outside the walk block. |
repeat_in_pool | Works the same way as repeat, but for a link pool. |
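As a minimal sketch of the substitution described for the to parameter, the fragment below builds a URL from a variable. The variable name `page` and the URL are placeholders chosen for illustration; only `variable_set`, `walk`, and the `<%...%>` substitution syntax come from this page.

```yaml
---
do:
# HYPOTHETICAL VARIABLE `page`, USED ONLY FOR ILLUSTRATION
- variable_set:
    field: page
    value: 2
# THE VARIABLE VALUE IS SUBSTITUTED INTO THE URL OF THE GET REQUEST
- walk:
    to: http://www.somesite.com/page-<%page%>/
    do:
    - find:
        path: .somepath
        do:
```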
GET
The following are examples of GET requests with some parameters:
---
do:
# LOADING PAGE LOCATED AT SPECIFIED URL AND RUN LOGIC INSIDE THE `walk` BLOCK FOR ITS CONTENT
- walk:
    to: http://www.somesite.com/
    do:
    # FIND ALL LINKS OF THIS PAGE
    - find:
        path: a
        do:
        # PUT VALUE OF `href` ATTRIBUTE TO THE REGISTER
        - parse:
            attr: href
        # LOAD PAGE WITH URL WE HAVE IN REGISTER
        - walk:
            to: value
            do:
---
do:
# ADD URL OF PAGES TO THE LINK POOL WITH NAME `somepool`
- link_add:
    pool: somepool
    url:
    - http://www.somesite.com/page-1/
    - http://www.somesite.com/page-2/
    - http://www.somesite.com/page-3/
# ITERATING OVER POOL (OVER URLS ONE BY ONE)
# FOR EACH URL WE RUN LOGIC INSIDE `walk` BLOCK
- walk:
    to: links
    pool: somepool
    do:
    - find:
        path: .somepath
        do:
---
do:
# DECLARE VARIABLE `repeatable` AND SET IT TO `yes`
- variable_set:
    field: repeatable
    value: 'yes'
# LET'S IMAGINE THAT THE WEBSITE WE ARE SCRAPING IS NOT STABLE
# AND SOMETIMES DOESN'T RETURN A PROPER PAGE, OR IS JUST NOT AVAILABLE
# LET'S PUT THE `walk` COMMAND IN A LOOP USING VARIABLE `repeatable`
# COMMAND `walk` WILL BE REPEATED UNTIL SPECIFIC CSS PATH `.somepath`
# IS FOUND ON THE LOADED PAGE
- walk:
    repeat: <%repeatable%>
    to: http://www.somesite.com/
    do:
    - find:
        path: .somepath
        do:
        # CSS PATH IS FOUND, LET'S CLEAR VARIABLE TO STOP LOOPING `walk` COMMAND
        - variable_clear: repeatable
---
do:
# LOAD PAGE LOCATED AT GIVEN URL WITH COMMAND `walk`
- walk:
    to: http://www.somesite.com/
    # WE ARE GOING TO SEND SOME HEADERS WITH PAGE REQUEST
    headers:
        Cookie: JSESSIONID=1234123412321; OTHERCOOKIE=<%somevar%>;
        Accept: text/xml
    do:
    - find:
        path: .somepath
        do:
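The mode parameter from the table above has no example of its own, so here is a sketch. It assumes that mode sits at the same level as to and pool inside the walk block (the placement is an assumption; the value unique is taken from the parameter description). With it, URLs from the pool that were fetched in an earlier session would be skipped.

```yaml
---
do:
- link_add:
    pool: somepool
    url:
    - http://www.somesite.com/page-1/
    - http://www.somesite.com/page-2/
- walk:
    to: links
    pool: somepool
    # ASSUMED PLACEMENT: `mode` AT THE SAME LEVEL AS `to` AND `pool`
    mode: unique
    do:
    - find:
        path: .somepath
        do:
```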
POST
To make a POST request, pass a specially formed dictionary as the value of the to parameter. The dictionary can contain the following fields:
Parameter | Description |
---|---|
post | URL of the web resource to which the POST request should be sent, with data encoded as application/x-www-form-urlencoded. |
json | URL of the web resource to which the POST request should be sent, with data encoded as application/json. |
xml | URL of the web resource to which the POST request should be sent, with data encoded as text/xml. Data must be provided using the payload parameter only. |
graphql | URL of the web resource to which the POST request should be sent, with data encoded as application/graphql. Data must be provided using the payload parameter only. |
headers | A dictionary of headers that will be sent to the server with the request. You can use any standard or non-standard header except User-Agent, which is populated with the value you define in the config section of the digger's configuration. Note that headers for POST requests must be placed in the to scope, not in the root walk scope as for GET requests. |
data | A dictionary with all fields and values that should be sent with the request. Field names and values may use variables and arguments for substitution. The maximum nesting level of the dictionary is 2. If your JSON data needs deeper nesting, use the payload parameter. |
payload | A string in JSON, XML, or GraphQL format, which is passed instead of the data parameter for application/json, text/xml, and application/graphql requests. |
Here are a few examples of POST requests, each followed by the digger's debug log.
---
config:
    debug: 2
do:
- walk:
    to:
        post: https://mockbin.org/request
        data:
            fizz: buzz
    do:
Time | Level | Message |
---|---|---|
2017-10-23 22:02:30:452 | info | Scrape is done |
2017-10-23 22:02:30:436 | debug | Page content: <html><head></head><body><body_safe>
<bodysize>9</bodysize>
<clientipaddress>1.1.1.1</clientipaddress>
<cookies></cookies>
<headers>
<accept-encoding>gzip</accept-encoding>
<cf-connecting-ip>1.1.1.1</cf-connecting-ip>
<cf-visitor>{"scheme":"https"}</cf-visitor>
<connect-time>2</connect-time>
<connection>close</connection>
<content-length>9</content-length>
<content-type>application/x-www-form-urlencoded</content-type>
<host>mockbin.org</host>
<total-route-time>0</total-route-time>
<user-agent>Surf/1.0 (Linux 3.19.0-65-generic; go1.9)</user-agent>
<via>1.1 vegur</via>
<x-forwarded-for>1.1.1.1, 1.1.1.1</x-forwarded-for>
<x-forwarded-port>80</x-forwarded-port>
<x-forwarded-proto>http</x-forwarded-proto>
<x-request-start>1508785350353</x-request-start>
</headers>
<headerssize>556</headerssize>
<httpversion>HTTP/1.1</httpversion> <method>POST</method> <postdata> <mimetype>application/x-www-form-urlencoded</mimetype> <params> <fizz>buzz</fizz> </params> <text>fizz=buzz</text> </postdata> <querystring></querystring> <starteddatetime>2017-10-23T19:02:30.355Z</starteddatetime> <url>https://mockbin.org/request</url> </body_safe></body></html> |
2017-10-23 22:02:29:405 | info | Retrieving page (POST): https://mockbin.org/request |
2017-10-23 22:02:29:398 | info | Starting scrape |
2017-10-23 22:02:29:382 | debug | Setting up default proxy |
2017-10-23 22:02:29:367 | debug | Setting up surf |
2017-10-23 22:02:29:336 | info | Starting digger: meta-lang-post-x-www [1862] |
Note that because the mockbin.org server returns its response in JSON format, the digger converted the response to XML.
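Because the response is converted to XML, its fields can be reached with the same commands used for HTML pages. The following sketch assumes that the converted elements can be selected by tag name with a CSS path; the element names `postdata` and `text` are taken from the debug output above, but the exact selector behavior is not confirmed here.

```yaml
---
do:
- walk:
    to:
        post: https://mockbin.org/request
        data:
            fizz: buzz
    do:
    # `postdata` AND `text` ARE ELEMENT NAMES SEEN IN THE CONVERTED RESPONSE
    # ASSUMPTION: THEY CAN BE SELECTED WITH A REGULAR CSS PATH
    - find:
        path: postdata > text
        do:
        # EXTRACTION LOGIC FOR THE FOUND ELEMENT GOES HERE
```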
---
config:
    debug: 2
do:
# LET'S INITIALIZE A COUPLE OF VARIABLES
- variable_set:
    field: field_name
    value: age
- variable_set:
    field: field_value
    value: 25
- walk:
    to:
        json: https://mockbin.org/request
        data:
            fizz: buzz
            <%field_name%>: <%field_value%>
    do:
Time | Level | Message |
---|---|---|
2017-10-24 01:31:08:538 | info | Scrape is done |
2017-10-24 01:31:08:523 | debug | Page content: <html><head></head><body><body_safe>
<bodysize>26</bodysize>
<clientipaddress>1.1.1.1</clientipaddress>
<cookies></cookies>
<headers>
<accept-encoding>gzip</accept-encoding>
<cf-connecting-ip>1.1.1.1</cf-connecting-ip>
<cf-visitor>{"scheme":"https"}</cf-visitor>
<connect-time>1</connect-time>
<connection>close</connection>
<content-length>26</content-length>
<content-type>application/json</content-type>
<host>mockbin.org</host>
<total-route-time>0</total-route-time>
<user-agent>Surf/1.0 (Linux 3.19.0-65-generic; go1.9)</user-agent>
<via>1.1 vegur</via>
<x-forwarded-for>1.1.1.1, 1.1.1.1</x-forwarded-for>
<x-forwarded-port>80</x-forwarded-port>
<x-forwarded-proto>http</x-forwarded-proto>
<x-request-start>1508797868503</x-request-start>
</headers>
<headerssize>539</headerssize>
<httpversion>HTTP/1.1</httpversion> <method>POST</method> <postdata> <mimetype>application/json</mimetype> <params></params> <text>{"age":"25","fizz":"buzz"}</text> </postdata> <querystring></querystring> <starteddatetime>2017-10-23T22:31:08.509Z</starteddatetime> <url>https://mockbin.org/request</url> </body_safe></body></html> |
2017-10-24 01:31:08:052 | info | Retrieving page (POST/JSON): https://mockbin.org/request |
2017-10-24 01:31:08:044 | debug | Variable field_value has been set to value: 25 |
2017-10-24 01:31:08:035 | debug | Variable field_name has been set to value: age |
2017-10-24 01:31:08:028 | info | Starting scrape |
2017-10-24 01:31:08:015 | debug | Setting up default proxy |
2017-10-24 01:31:08:002 | debug | Setting up surf |
2017-10-24 01:31:07:971 | info | Starting digger: meta-lang-post-json [1863] |
---
config:
    debug: 2
do:
- variable_set:
    field: age
    value: 25
- walk:
    to:
        json: https://mockbin.org/request
        payload: '{"fizz":"buzz","age":"<%age%>"}'
    do:
Time | Level | Message |
---|---|---|
2017-10-24 02:00:06:387 | info | Scrape is done |
2017-10-24 02:00:06:374 | debug | Page content: <html><head></head><body><body_safe>
<bodysize>26</bodysize>
<clientipaddress>1.1.1.1</clientipaddress>
<cookies></cookies>
<headers>
<accept-encoding>gzip</accept-encoding>
<cf-connecting-ip>1.1.1.1</cf-connecting-ip>
<cf-visitor>{"scheme":"https"}</cf-visitor>
<connect-time>1</connect-time>
<connection>close</connection>
<content-length>26</content-length>
<content-type>application/json</content-type>
<host>mockbin.org</host>
<total-route-time>0</total-route-time>
<user-agent>Surf/1.0 (Linux 3.19.0-65-generic; go1.9)</user-agent>
<via>1.1 vegur</via>
<x-forwarded-for>1.1.1.1, 1.1.1.1</x-forwarded-for>
<x-forwarded-port>80</x-forwarded-port>
<x-forwarded-proto>http</x-forwarded-proto>
<x-request-start>1508799606293</x-request-start>
</headers>
<headerssize>540</headerssize>
<httpversion>HTTP/1.1</httpversion> <method>POST</method> <postdata> <mimetype>application/json</mimetype> <params></params> <text>{"fizz":"buzz","age":"25"}</text> </postdata> <querystring></querystring> <starteddatetime>2017-10-23T23:00:06.298Z</starteddatetime> <url>https://mockbin.org/request</url> </body_safe></body></html> |
2017-10-24 02:00:05:098 | info | Retrieving page (POST/JSON): https://mockbin.org/request |
2017-10-24 02:00:05:089 | debug | Variable age has been set to value: 25 |
2017-10-24 02:00:05:081 | info | Starting scrape |
2017-10-24 02:00:05:069 | debug | Setting up default proxy |
2017-10-24 02:00:05:062 | debug | Setting up surf |
2017-10-24 02:00:05:035 | info | Starting digger: meta-lang-post-payload [1864] |
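None of the examples above sends headers with a POST request or uses the xml field, so here is a sketch of how those fields might fit together. It assumes that headers and payload sit next to the xml key inside the to dictionary, as the parameter table describes. The endpoint is the same mockbin.org URL used in the other examples; the header value and the XML body are made up for illustration.

```yaml
---
do:
- walk:
    to:
        xml: https://mockbin.org/request
        # FOR POST REQUESTS, HEADERS GO INSIDE `to`, NOT IN THE ROOT `walk` SCOPE
        headers:
            Accept: text/xml
        # ILLUSTRATIVE XML BODY, PASSED AS A STRING
        payload: '<request><fizz>buzz</fizz></request>'
    do:
```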
In the next part, we'll look at the find method, which is used to navigate the DOM structure of the loaded document.