Conditional Flow
Using If
When you scrape a site, you often run into situations where some operations should be performed only if a certain condition holds. To check a condition, use the if command. It works with string, integer, and floating-point values. The command compares the value of the register with a value given explicitly as a parameter, so it can only be used in a block context.
You can use the following parameters:
Parameter | Description |
---|---|
match, eq, gt, lt, nlt | Comparison mode: match - checks whether the value of the register contains a match for the passed parameter, which is treated as a regular expression; eq - checks whether the value of the register is equal to the passed parameter; gt - checks whether the value of the register is greater than the passed parameter; lt - checks whether the value of the register is less than the passed parameter; nlt - checks whether the value of the register is not less than (greater than or equal to) the passed parameter. |
type | Type of the values being compared: string - compare as strings; int - compare as integers; float - compare as floating-point numbers. If omitted, the string type is used. |
do | A block of commands to execute if the condition is met. Optional parameter. |
else | A block of commands to execute if the condition is not met. Optional parameter. |
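
To make the table concrete, here is a minimal Python sketch of how the five modes could be evaluated. It is only a model of the behavior described above, not the tool's implementation: the function name `condition_met`, the `CASTS` table, and the choice of a plain regular-expression search for `match` are assumptions made for illustration.

```python
import re

# Hypothetical model of the `if` comparison; not the scraper's actual code.
CASTS = {"string": str, "int": int, "float": float}


def condition_met(register, mode, param, type_="string"):
    """Return True when the `do` block would run, False when `else` would."""
    if mode == "match":
        # `match` searches the register for the parameter as a regular expression.
        return re.search(str(param), register) is not None
    # All other modes convert both sides to the requested type first.
    cast = CASTS[type_]
    left, right = cast(register), cast(param)
    if mode == "eq":
        return left == right
    if mode == "gt":
        return left > right
    if mode == "lt":
        return left < right
    if mode == "nlt":
        return left >= right
    raise ValueError(f"unknown mode: {mode}")


print(condition_met("Some text", "match", "text"))  # True
print(condition_met("3", "gt", 2, "int"))           # True
print(condition_met("10", "gt", 9, "int"))          # True
print(condition_met("10", "gt", 9))                 # False
```

In this sketch the last call compares the strings "10" and "9" character by character, which is why it returns False. Whether the real tool orders strings the same way is not stated here, so pass `type: int` or `type: float` whenever the register holds a number.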
Examples of if notation:
# COMPARISON USING REGULAR EXPRESSION
- if:
    match: regex
    # IF MATCHED REGEX - BLOCK `do` WILL BE EXECUTED
    do:
    ..
    ..
    # IF NOT - BLOCK `else` WILL BE EXECUTED
    else:
    ..
    ..
# COMPARISON USING REGULAR EXPRESSION
# BUT ONLY `do` USED, WITHOUT `else` LOGIC BLOCK
# WORKS SAME WAY IN OTHER MODES
- if:
    match: regex
    # IF MATCHED REGEX - BLOCK `do` WILL BE EXECUTED
    do:
    ..
    ..
# COMPARISON USING REGULAR EXPRESSION
# BUT ONLY `else` USED
# WORKS SAME WAY IN OTHER MODES
- if:
    match: regex
    # IF NOT MATCHED - BLOCK `else` WILL BE EXECUTED
    else:
    ..
    ..
# COMPARISON OF INTEGER VALUES
- if:
    eq: 0
    type: int
    # IF TRUE - BLOCK `do` WILL BE EXECUTED
    do:
    ..
    ..
    # IF FALSE - BLOCK `else` WILL BE EXECUTED
    else:
    ..
    ..
# COMPARISON OF INTEGER VALUES
- if:
    gt: 0
    type: int
    # IF REGISTER VALUE IS GREATER THAN 0, BLOCK `do` WILL BE EXECUTED
    do:
    ..
    ..
    # IN OTHER CASE - BLOCK `else` WILL BE EXECUTED
    else:
    ..
    ..
- if:
    lt: 0
    type: int
    # IF REGISTER VALUE IS LESS THAN 0, BLOCK `do` WILL BE EXECUTED
    do:
    ..
    ..
    # IN OTHER CASE - BLOCK `else` WILL BE EXECUTED
    else:
    ..
    ..
- if:
    nlt: 0
    type: int
    # IF REGISTER VALUE IS NOT LESS THAN 0, BLOCK `do` WILL BE EXECUTED
    do:
    ..
    ..
    # IN OTHER CASE - BLOCK `else` WILL BE EXECUTED
    else:
    ..
    ..
Let's look at different ways of using the if command, based on the following HTML source:
<ul class="list">
<li class="list-item" id="1">Some text</li>
<li class="list-item" id="item=2"><a href="http://somesite.com/">Link</a></li>
<li class="list-item" id="item=3">Some other text</li>
</ul>
Examples of match mode usage:
# FIND ALL `li`
- find:
    path: li
    do:
    - parse
    # CHECK IF THERE IS WORD `text` IN THE REGISTER
    - if:
        match: text
        do:
        # IF TRUE, SET OBJECT FIELD WITH THE VALUE OF THE REGISTER
        - object_field_set:
            object: someobj
            field: somefield
# FIND ALL `li`
- find:
    path: li
    do:
    - parse
    # CHECK IF THERE IS WORD `text` IN THE REGISTER
    - if:
        match: text
        # IF NOT FOUND, FIND `a`
        else:
        - find:
            path: a
            do:
            # PARSE ATTRIBUTE `href` TO THE REGISTER
            - parse:
                attr: href
            # NORMALIZE URL
            - normalize:
                routine: url
            # LOAD PAGE LOCATED AT THAT URL
            - walk:
                to: value
                do:
                ..
                ..
# FIND ALL `li`
- find:
    path: li
    do:
    - parse
    # CHECK IF THERE IS WORD `text` IN THE REGISTER
    - if:
        match: text
        do:
        # IF TRUE, SET OBJECT FIELD WITH THE VALUE OF THE REGISTER
        - object_field_set:
            object: someobj
            field: somefield
        # IF NOT FOUND, FIND `a`
        else:
        - find:
            path: a
            do:
            # PARSE ATTRIBUTE `href` TO THE REGISTER
            - parse:
                attr: href
            # NORMALIZE URL
            - normalize:
                routine: url
            # LOAD PAGE LOCATED AT THAT URL
            - walk:
                to: value
                do:
                ..
                ..
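
To see which branch each list item takes in the match examples, the check can be traced by hand. The short Python sketch below does that for the sample HTML. It assumes that `parse` without parameters puts the text of each `li` into the register and that `match` behaves like a regular-expression search; both are assumptions of the sketch, and the variable names are illustrative only.

```python
import re

# Text of each <li> in the sample HTML, assumed to be what `parse` puts in the register.
register_values = ["Some text", "Link", "Some other text"]

# `match: text` modeled as a regular-expression search (an assumption of this sketch).
branches = ["do" if re.search("text", value) else "else" for value in register_values]

for value, branch in zip(register_values, branches):
    print(f"{value!r} -> {branch}")
# 'Some text' -> do
# 'Link' -> else
# 'Some other text' -> do
```

Under these assumptions the first and third items reach `object_field_set`, while the second item, which contains only a link, falls through to the `else` block where the `a` element is found and followed.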
Examples of gt, lt, nlt, and eq mode usage:
# FIND ALL `li`
- find:
    path: li
    do:
    # PARSE ATTRIBUTE `id` VALUE AND EXTRACT ONLY DIGITS
    - parse:
        attr: id
        filter:
        - (\d+)
    # CHECK IF VALUE OF THE REGISTER IS GREATER THAN `2`
    - if:
        gt: 2
        # SPECIFY TYPE `integer`
        type: int
        do:
        # IF TRUE, SET THE OBJECT FIELD TO THE REGISTER VALUE
        - object_field_set:
            object: someobj
            field: somefield
# FIND ALL `li`
- find:
    path: li
    do:
    # PARSE ATTRIBUTE `id` VALUE AND EXTRACT ONLY DIGITS
    - parse:
        attr: id
        filter:
        - (\d+)
    # CHECK IF VALUE OF THE REGISTER IS LESS THAN `2`
    - if:
        lt: 2
        # SPECIFY TYPE `integer`
        type: int
        do:
        # IF TRUE, SET THE OBJECT FIELD TO THE REGISTER VALUE
        - object_field_set:
            object: someobj
            field: somefield
# FIND ALL `li`
- find:
    path: li
    do:
    # PARSE ATTRIBUTE `id` VALUE AND EXTRACT ONLY DIGITS
    - parse:
        attr: id
        filter:
        - (\d+)
    # CHECK IF VALUE OF THE REGISTER IS NOT LESS THAN (GREATER THAN OR EQUAL TO) `2`
    - if:
        nlt: 2
        # SPECIFY TYPE `integer`
        type: int
        do:
        # IF TRUE, SET THE OBJECT FIELD TO THE REGISTER VALUE
        - object_field_set:
            object: someobj
            field: somefield
# FIND ALL `li`
- find:
    path: li
    do:
    # PARSE ATTRIBUTE `id` VALUE AND EXTRACT ONLY DIGITS
    - parse:
        attr: id
        filter:
        - (\d+)
    # CHECK IF VALUE OF THE REGISTER IS EQUAL TO `1`
    - if:
        eq: 1
        # SPECIFY TYPE `integer`
        type: int
        else:
        # IF FALSE, SET THE OBJECT FIELD TO THE REGISTER VALUE
        - object_field_set:
            object: someobj
            field: somefield
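
The four numeric examples can be checked by hand against the sample HTML. The Python sketch below models them under two assumptions: the `(\d+)` filter leaves only the digits of each `id` in the register, and `type: int` compares those digits as integers. The names `ids`, `values`, and `saved` exist only in this sketch.

```python
import re

# `id` attributes of the three <li> elements in the sample HTML.
ids = ["1", "item=2", "item=3"]

# Model of `filter: (\d+)` followed by an integer comparison (assumed semantics).
values = [int(re.search(r"(\d+)", raw).group(1)) for raw in ids]

# For each example, the values that would reach `object_field_set`.
saved = {
    "gt: 2 (do)": [v for v in values if v > 2],
    "lt: 2 (do)": [v for v in values if v < 2],
    "nlt: 2 (do)": [v for v in values if v >= 2],
    "eq: 1 (else)": [v for v in values if not v == 1],
}

for case, kept in saved.items():
    print(case, kept)
# gt: 2 (do) [3]
# lt: 2 (do) [1]
# nlt: 2 (do) [2, 3]
# eq: 1 (else) [2, 3]
```

Note that the last example stores the value in the `else` block, so it keeps every item whose id is not 1.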
In the next chapter, you will learn how to use optical character recognition (OCR) to extract text from images.