Basic Settings
Tables Normalization
You have probably come across tables in which several cells or rows are merged into a single cell. Merging makes a table easier for people to read but harder for a program to process. It is done with the rowspan and colspan attributes, and it makes it very difficult to write one common piece of logic that walks through every row of such a table.
To simplify this task, we introduced a special parameter, normalize_tables. When it is enabled, the engine automatically splits all merged cells and copies the original content into each of the resulting cells. The table then has a stable structure, so you can apply the same logic to every row.
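To make the idea of splitting merged cells concrete, here is a minimal Python sketch of the general technique. It is not the engine's actual implementation: the function name `normalize_table` and the `(text, rowspan, colspan)` tuple representation are assumptions chosen for illustration only. The two inputs model a table whose first cell spans two rows and a table whose first cell spans two columns.

```python
def normalize_table(rows):
    """Expand merged cells so every row has one entry per column.

    rows: list of rows; each row is a list of (text, rowspan, colspan)
    tuples in the order the <td> elements appear in the HTML.
    This representation is hypothetical, used only for this sketch.
    """
    grid = {}  # (row_index, col_index) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the same content into every position the cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell are left as empty strings.
    return [[grid.get((r, c), "") for c in range(n_cols)]
            for r in range(n_rows)]


# First cell spans two rows (rowspan=2).
table_1 = [[("Cell 1", 2, 1), ("Cell 2", 1, 1)],
           [("Cell 3", 1, 1)]]
# First cell spans two columns (colspan=2).
table_2 = [[("Cell 1", 1, 2)],
           [("Cell 2", 1, 1), ("Cell 3", 1, 1)]]

print(normalize_table(table_1))
print(normalize_table(table_2))
```

After expansion every row has the same number of cells, so a single per-row routine (for example, "column 0 is the group, column 1 is the value") works for the whole table.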
The following sample configuration enables table normalization:
---
config:
    debug: 2
    agent: Firefox
    # TURNING ON AUTOMATED TABLES NORMALIZATION
    normalize_tables: yes
do:
- walk:
    to: https://www.diggernaut.com/sandbox/meta-lang-normalize-tables-en.html
    do:
    - find:
        path: '#table-1'
        do:
    - find:
        path: '#table-2'
        do:
HTML source of the sample page:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Diggernaut | Meta-language | Normalize tables sample</title>
</head>
<body>
    <table id="table-1" width="200" border="1" cellpadding="4" cellspacing="0">
        <tr>
            <td rowspan="2" bgcolor="#FBF0DB">Cell 1</td>
            <td>Cell 2</td>
        </tr>
        <tr>
            <td>Cell 3</td>
        </tr>
    </table>
    <br/>
    <table id="table-2" width="200" border="1" cellpadding="4" cellspacing="0">
        <tr>
            <td colspan="2" bgcolor="#FBF0DB">Cell 1</td>
        </tr>
        <tr>
            <td>Cell 2</td>
            <td>Cell 3</td>
        </tr>
    </table>
</body>
</html>
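Running the digger produces the following log (newest entries first). In the page and block content, each merged cell now appears once per row or column it used to span, with rowspan and colspan set to 1:

Time | Level | Message |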
---|---|---|
2017-10-22 22:53:53:843 | info | Scrape is done |
2017-10-22 22:53:53:836 | debug | Block content: <tbody> <tr> <td colspan="1" bgcolor="#FBF0DB">Cell 1</td> <td colspan="1" bgcolor="#FBF0DB">Cell 1</td> </tr> <tr> <td>Cell 2</td> <td>Cell 3</td> </tr> </tbody> |
2017-10-22 22:53:53:828 | debug | Number of found blocks: 1 |
2017-10-22 22:53:53:821 | debug | Looking for: #table-2 |
2017-10-22 22:53:53:814 | debug | Block content: <tbody> <tr> <td rowspan="1" bgcolor="#FBF0DB">Cell 1</td> <td>Cell 2</td> </tr> <tr> <td rowspan="1" bgcolor="#FBF0DB">Cell 1</td> <td>Cell 3</td> </tr> </tbody> |
2017-10-22 22:53:53:808 | debug | Number of found blocks: 1 |
2017-10-22 22:53:53:801 | debug | Looking for: #table-1 |
2017-10-22 22:53:53:790 | debug | Page content: <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"/> <title>Diggernaut | Meta-language | Normalize tables sample</title> </head> <body> <table id="table-1" width="200" border="1" cellpadding="4" cellspacing="0"> <tbody> <tr> <td rowspan="1" bgcolor="#FBF0DB">Cell 1</td> <td>Cell 2</td> </tr> <tr> <td rowspan="1" bgcolor="#FBF0DB">Cell 1</td> <td>Cell 3</td> </tr> </tbody> </table> <br/> <table id="table-2" width="200" border="1" cellpadding="4" cellspacing="0"> <tbody> <tr> <td colspan="1" bgcolor="#FBF0DB">Cell 1</td> <td colspan="1" bgcolor="#FBF0DB">Cell 1</td> </tr> <tr> <td>Cell 2</td> <td>Cell 3</td> </tr> </tbody> </table> </body> </html> |
2017-10-22 22:53:52:945 | info | Retrieving page (GET): https://www.diggernaut.com/sandbox/meta-lang-normalize-tables-en.html |
2017-10-22 22:53:52:937 | info | Starting scrape |
2017-10-22 22:53:52:920 | debug | Setting up default proxy |
2017-10-22 22:53:52:906 | debug | Setting up surf |
2017-10-22 22:53:52:880 | info | Starting digger: meta-lang-normalize-tables [1857] |
Table normalization is turned off by default.
Please note!
To enable the option, you must set its value to yes without quotes, exactly as in normalize_tables: yes.