Basic Settings

Tables Normalization

You have probably come across tables in which several cells or rows are merged into a single cell. Merging makes a table easier for humans to read but harder for a program to process. It is done with the rowspan and colspan attributes, and it makes it very difficult to write common logic that walks through every row of such a table.

To simplify this task, we introduced a special parameter, normalize_tables. When it is enabled, the engine automatically splits all merged cells and copies the original content into each of the resulting cells. You can then build your logic against a stable table structure, which is much easier because the same logic works for every row.
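
To make the idea concrete, here is a minimal Python sketch of what splitting merged cells means. It is only an illustration and not the engine's actual implementation: the function name normalize_table and the (text, rowspan, colspan) tuple representation are assumptions made for this example.

```python
def normalize_table(rows):
    """Expand merged cells so that every row has one entry per column.

    `rows` is a list of rows; each cell is a (text, rowspan, colspan) tuple.
    This representation is hypothetical and chosen only for illustration.
    Returns a list of rows of plain text, with the text of each merged cell
    repeated in every position the cell covered.
    """
    grid = {}  # (row_index, col_index) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell's text into every position it spans.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions that no cell covered are filled with an empty string.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# A table whose first cell spans two rows (rowspan="2").
table_1 = [
    [("Cell 1", 2, 1), ("Cell 2", 1, 1)],
    [("Cell 3", 1, 1)],
]

# A table whose first cell spans two columns (colspan="2").
table_2 = [
    [("Cell 1", 1, 2)],
    [("Cell 2", 1, 1), ("Cell 3", 1, 1)],
]

print(normalize_table(table_1))  # [['Cell 1', 'Cell 2'], ['Cell 1', 'Cell 3']]
print(normalize_table(table_2))  # [['Cell 1', 'Cell 1'], ['Cell 2', 'Cell 3']]
```

After normalization every row has the same number of cells, so one piece of row-handling logic can be applied to the whole table.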

Sample configuration that uses the table normalization mechanism:

---
config:
    debug: 2
    agent: Firefox
    # TURNING ON AUTOMATED TABLES NORMALIZATION
    normalize_tables: yes
do:
- walk:
    to: https://www.diggernaut.com/sandbox/meta-lang-normalize-tables-en.html
    do:
    - find:
        path: '#table-1'
        do:
    - find:
        path: '#table-2'
        do:
              
HTML source of the sample page:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Diggernaut | Meta-language | Normalize tables sample</title>
</head>
<body>

<table id="table-1" width="200" border="1" cellpadding="4" cellspacing="0">
  <tr>
    <td rowspan="2" bgcolor="#FBF0DB">Cell 1</td>
    <td>Cell 2</td>
  </tr>
  <tr>
    <td>Cell 3</td>
  </tr>
</table>

<br/>

<table id="table-2" width="200" border="1" cellpadding="4" cellspacing="0">
  <tr>
    <td colspan="2" bgcolor="#FBF0DB">Cell 1</td>
  </tr>
  <tr>
    <td>Cell 2</td>
    <td>Cell 3</td>
  </tr>
</table>

</body>
</html>
              
Resulting log (newest entries first):

Time Level Message
2017-10-22 22:53:53:843 info Scrape is done
2017-10-22 22:53:53:836 debug Block content: <tbody>
<tr>
<td colspan="1" bgcolor="#FBF0DB">Cell 1</td>
<td colspan="1" bgcolor="#FBF0DB">Cell 1</td>
</tr> <tr>
<td>Cell 2</td>
<td>Cell 3</td>
</tr>
</tbody>
2017-10-22 22:53:53:828 debug Number of found blocks: 1
2017-10-22 22:53:53:821 debug Looking for: #table-2
2017-10-22 22:53:53:814 debug Block content: <tbody>
<tr>
<td rowspan="1" bgcolor="#FBF0DB">Cell 1</td>
<td>Cell 2</td>
</tr>
<tr>
<td rowspan="1" bgcolor="#FBF0DB">Cell 1</td>
<td>Cell 3</td>
</tr>
</tbody>
2017-10-22 22:53:53:808 debug Number of found blocks: 1
2017-10-22 22:53:53:801 debug Looking for: #table-1
2017-10-22 22:53:53:790 debug Page content: <!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8"/>
<title>Diggernaut | Meta-language | Normalize tables sample</title>
</head>
<body>
<table id="table-1" width="200" border="1" cellpadding="4" cellspacing="0">
<tbody>
<tr>
<td rowspan="1" bgcolor="#FBF0DB">Cell 1</td>
<td>Cell 2</td>
</tr>
<tr>
<td rowspan="1" bgcolor="#FBF0DB">Cell 1</td>
<td>Cell 3</td>
</tr>
</tbody>
</table>
<br/>
<table id="table-2" width="200" border="1" cellpadding="4" cellspacing="0">
<tbody>
<tr>
<td colspan="1" bgcolor="#FBF0DB">Cell 1</td>
<td colspan="1" bgcolor="#FBF0DB">Cell 1</td>
</tr>
<tr>
<td>Cell 2</td>
<td>Cell 3</td>
</tr>
</tbody>
</table>
</body>
</html>
2017-10-22 22:53:52:945 info Retrieving page (GET): https://www.diggernaut.com/sandbox/meta-lang-normalize-tables-en.html
2017-10-22 22:53:52:937 info Starting scrape
2017-10-22 22:53:52:920 debug Setting up default proxy
2017-10-22 22:53:52:906 debug Setting up surf
2017-10-22 22:53:52:880 info Starting digger: meta-lang-normalize-tables [1857]

Table normalization is turned off by default.

Please note!
To enable the option, you must write the value yes without quotes (normalize_tables: yes, not normalize_tables: 'yes').