Runtime Entities

Link Pool

A link pool is a list of URLs.
You can use it, for example, to build a list of catalog pages and then iterate over that list. Because every URL in a pool is unique and cannot be duplicated, a pool is also a convenient place to collect product page URLs from all pages of the catalog; you can then iterate over the pool and collect each product once. This avoids duplicate products, since in most stores the same product can appear in several categories. It saves your resources, keeps your data set cleaner, and reduces the load on the source site.

The five most important points about link pools:

  1. The command that adds URLs to a pool can take its value from the register, from a string value, or from a compound value with arguments or variables
  2. Pools are context-independent: they exist in all contexts and can be used in any of them
  3. The number of pools is not limited
  4. URLs added to a pool are automatically normalized
  5. All values (URLs) in a pool are unique, so the same address cannot appear twice in the list
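
The uniqueness and normalization points above can be illustrated outside the meta-language with a small Python model. This is only a sketch under stated assumptions, not how the platform implements pools: the `LinkPools` class, the `normalize_url` helper, the `shop.example` addresses, and the normalization rules themselves (trim whitespace, lowercase the scheme and host, drop the fragment) are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Assumed normalization rules, chosen for illustration only:
    trim whitespace, lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


class LinkPools:
    """Hypothetical model of named link pools shared by every context."""

    def __init__(self):
        # pool name -> insertion-ordered dict used as an ordered set of URLs
        self._pools = {}

    def add(self, pool, url):
        """Normalize the URL and add it; return False if it was already present."""
        links = self._pools.setdefault(pool, {})
        normalized = normalize_url(url)
        if normalized in links:
            return False
        links[normalized] = None
        return True

    def links(self, pool):
        """Return the URLs of a pool in the order they were first added."""
        return list(self._pools.get(pool, {}))


pools = LinkPools()
# The same product reached through two catalog categories is stored once.
pools.add("products", "https://shop.example/item/42")
pools.add("products", "HTTPS://Shop.Example/item/42#reviews")
pools.add("products", "https://shop.example/item/7")
print(pools.links("products"))
```

Iterating over `pools.links("products")` then visits each product page once, however many categories listed it.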

Example of using a link pool:

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: https://www.diggernaut.com/sandbox/meta-lang-pool-en.html
    do:
    - find:
        path: body > a
        do:
        - parse:
            attr: href
        - normalize:
            routine: url
        - link_add:
            pool: main
              
Source of the sample page:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Diggernaut | Meta-Language | Pool of links sample</title>
</head>
<body>
<a href="https://www.diggernaut.com/sandbox/table-nested-data-en.html">Link N1</a>
<br/>
<a href="https://www.diggernaut.com/sandbox/meta-lang-object-en.html">Link N2</a>
<br/>
<a href="https://www.diggernaut.com/sandbox/meta-lang-register-en.html">Link N3</a>
</body>
</html>
              
Log of the run, newest entries first:

Time Level Message
2017-10-22 00:02:56:111 info Scrape is done
2017-10-22 00:02:56:104 debug Link added
2017-10-22 00:02:56:097 debug Adding link to the pool main: https://www.diggernaut.com/sandbox/meta-lang-register-en.html
2017-10-22 00:02:56:089 debug Results: https://www.diggernaut.com/sandbox/meta-lang-register-en.html
2017-10-22 00:02:56:082 debug Applying normalization: url
2017-10-22 00:02:56:074 debug Parsed content: https://www.diggernaut.com/sandbox/meta-lang-register-en.html
2017-10-22 00:02:56:067 debug Parsing attribute: href
2017-10-22 00:02:56:060 debug Parsing block with arguments: map[attr:href]
2017-10-22 00:02:56:053 debug Block content: Link N3
2017-10-22 00:02:56:045 debug Link added
2017-10-22 00:02:56:038 debug Adding link to the pool main: https://www.diggernaut.com/sandbox/meta-lang-object-en.html
2017-10-22 00:02:56:028 debug Results: https://www.diggernaut.com/sandbox/meta-lang-object-en.html
2017-10-22 00:02:56:018 debug Applying normalization: url
2017-10-22 00:02:56:010 debug Parsed content: https://www.diggernaut.com/sandbox/meta-lang-object-en.html
2017-10-22 00:02:56:002 debug Parsing attribute: href
2017-10-22 00:02:55:994 debug Parsing block with arguments: map[attr:href]
2017-10-22 00:02:55:987 debug Block content: Link N2
2017-10-22 00:02:55:979 debug Link added
2017-10-22 00:02:55:972 debug Adding link to the pool main: https://www.diggernaut.com/sandbox/table-nested-data-en.html
2017-10-22 00:02:55:965 debug Results: https://www.diggernaut.com/sandbox/table-nested-data-en.html
2017-10-22 00:02:55:957 debug Applying normalization: url
2017-10-22 00:02:55:949 debug Parsed content: https://www.diggernaut.com/sandbox/table-nested-data-en.html
2017-10-22 00:02:55:941 debug Parsing attribute: href
2017-10-22 00:02:55:933 debug Parsing block with arguments: map[attr:href]
2017-10-22 00:02:55:925 debug Block content: Link N1
2017-10-22 00:02:55:917 debug Number of found blocks: 3
2017-10-22 00:02:55:910 debug Looking for: body > a
2017-10-22 00:02:55:898 debug Page content: <html lang="en">
<head>
<meta charset="UTF-8"/>
<title>Diggernaut | Meta-Language | Pool of links sample</title>
</head>
<body>
<a href="https://www.diggernaut.com/sandbox/table-nested-data-en.html">Link N1</a>
<br/>
<a href="https://www.diggernaut.com/sandbox/meta-lang-object-en.html">Link N2</a>
<br/>
<a href="https://www.diggernaut.com/sandbox/meta-lang-register-en.html">Link N3</a>
</body>
</html>
2017-10-22 00:02:55:642 info Retrieving page (GET): https://www.diggernaut.com/sandbox/meta-lang-pool-en.html
2017-10-22 00:02:55:635 info Starting scrape
2017-10-22 00:02:55:621 debug Setting up default proxy
2017-10-22 00:02:55:608 debug Setting up surf
2017-10-22 00:02:55:581 info Starting digger: meta-lang-pool [1855]
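
To make the flow recorded in the log easier to follow, here is a rough Python equivalent of what the example config does with the sample page: find every `a` element that is a direct child of `body`, read its `href`, and add it to a pool that skips duplicates. It is a sketch only; the `BodyLinkCollector` class and the `VOID_TAGS` set are hypothetical names, the digger does not run Python, and the URL normalization step is left out because the sample links are already absolute.

```python
from html.parser import HTMLParser

SAMPLE_PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Diggernaut | Meta-Language | Pool of links sample</title>
</head>
<body>
<a href="https://www.diggernaut.com/sandbox/table-nested-data-en.html">Link N1</a>
<br/>
<a href="https://www.diggernaut.com/sandbox/meta-lang-object-en.html">Link N2</a>
<br/>
<a href="https://www.diggernaut.com/sandbox/meta-lang-register-en.html">Link N3</a>
</body>
</html>
"""

# Elements that never get a closing tag, so they are not tracked as open.
VOID_TAGS = {"br", "meta", "link", "img", "hr", "input"}


class BodyLinkCollector(HTMLParser):
    """Collect href values of <a> elements matching 'body > a'."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open elements
        self.pool = []       # stand-in for the pool "main"

    def handle_starttag(self, tag, attrs):
        # 'body > a' means the parent of the link must be <body> itself.
        if tag == "a" and self.open_tags and self.open_tags[-1] == "body":
            href = dict(attrs).get("href")
            if href and href not in self.pool:
                self.pool.append(href)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the element, discarding anything left open inside it.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


collector = BodyLinkCollector()
collector.feed(SAMPLE_PAGE)
for url in collector.pool:
    print(url)
```

The three URLs printed match the three "Adding link to the pool main" entries in the log.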