Runtime Entities
Link Pool
A link pool is a named list of URLs.
For example, you can fill a pool with the URLs of catalog pages and then iterate over it.
Because every URL in a pool is unique and cannot be duplicated, a pool is also a convenient place to collect product page URLs
from all pages of the catalog. Iterating over the pool afterwards scrapes each product exactly once. This avoids duplicate goods,
because in most stores the same product can appear in several categories. It saves your resources, keeps your data set cleaner, and reduces the load
on the source site.
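The saving from deduplication can be illustrated with a short Python sketch. This is not Diggernaut code: the shop URLs are made up for the illustration, and an insertion-ordered dict stands in for the pool.

```python
# Hypothetical catalog: product 2 is listed in two categories.
catalog = {
    "https://shop.example/category/shoes": [
        "https://shop.example/product/1",
        "https://shop.example/product/2",
    ],
    "https://shop.example/category/sale": [
        "https://shop.example/product/2",
        "https://shop.example/product/3",
    ],
}


def collect_product_urls(catalog):
    """Gather product URLs from every catalog page, keeping each URL once."""
    pool = {}  # dict keys are unique and keep insertion order
    for product_urls in catalog.values():
        for url in product_urls:
            pool.setdefault(url, None)
    return list(pool)


found = sum(len(urls) for urls in catalog.values())
pool = collect_product_urls(catalog)
print(f"links found: {found}, pages to fetch: {len(pool)}")
# prints "links found: 4, pages to fetch: 3"
```

Without the pool, product 2 would be fetched and stored twice.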
The five most important points about link pools:
- The command that adds URLs to a pool can take its value from the register, from a string value, or from a compound value with arguments or variables
- Pools are context-independent: they exist in all contexts and can be used in any of them
- The number of pools is not limited
- URLs added to a pool are normalized automatically
- All values (URLs) in a pool are unique, so the list cannot contain two identical addresses
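The last three points can be modeled in a few lines of Python. This is a conceptual sketch only, not how Diggernaut implements pools: the `LinkPools` class and its methods are hypothetical, and the normalization rules (resolve against a base URL, lowercase the scheme and host, drop the fragment) are an assumption, since the actual rules are not described here.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


class LinkPools:
    """Named pools of unique, normalized URLs (conceptual model only)."""

    def __init__(self):
        # pool name -> {url: None}; a pool is created on first use
        self._pools = {}

    @staticmethod
    def normalize(url, base=""):
        # Assumed rules: resolve relative URLs against `base`, lowercase
        # the scheme and host, and drop the fragment.
        parts = urlsplit(urljoin(base, url.strip()))
        return urlunsplit(
            (parts.scheme.lower(), parts.netloc.lower(),
             parts.path or "/", parts.query, "")
        )

    def add(self, pool, url, base=""):
        """Add a URL to the named pool; return True if it was new."""
        links = self._pools.setdefault(pool, {})
        normalized = self.normalize(url, base)
        if normalized in links:
            return False  # already present, the pool stays unchanged
        links[normalized] = None
        return True

    def links(self, pool):
        """Return the URLs of a pool in insertion order."""
        return list(self._pools.get(pool, {}))


pools = LinkPools()
page = "https://www.diggernaut.com/sandbox/meta-lang-pool-en.html"
first = pools.add("main", "https://www.diggernaut.com/sandbox/meta-lang-object-en.html")
# The same address written as a relative link with a fragment is rejected
# as a duplicate once it has been normalized.
second = pools.add("main", "meta-lang-object-en.html#top", base=page)
pools.add("other", "https://www.diggernaut.com/sandbox/meta-lang-register-en.html")
print(first, second, pools.links("main"))
```

Normalizing before the uniqueness check is what lets two spellings of the same address collapse into one entry.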
Example of using a link pool:
---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: https://www.diggernaut.com/sandbox/meta-lang-pool-en.html
    do:
    - find:
        path: body > a
        do:
        - parse:
            attr: href
        - normalize:
            routine: url
        - link_add:
            pool: main
HTML of the sample page:
<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Diggernaut | Meta-Language | Pool of links sample</title>
    </head>
    <body>
        <a href="https://www.diggernaut.com/sandbox/table-nested-data-en.html">Link N1</a>
        <br/>
        <a href="https://www.diggernaut.com/sandbox/meta-lang-object-en.html">Link N2</a>
        <br/>
        <a href="https://www.diggernaut.com/sandbox/meta-lang-register-en.html">Link N3</a>
    </body>
</html>
Log of the run (newest entries first):
Time | Level | Message |
---|---|---|
2017-10-22 00:02:56:111 | info | Scrape is done |
2017-10-22 00:02:56:104 | debug | Link added |
2017-10-22 00:02:56:097 | debug | Adding link to the pool main: https://www.diggernaut.com/sandbox/meta-lang-register-en.html |
2017-10-22 00:02:56:089 | debug | Results: https://www.diggernaut.com/sandbox/meta-lang-register-en.html |
2017-10-22 00:02:56:082 | debug | Applying normalization: url |
2017-10-22 00:02:56:074 | debug | Parsed content: https://www.diggernaut.com/sandbox/meta-lang-register-en.html |
2017-10-22 00:02:56:067 | debug | Parsing attribute: href |
2017-10-22 00:02:56:060 | debug | Parsing block with arguments: map[attr:href] |
2017-10-22 00:02:56:053 | debug | Block content: Link N3 |
2017-10-22 00:02:56:045 | debug | Link added |
2017-10-22 00:02:56:038 | debug | Adding link to the pool main: https://www.diggernaut.com/sandbox/meta-lang-object-en.html |
2017-10-22 00:02:56:028 | debug | Results: https://www.diggernaut.com/sandbox/meta-lang-object-en.html |
2017-10-22 00:02:56:018 | debug | Applying normalization: url |
2017-10-22 00:02:56:010 | debug | Parsed content: https://www.diggernaut.com/sandbox/meta-lang-object-en.html |
2017-10-22 00:02:56:002 | debug | Parsing attribute: href |
2017-10-22 00:02:55:994 | debug | Parsing block with arguments: map[attr:href] |
2017-10-22 00:02:55:987 | debug | Block content: Link N2 |
2017-10-22 00:02:55:979 | debug | Link added |
2017-10-22 00:02:55:972 | debug | Adding link to the pool main: https://www.diggernaut.com/sandbox/table-nested-data-en.html |
2017-10-22 00:02:55:965 | debug | Results: https://www.diggernaut.com/sandbox/table-nested-data-en.html |
2017-10-22 00:02:55:957 | debug | Applying normalization: url |
2017-10-22 00:02:55:949 | debug | Parsed content: https://www.diggernaut.com/sandbox/table-nested-data-en.html |
2017-10-22 00:02:55:941 | debug | Parsing attribute: href |
2017-10-22 00:02:55:933 | debug | Parsing block with arguments: map[attr:href] |
2017-10-22 00:02:55:925 | debug | Block content: Link N1 |
2017-10-22 00:02:55:917 | debug | Number of found blocks: 3 |
2017-10-22 00:02:55:910 | debug | Looking for: body > a |
2017-10-22 00:02:55:898 | debug | Page content: <html lang="en"> <head> <meta charset="UTF-8"/> <title>Diggernaut | Meta-Language | Pool of links sample</title> </head> <body> <a href="https://www.diggernaut.com/sandbox/table-nested-data-en.html">Link N1</a> <br/> <a href="https://www.diggernaut.com/sandbox/meta-lang-object-en.html">Link N2</a> <br/> <a href="https://www.diggernaut.com/sandbox/meta-lang-register-en.html">Link N3</a> </body> </html> |
2017-10-22 00:02:55:642 | info | Retrieving page (GET): https://www.diggernaut.com/sandbox/meta-lang-pool-en.html |
2017-10-22 00:02:55:635 | info | Starting scrape |
2017-10-22 00:02:55:621 | debug | Setting up default proxy |
2017-10-22 00:02:55:608 | debug | Setting up surf |
2017-10-22 00:02:55:581 | info | Starting digger: meta-lang-pool [1855] |