Iterators
Iterator by Dates
Date iterators are used when you need an ordered set of dates, for example when the source site has a search form with a date field and you want to retrieve data only for a specific date or period. When you set up such an iterator, you can specify the start date, the interval in days between iterations, the period that determines the end date, and a template that describes the date format used in the arguments.
Parameter | Description |
---|---|
type | The constant that defines the iterator type; its value is date. |
start | Start date for iterations in YYYY-MM-DD format. If omitted, the current date is used (optional). |
end | End date for iterations in YYYY-MM-DD format. If omitted, the end date is derived from the period parameter (optional). |
period | Number of days between the start and end dates. Defaults to 60 days if omitted (optional). |
interval | Interval in days between iterations. For example, to get arguments for every date between the start and end dates, set interval to 1; for weekly intervals, set it to 7. Defaults to 1 if omitted (optional). |
template | Template used to format the values of the start_date and end_date arguments produced by the iterator (optional). |
In the table below, you can find all possible tags that can be used in the template and examples of usage:
Tag | Description | Template Example | Value Sample |
---|---|---|---|
%a | abbreviated weekday name, e.g. Mon or Fri | %a, %d %B | Fri, 20 February |
%A | full weekday name, e.g. Monday or Friday | %A, %d %B | Friday, 20 February |
%b | abbreviated month name, e.g. Feb or Sep | %A, %d %b | Friday, 20 Jun |
%B | full month name, e.g. February or September | %A, %d %B | Friday, 20 June |
%C | number of century, takes values from 00 to 99 | %C/%y | 20/17 |
%d | day of month, takes values from 01 to 31 | %Y-%m-%d | 2017-10-01 |
%D | preset template, same as %m/%d/%y | %D | 05/08/17 |
%e | day of month, takes values from 1 to 31 | %e %B | 5 January |
%F | preset template, same as %Y-%m-%d | %F | 2017-10-01 |
%g | 2-digit number of year according to ISO-8601:1988 standard | %g | 17 |
%G | 4-digit number of year according to ISO-8601:1988 standard | %G | 2017 |
%h | same as %b | %A, %d %h | Friday, 20 Jun |
%H | hour in 24-hours system, takes values from 00 to 23 | %H:%M:%S | 08:35:26 |
%I | hour in 12-hour system, takes values from 01 to 12 | %I:%M:%S | 08:35:26 |
%j | number of day of year, takes values from 1 to 366 | Today is %j day of year | Today is 183 day of year |
%k | hour in 24-hours system, takes values from 0 to 23 | %k hrs %M mnt | 8 hrs 35 mnt |
%l | hour in 12-hours system, takes values from 1 to 12 | %l hrs %M mnt | 8 hrs 35 mnt |
%m | number of month, takes values from 01 to 12 | %Y-%m-%d | 2017-10-01 |
%M | minutes, takes values from 00 to 59 | %l hrs %M mnt | 8 hrs 35 mnt |
%n | new line symbol | %Y%n%m | 2017\n10 |
%p | value AM or PM depending on time, used with the 12-hour time system | %I%p | 08AM |
%P | value am or pm depending on time, used with the 12-hour time system | %I%P | 08am |
%r | same as %I:%M:%S %p | %r | 04:12:37 PM |
%R | same as %H:%M | %R | 22:35 |
%s | Unix timestamp, the number of seconds since the start of the epoch (1 January 1970) | %s | 1506867213 |
%S | seconds, takes values from 00 to 59 | %H:%M:%S | 08:35:26 |
%t | tabulation symbol | %Y%t%m | 2017\t10 |
%T | same as %H:%M:%S | %T | 08:35:26 |
%u | number of weekday from 1 (Monday) to 7 (Sunday) | Today is %u week day | Today is 5 week day |
%U | number of week of year, with weeks starting on Sunday, takes values from 00 to 53 | It was %U week | It was 23 week |
%V | number of week of year by ISO standard, with weeks starting on Monday, takes values from 01 to 53. If the week containing 1 January has 4 or more days in the new year, it is counted as the first week of the new year; otherwise it is counted as the last week of the previous year. | It was %V week | It was 23 week |
%w | number of day of week from 0 (Sunday) to 6 (Saturday) | Today is %w day of week | Today is 5 day of week |
%W | number of week of year, with weeks starting on Monday, takes values from 00 to 53 | It was %W week | It was 23 week |
%y | 2-digit number of year | %m/%d/%y | 10/01/17 |
%Y | 4-digit number of year | %Y-%m-%d | 2017-10-01 |
%z | offset from UTC, shown as +HHMM or -HHMM, where + means east of GMT, - means west of GMT, HH is the number of hours, and MM is the number of minutes | %z | +0300 |
%Z | abbreviation of timezone | %Z | PST |
%+ | same as %a %b %e %H:%M:%S %Z %Y | %+ | Mon Sep 20 13:24:55 PST 2017 |
%% | symbol % | %Y%%%m | 2017%10 |
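The tags in this table closely resemble the well-known strftime conventions, so a template can be previewed before it goes into a digger configuration. The sketch below uses Python's `datetime.strftime` for that purpose. It assumes that Python renders these common tags the same way the iterator does, which is an assumption rather than documented behavior. Less portable tags such as %P, %k, or %+ are left out on purpose, and the helper name `preview` is hypothetical.

```python
from datetime import date

# Fixed sample date so the output is reproducible (1 October 2017, a Sunday).
sample = date(2017, 10, 1)

# Templates taken from the table above; only widely supported tags are used.
templates = ["%Y-%m-%d", "%B %d %Y", "%a, %d %B", "%m/%d/%y", "%j"]

def preview(day, template_list):
    """Return a mapping of template -> formatted value for the given day."""
    return {template: day.strftime(template) for template in template_list}

rendered = preview(sample, templates)
for template, value in rendered.items():
    print(f"{template!r:14} -> {value}")
```

Month and weekday names come out in English here because Python uses the default C locale unless a program changes it.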
If you do not specify a template, start_date and end_date are formatted according to the ISO standard. In addition to these two arguments, each fieldset contains several other arguments that are useful in many cases:
Argument | Description |
---|---|
start_date | date of the interval start, in the format described by template or ISO standard |
end_date | date of the interval end, in the format described by template or ISO standard |
start_year | year of the interval start, in %Y (YYYY) format |
end_year | year of the interval end, in %Y (YYYY) format |
start_yr | year of the interval start, in %y (YY) format |
end_yr | year of the interval end, in %y (YY) format |
start_month | month of the interval start, in %m (MM) format |
end_month | month of the interval end, in %m (MM) format |
Example of a date iterator:
iterator:
    - type: date
      # SET INTERVAL FOR EVERY 2 DAYS
      interval: 2
      # PERIOD BETWEEN START DATE (IN THIS CASE CURRENT DATE, BECAUSE START DATE PARAMETER IS OMITTED) AND END DATE IS SET TO 10 DAYS
      period: 10
      # TEMPLATE FOR `start_date` AND `end_date`
      template: '%B %d %Y'
As a result, we get the following list of fieldsets; the digger executes the main logic block once for each of them:
[
    {
        "start_date": "October 01 2017", "end_date": "October 02 2017",
        "start_year": "2017", "end_year": "2017",
        "start_yr": "17", "end_yr": "17",
        "start_month": "10", "end_month": "10"
    },
    {
        "start_date": "October 03 2017", "end_date": "October 04 2017",
        "start_year": "2017", "end_year": "2017",
        "start_yr": "17", "end_yr": "17",
        "start_month": "10", "end_month": "10"
    },
    {
        "start_date": "October 05 2017", "end_date": "October 06 2017",
        "start_year": "2017", "end_year": "2017",
        "start_yr": "17", "end_yr": "17",
        "start_month": "10", "end_month": "10"
    },
    {
        "start_date": "October 07 2017", "end_date": "October 08 2017",
        "start_year": "2017", "end_year": "2017",
        "start_yr": "17", "end_yr": "17",
        "start_month": "10", "end_month": "10"
    },
    {
        "start_date": "October 09 2017", "end_date": "October 10 2017",
        "start_year": "2017", "end_year": "2017",
        "start_yr": "17", "end_yr": "17",
        "start_month": "10", "end_month": "10"
    }
]
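The fieldsets above can be reproduced with a short sketch. The Python below is an illustration, not the digger's implementation. The function name `date_fieldsets` is hypothetical. The rule that each fieldset spans `interval` days, with end_date falling on the last day of that span, is inferred from the sample output. Clipping the final window to the end of the period and using `%Y-%m-%d` as the ISO default are further assumptions.

```python
from datetime import date, timedelta

def date_fieldsets(start, period=60, interval=1, template="%Y-%m-%d"):
    """Build one fieldset per window of `interval` days within `period` days.

    Assumptions (inferred from the sample output, not documented behavior):
    windows do not overlap, and the last window is clipped so that it never
    runs past the final day of the period.
    """
    last_day = start + timedelta(days=period - 1)
    fieldsets = []
    window_start = start
    while window_start <= last_day:
        window_end = min(window_start + timedelta(days=interval - 1), last_day)
        fieldsets.append({
            "start_date": window_start.strftime(template),
            "end_date": window_end.strftime(template),
            "start_year": window_start.strftime("%Y"),
            "end_year": window_end.strftime("%Y"),
            "start_yr": window_start.strftime("%y"),
            "end_yr": window_end.strftime("%y"),
            "start_month": window_start.strftime("%m"),
            "end_month": window_end.strftime("%m"),
        })
        window_start += timedelta(days=interval)
    return fieldsets

# Same settings as the example above, with the start date pinned to 1 October 2017.
result = date_fieldsets(date(2017, 10, 1), period=10, interval=2, template="%B %d %Y")
for fieldset in result:
    print(fieldset["start_date"], "->", fieldset["end_date"])
```

Pinning the start date keeps the output stable. In the configuration above the start parameter is omitted, so the real iterator would begin from the current date instead.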
Date iterators are often used to organize incremental data collection: each run requests only the dates it needs, which saves resources and completes the task faster.
Example of using a date iterator in the digger:
---
config:
    debug: 2
    agent: Firefox
iterator:
    type: date
    start: '2017-10-01'
    period: 4
    interval: 2
    template: '%Y-%m-%d'
do:
- walk:
    to: https://www.diggernaut.com/sandbox/meta-lang-object-en.html?from=<%start_date%>&to=<%end_date%>
    do:
The digger log for this run, with the newest entries first:
Time | Level | Message |
---|---|---|
2017-10-23 14:23:41:335 | info | Scrape is done |
2017-10-23 14:23:41:321 | debug | Page content: <!DOCTYPE html><html lang="en"><head> <meta charset="UTF-8"/> <title>Diggernaut | Meta-language | Object sample</title> </head> <body> <h1>Title-1</h1> <p>Lorem ipsum dolor sit amet.</p> </body></html> |
2017-10-23 14:23:41:166 | debug | Referers: Referer: https://www.diggernaut.com/sandbox/meta-lang-object-en.html?from=2017-10-01&to=2017-10-02 |
2017-10-23 14:23:41:158 | debug | Referer: https://www.diggernaut.com/sandbox/meta-lang-object-en.html?from=2017-10-01&to=2017-10-02 |
2017-10-23 14:23:41:150 | info | Retrieving page (GET): https://www.diggernaut.com/sandbox/meta-lang-object-en.html?from=2017-10-03&to=2017-10-04 |
2017-10-23 14:23:41:138 | debug | Page content: <!DOCTYPE html><html lang="en"><head> <meta charset="UTF-8"/> <title>Diggernaut | Meta-language | Object sample</title> </head> <body> <h1>Title-1</h1> <p>Lorem ipsum dolor sit amet.</p> </body></html> |
2017-10-23 14:23:40:185 | info | Retrieving page (GET): https://www.diggernaut.com/sandbox/meta-lang-object-en.html?from=2017-10-01&to=2017-10-02 |
2017-10-23 14:23:40:178 | info | Starting scrape |
2017-10-23 14:23:40:166 | debug | Setting up default proxy |
2017-10-23 14:23:40:153 | debug | Setting up surf |
2017-10-23 14:23:40:125 | info | Starting digger: meta-lang-iterator [1859] |
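The two URLs requested in the log follow directly from the iterator settings. As a rough sketch, the same URLs can be built in Python. It assumes that `<%name%>` placeholders are filled by plain text substitution. The helper `fill_placeholders` is hypothetical, and the window rule is again inferred from the examples, not taken from the digger's source.

```python
import re
from datetime import date, timedelta

URL = ("https://www.diggernaut.com/sandbox/meta-lang-object-en.html"
       "?from=<%start_date%>&to=<%end_date%>")

def fill_placeholders(text, values):
    """Replace each <%name%> placeholder with values[name]; unknown names are left untouched."""
    return re.sub(r"<%(\w+)%>", lambda m: values.get(m.group(1), m.group(0)), text)

# Settings from the digger configuration above.
start, period, interval = date(2017, 10, 1), 4, 2

urls = []
for offset in range(0, period, interval):
    window_start = start + timedelta(days=offset)
    window_end = window_start + timedelta(days=interval - 1)
    urls.append(fill_placeholders(URL, {
        "start_date": window_start.strftime("%Y-%m-%d"),
        "end_date": window_end.strftime("%Y-%m-%d"),
    }))

for url in urls:
    print(url)
```

A period of 4 days with an interval of 2 yields two fieldsets, which matches the two "Retrieving page" entries in the log.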
Next, we will learn more about CSV iterators.