Extract data from iCal? It couldn’t be easier.
Today we will write a scraper for resources that publish their event data as files in iCal format.
The format, formally known as iCalendar, is an open standard (RFC 5545) supported by Apple Calendar, Google Calendar, and many other applications, and many websites let you export their calendar events in it. In that case you do not need to scrape the website and parse its HTML; you only need to fetch and parse one file in iCal format, which makes the whole process much more manageable.
Diggernaut.com natively supports this format and automatically converts it to XML, so we can work with iCal data the same way as with a regular HTML page.
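To make the idea of the conversion concrete, here is a minimal Python sketch that turns iCal text into an XML tree. It is only an illustration and not Diggernaut's converter: the function names, the sample calendar, and the `calendar` root element are assumptions, and the tag names are simply the lowercased iCal property names (`dtstart`, `dtend`), which differ from the ones Diggernaut produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample calendar, shortened for illustration.
SAMPLE_ICS = """BEGIN:VCALENDAR
VERSION:2.0
BEGIN:VEVENT
SUMMARY:WinterWar 38
DESCRIPTION:Gaming Convention
LOCATION:Champaign IL (USA)
DTSTART;VALUE=DATE:20110128
DTEND;VALUE=DATE:20110131
END:VEVENT
END:VCALENDAR
"""


def unfold(text):
    """Join folded lines: a line starting with a space or tab continues the previous one."""
    lines = []
    for raw in text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        elif raw:
            lines.append(raw)
    return lines


def ical_to_xml(text):
    """Build an XML tree with one <event> element per VEVENT block.

    Simplified on purpose: nested components (such as VALARM) and quoted
    parameters that contain ':' are not handled.
    """
    root = ET.Element("calendar")
    event = None
    for line in unfold(text):
        name, _, value = line.partition(":")
        name = name.split(";", 1)[0].lower()  # drop parameters like ;VALUE=DATE
        if name == "begin" and value.upper() == "VEVENT":
            event = ET.SubElement(root, "event")
        elif name == "end" and value.upper() == "VEVENT":
            event = None
        elif event is not None:
            ET.SubElement(event, name).text = value
    return root


root = ical_to_xml(SAMPLE_ICS)
print(ET.tostring(root, encoding="unicode"))
```

Once the calendar is a tree of elements, each event can be addressed by tag name, which is what lets a scraper treat it like any other page.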
Let’s see how it works by extracting data from a Science Fiction Conventions calendar that I found on the icalshare.com website. We start the config with some basic settings. First, we need to set the digger’s debug level to 2. This is the only way to see the source of the converted file, and we need to see it in order to write the navigation instructions for walking to the blocks with the data we want to extract.
The calendar file we are going to use is: https://www.google.com/calendar/ical/lirleni%40gmail.com/public/basic.ics
So, our config starts with:
---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: https://www.google.com/calendar/ical/lirleni%40gmail.com/public/basic.ics
    do:
    - stop
In this config we set the debug level to 2, configure the digger to identify itself as Firefox, and fetch the iCal file. Now we need to log in to our account at Diggernaut.com, select an existing project or create a new one, create a new digger, and save the config we wrote to it.
Make sure that the digger is set to Debug mode (you should see Debug in the Status column). If it is not, switch the digger to Debug mode using the selector in the Status column. Then start the digger and wait until it finishes the run. When it’s done, click the “Log” button to check the log, where you will find the converted source:
<html>
<head></head>
<body>
<body_safe>
  <event>
    <alarmtime>0s</alarmtime>
    <class>PUBLIC</class>
    <created>2009-03-07 20:15:36 +0000 UTC</created>
    <description>Gaming Convention</description>
    <end>2011-01-31 00:00:00 +0000 UTC</end>
    <id>2a994c3e3b80f5af6e8fa178a3af45d4</id>
    <importedid>b82f342c-0b54-11de-b762-000d936372a6</importedid>
    <location>Champaign IL (USA)</location>
    <modified>2011-02-20 23:29:49 +0000 UTC</modified>
    <rrule></rrule>
    <sequence>4</sequence>
    <start>2011-01-28 00:00:00 +0000 UTC</start>
    <status>CONFIRMED</status>
    <summary>WinterWar 38</summary>
    <wholedauyevent>true</wholedauyevent>
  </event>
  <event>
    <alarmtime>0s</alarmtime>
    <class></class>
    <created>2009-01-10 17:33:11 +0000 UTC</created>
    <description>Gaming convention</description>
    <end>2011-08-08 00:00:00 +0000 UTC</end>
    <id>5c8f16d772ede097822e73a0c2e51c6c</id>
    <importedid>356F0F0C-FE52-47A8-AEAB-8E78F57D4F52</importedid>
    <location>Indianapolis IN</location>
    <modified>2011-02-20 23:29:48 +0000 UTC</modified>
    <rrule></rrule>
    <sequence>10</sequence>
    <start>2011-08-04 00:00:00 +0000 UTC</start>
    <status>CONFIRMED</status>
    <summary>GenCon</summary>
    <wholedauyevent>true</wholedauyevent>
  </event>
  ...
As you can see, the page consists of a series of event blocks like this one:
<event>
  <alarmtime>0s</alarmtime>
  <class>PUBLIC</class>
  <created>2009-03-07 20:15:36 +0000 UTC</created>
  <description>Gaming Convention</description>
  <end>2011-01-31 00:00:00 +0000 UTC</end>
  <id>2a994c3e3b80f5af6e8fa178a3af45d4</id>
  <importedid>b82f342c-0b54-11de-b762-000d936372a6</importedid>
  <location>Champaign IL (USA)</location>
  <modified>2011-02-20 23:29:49 +0000 UTC</modified>
  <rrule></rrule>
  <sequence>4</sequence>
  <start>2011-01-28 00:00:00 +0000 UTC</start>
  <status>CONFIRMED</status>
  <summary>WinterWar 38</summary>
  <wholedauyevent>true</wholedauyevent>
</event>
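We are not going to pick all of these fields; let’s get only the summary, description, start date/time, end date/time, and location. It’s very easy to do: first we walk to each event block and create a data object, then we walk to the field blocks, parse the data and save it to the object’s fields, and finally we save the data object. For the text fields we also strip backslashes, which iCal uses to escape characters such as commas, and for the dates we keep only the date and time part.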
---
config:
    agent: Firefox
do:
- walk:
    to: https://www.google.com/calendar/ical/lirleni%40gmail.com/public/basic.ics
    do:
    - find:
        path: event
        do:
        - object_new: event
        - find:
            path: summary
            do:
            - parse
            - normalize:
                routine: replace_substring
                args:
                    \\: ''
            - object_field_set:
                object: event
                field: summary
        - find:
            path: description
            do:
            - parse
            - normalize:
                routine: replace_substring
                args:
                    \\: ''
            - object_field_set:
                object: event
                field: description
        - find:
            path: location
            do:
            - parse
            - normalize:
                routine: replace_substring
                args:
                    \\: ''
            - object_field_set:
                object: event
                field: location
        - find:
            path: start
            do:
            - parse:
                filter: (\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})
            - object_field_set:
                object: event
                field: start_date
        - find:
            path: end
            do:
            - parse:
                filter: (\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})
            - object_field_set:
                object: event
                field: end_date
        - object_save:
            name: event
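For readers who want to see the same logic outside of Diggernaut, here is a rough Python equivalent of the config above. It is a sketch under assumptions rather than what the digger runs internally: the helper names and the shortened sample XML are made up for illustration, while the regular expression is the same one used in the `filter` option.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the converted source, for illustration only.
SAMPLE_XML = """<body_safe>
<event>
<description>Gaming Convention</description>
<end>2011-01-31 00:00:00 +0000 UTC</end>
<location>Champaign IL (USA)</location>
<start>2011-01-28 00:00:00 +0000 UTC</start>
<summary>WinterWar 38</summary>
</event>
<event>
<description>Gaming convention</description>
<end>2011-08-08 00:00:00 +0000 UTC</end>
<location>Indianapolis IN</location>
<start>2011-08-04 00:00:00 +0000 UTC</start>
<summary>GenCon</summary>
</event>
</body_safe>"""

# Same pattern as the `filter` option in the config: keep date and time only.
DATE_RE = re.compile(r"(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})")


def clean_text(value):
    """Mirror the replace_substring step: remove backslashes left over from iCal escaping."""
    return (value or "").replace("\\", "").strip()


def extract_date(value):
    """Return the 'YYYY-MM-DD HH:MM:SS' part of a value, or '' if it is missing."""
    match = DATE_RE.search(value or "")
    return match.group(1) if match else ""


def extract_events(xml_text):
    """Walk to every <event> block and collect the five fields we care about."""
    records = []
    for event in ET.fromstring(xml_text).iter("event"):
        records.append({
            "summary": clean_text(event.findtext("summary")),
            "description": clean_text(event.findtext("description")),
            "location": clean_text(event.findtext("location")),
            "start_date": extract_date(event.findtext("start")),
            "end_date": extract_date(event.findtext("end")),
        })
    return records


for record in extract_events(SAMPLE_XML):
    print(record)
```

Each dictionary plays the role of the `event` object in the config: it is created per event block, filled field by field, and then stored.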
Let’s put our config into the digger and run it. Once it’s done, go to the Data section and make sure the scraped data is in good shape. You should see something like this:
Item #1
start_date 2011-01-28 00:00:00
summary WinterWar 38
description Gaming Convention
end_date 2011-01-31 00:00:00
location Champaign IL (USA)
Item #2
start_date 2011-08-04 00:00:00
summary GenCon
description Gaming convention
end_date 2011-08-08 00:00:00
location Indianapolis IN
Item #3
start_date 2011-03-11 00:00:00
summary Madicon 20
description Science Fiction, with a large proportion of Gaming
end_date 2011-03-14 00:00:00
location Harrisonburg VA, USA
If the data looks good, switch the digger to Active mode: in Debug mode you cannot download data, only review a limited set of it. Start the digger again and wait for it to complete. Then go to the Data section again and download the data in the format you need. You can download a sample in XLSX format here.
As you can see, it’s straightforward to work with iCal at Diggernaut!