Mikhail Sisin in Learning meta-language, Web scraping

Extract data from iCal? It couldn’t be easier.

Today we will write a script for scraping resources that publish their event data as files in iCal format.

The format is standardized as iCalendar (RFC 5545) and is commonly called iCal, a name it shares with Apple's calendar application. Many websites now let you export calendar events in this format. In that case, you do not need to scrape the website and parse its HTML. You only need to fetch and parse a file in iCal format, which makes the whole process much more manageable.

Diggernaut.com natively supports this format and automatically converts it to XML, so we can work with iCal data just as we would with a regular HTML page.
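To get a feel for what such a conversion involves, here is a minimal Python sketch that turns a tiny iCal snippet into XML. It is only an illustration and not Diggernaut's actual converter: the raw `SAMPLE_ICS` text and the `ical_to_xml` helper are assumptions made up for this example, and the tag names it produces (`dtstart`, `dtend`) differ from the ones Diggernaut emits. It also ignores folded (continued) lines, which real iCal files may contain.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw iCal text; the real calendar file is not reproduced here.
SAMPLE_ICS = """BEGIN:VCALENDAR
VERSION:2.0
BEGIN:VEVENT
SUMMARY:WinterWar 38
DESCRIPTION:Gaming Convention
LOCATION:Champaign IL (USA)
DTSTART;VALUE=DATE:20110128
DTEND;VALUE=DATE:20110131
END:VEVENT
END:VCALENDAR
"""


def ical_to_xml(text):
    """Turn each VEVENT into an <event> element with one child per property."""
    root = ET.Element("body_safe")
    event = None
    for line in text.splitlines():
        if line == "BEGIN:VEVENT":
            event = ET.SubElement(root, "event")
        elif line == "END:VEVENT":
            event = None
        elif event is not None and ":" in line:
            name, value = line.split(":", 1)
            # Drop property parameters such as ";VALUE=DATE".
            tag = name.split(";", 1)[0].lower()
            ET.SubElement(event, tag).text = value
    return root


xml_root = ical_to_xml(SAMPLE_ICS)
print(ET.tostring(xml_root, encoding="unicode"))
```

Once the data is in this shape, every event is a block with one child element per field, which is what makes it possible to navigate it like a regular page.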

Let’s see how it works by extracting data from a Science Fiction Conventions calendar that I found on the icalshare.com website. We start the config with some basic settings. First, we set the digger's debug level to 2. This is the only way to see the source of the converted file, and we need to check it so we can write the navigation instructions that walk to the blocks with the data we want to extract and collect.

The calendar file we are going to use is: https://www.google.com/calendar/ical/lirleni%40gmail.com/public/basic.ics

So, our config starts with:

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: https://www.google.com/calendar/ical/lirleni%40gmail.com/public/basic.ics
    do:
    - stop

In this config we set the debug level to 2, tell the digger to identify itself as Firefox, and fetch the iCal file. Now we need to log in to our account at Diggernaut.com, select an existing project or create a new one, create a new digger, and save the config we wrote to it.

Make sure that the digger is set to Debug mode (in the Status column you should see Debug). If it isn't, switch the digger to Debug mode using the selector in the Status column. Then start the digger and wait until it finishes the run. When it’s done, check the log by clicking the “Log” button. It shows the converted source of the file:

<html>

<head></head>

<body>
    <body_safe>
        <event>
            <alarmtime>0s</alarmtime>
            <class>PUBLIC</class>
            <created>2009-03-07 20:15:36 +0000 UTC</created>
            <description>Gaming Convention</description>
            <end>2011-01-31 00:00:00 +0000 UTC</end>
            <id>2a994c3e3b80f5af6e8fa178a3af45d4</id>
            <importedid>b82f342c-0b54-11de-b762-000d936372a6</importedid>
            <location>Champaign IL (USA)</location>
            <modified>2011-02-20 23:29:49 +0000 UTC</modified>
            <rrule></rrule>
            <sequence>4</sequence>
            <start>2011-01-28 00:00:00 +0000 UTC</start>
            <status>CONFIRMED</status>
            <summary>WinterWar 38</summary>
            <wholedauyevent>true</wholedauyevent>
        </event>
        <event>
            <alarmtime>0s</alarmtime>
            <class></class>
            <created>2009-01-10 17:33:11 +0000 UTC</created>
            <description>Gaming convention</description>
            <end>2011-08-08 00:00:00 +0000 UTC</end>
            <id>5c8f16d772ede097822e73a0c2e51c6c</id>
            <importedid>356F0F0C-FE52-47A8-AEAB-8E78F57D4F52</importedid>
            <location>Indianapolis IN</location>
            <modified>2011-02-20 23:29:48 +0000 UTC</modified>
            <rrule></rrule>
            <sequence>10</sequence>
            <start>2011-08-04 00:00:00 +0000 UTC</start>
            <status>CONFIRMED</status>
            <summary>GenCon</summary>
            <wholedauyevent>true</wholedauyevent>
        </event> ...

As you can see, the page structure consists of event blocks, so all we need is to go through these blocks and pick fields from each one. Let’s take one block and reformat it to see more clearly which data fields we need and which filters we might use.

<event>
    <alarmtime>0s</alarmtime>
    <class>PUBLIC</class>
    <created>2009-03-07 20:15:36 +0000 UTC</created>
    <description>Gaming Convention</description>
    <end>2011-01-31 00:00:00 +0000 UTC</end>
    <id>2a994c3e3b80f5af6e8fa178a3af45d4</id>
    <importedid>b82f342c-0b54-11de-b762-000d936372a6</importedid>
    <location>Champaign IL (USA)</location>
    <modified>2011-02-20 23:29:49 +0000 UTC</modified>
    <rrule></rrule>
    <sequence>4</sequence>
    <start>2011-01-28 00:00:00 +0000 UTC</start>
    <status>CONFIRMED</status>
    <summary>WinterWar 38</summary>
    <wholedauyevent>true</wholedauyevent>
</event>
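Two things are worth noticing in this block: the date fields carry a trailing `+0000 UTC` offset, and text fields in iCal data may contain backslash escapes (the format escapes commas and semicolons in text values). The Python sketch below shows, outside of Diggernaut, what a regular expression filter and a backslash removal would do to such values. The regular expression is the same one used in the config later in this article; the helper names and the escaped sample string are assumptions for illustration only.

```python
import re

# Same pattern as the "filter" option used for the start and end fields.
DATE_FILTER = re.compile(r"(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})")


def apply_date_filter(raw):
    """Return the date/time part of the value, or None if nothing matches."""
    match = DATE_FILTER.search(raw)
    return match.group(1) if match else None


def strip_backslashes(raw):
    """Remove every backslash, like replacing a backslash with an empty string."""
    return raw.replace("\\", "")


start = apply_date_filter("2011-01-28 00:00:00 +0000 UTC")
# Assumed raw value: a location whose comma was escaped in the iCal file.
location = strip_backslashes("Harrisonburg VA\\, USA")

print(start)     # 2011-01-28 00:00:00
print(location)  # Harrisonburg VA, USA
```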

We are not going to pick all of these fields. Let’s get only the summary, description, start date/time, end date/time, and location. It’s very easy to do: first, we walk to the event block and create a data object; then we walk to the field blocks, parse the data, and save it to the object fields; finally, we save the data object.

---
config:
    agent: Firefox
do:
- walk:
    to: https://www.google.com/calendar/ical/lirleni%40gmail.com/public/basic.ics
    do:
    - find:
        path: event
        do:
        - object_new: event
        - find:
            path: summary
            do:
            - parse
            - normalize:
                routine: replace_substring
                args:
                    \\: ''
            - object_field_set:
                object: event
                field: summary
        - find:
            path: description
            do:
            - parse
            - normalize:
                routine: replace_substring
                args:
                    \\: ''
            - object_field_set:
                object: event
                field: description
        - find:
            path: location
            do:
            - parse
            - normalize:
                routine: replace_substring
                args:
                    \\: ''
            - object_field_set:
                object: event
                field: location
        - find:
            path: start
            do:
            - parse:
                filter: (\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})
            - object_field_set:
                object: event
                field: start_date
        - find:
            path: end
            do:
            - parse:
                filter: (\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})
            - object_field_set:
                object: event
                field: end_date
        - object_save:
            name: event

Let’s put our config into the digger and run it. Once it’s done, go to the Data section and make sure the scraped data is in a good state. You should see something like:

Item #1 
start_date  2011-01-28 00:00:00
summary WinterWar 38
description Gaming Convention
end_date    2011-01-31 00:00:00
location    Champaign IL (USA)
Item #2 
start_date  2011-08-04 00:00:00
summary GenCon
description Gaming convention
end_date    2011-08-08 00:00:00
location    Indianapolis IN
Item #3 
start_date  2011-03-11 00:00:00
summary Madicon 20
description Science Fiction, with a large proportion of Gaming
end_date    2011-03-14 00:00:00
location    Harrisonburg VA, USA

If the data is good, let’s switch our digger to Active mode, because in Debug mode you cannot download data; you can only review a limited set of it. Start the digger again and wait for completion. Then go to the Data section again and download the data in the format you need. You can download a sample in XLSX format here.

As you can see, it’s straightforward to work with iCal at Diggernaut!


Mikhail Sisin: Co-founder of cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.
