Mikhail Sisin Co-founder of cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.

Extract data from XLS, XLSX and CSV files

7 min read

Extract data from XLS, XLSX and CSV files

Today, the support for files in XLS, XLSX and CSV format has been added to the Diggernaut platform. The way it was implemented is same as for other supported file types. You load a file into the digger using the walk command, the digger gets the file, determines its type and converts it to XML. Next, you can traverse the DOM structure, extract the necessary data and create your data set.

Let’s see how exactly it works by an example. To do it, we uploaded 3 files to our sandbox:
https://www.diggernaut.com/sandbox/sample.csv – CSV data file
https://www.diggernaut.com/sandbox/sample.xls – XLS data file (binary version)
https://www.diggernaut.com/sandbox/sample.xlsx – XLSX data file (XML version)

We’ll code a straightforward digger configuration that loads the file and show us the source code of the converted data in debug mode.

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: https://www.diggernaut.com/sandbox/sample.csv
    do:

If we run the digger in debug mode, then in the log we can see the following XML source page with data:

<html>
<head></head>
<body>
    <doc>
        <sheet name="default">
            <row class="1">
                <column class="1">First</column>
                <column class="2">Last</column>
                <column class="3">Pcode</column>
                <column class="4">Political Party</column>
            </row>
            <row class="2">
                <column class="1">Smith</column>
                <column class="2">Fred</column>
                <column class="3">A</column>
                <column class="4">Democratic</column>
            </row>
            <row class="3">
                <column class="1">Robbins</column>
                <column class="2">Terry</column>
                <column class="3">1</column>
                <column class="4">Green</column>
            </row>
            <row class="4">
                <column class="1">O'Neill</column>
                <column class="2">Susan</column>
                <column class="3">B</column>
                <column class="4">Republican</column>
            </row>
            <row class="5">
                <column class="1">Parker</column>
                <column class="2">Scott</column>
                <column class="3">D</column>
                <column class="4">American Independent</column>
            </row>
            <row class="6">
                <column class="1">Perkins</column>
                <column class="2">Ralph</column>
                <column class="3">D</column>
                <column class="4">American Independent</column>
            </row>
            <row class="7">
                <column class="1">Talbot</column>
                <column class="2">Angie</column>
                <column class="3">7</column>
                <column class="4">Middle Class Pty</column>
            </row>
        </sheet>
    </doc>
</body>
</html>

Since there is only one sheet in CSV, we have a single sheet element in the resulting structure. In XLS / XLSX, there can be many sheets, and all of them are kept in the corresponding sheet elements. It’s quite easy to parse this structure, go through the sheets, then go through the rows row and extract the data from the columns column. The values in the classes correspond to the row and column number in the original file.

Let’s now see how the XLS resource will be converted:

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: https://www.diggernaut.com/sandbox/sample.xls
    do:

We get the following source code:

<html>
<head></head>
<body>
    <doc>
        <sheet name="Voters">
            <row class="1">
                <column class="1">First</column>
                <column class="2">Last</column>
                <column class="3">Pcode</column>
                <column class="4">Political Party</column>
            </row>
            <row class="2">
                <column class="1">Smith</column>
                <column class="2">Fred</column>
                <column class="3">A</column>
            </row>
            <row class="3">
                <column class="1">Robbins</column>
                <column class="2">Terry</column>
                <column class="3">1</column>
            </row>
            <row class="4">
                <column class="1">O'Neill</column>
                <column class="2">Susan</column>
                <column class="3">B</column>
            </row>
            <row class="5">
                <column class="1">Parker</column>
                <column class="2">Scott</column>
                <column class="3">D</column>
            </row>
            <row class="6">
                <column class="1">Perkins</column>
                <column class="2">Ralph</column>
                <column class="3">D</column>
            </row>
            <row class="7">
                <column class="1">Talbot</column>
                <column class="2">Angie</column>
                <column class="3">7</column>
            </row>
        </sheet>
        <sheet name="Party Codes">
            <row class="1">
                <column class="1">PARTY CODE</column>
                <column class="2">NAME</column>
            </row>
            <row class="2">
                <column class="1">1</column>
                <column class="2">Green</column>
            </row>
            <row class="3">
                <column class="1">2</column>
                <column class="2">Reform</column>
            </row>
            <row class="4">
                <column class="1">3</column>
                <column class="2">Whig</column>
            </row>
            <row class="5">
                <column class="1">4</column>
                <column class="2">Islamic Political Party of America</column>
            </row>
            <row class="6">
                <column class="1">5</column>
                <column class="2">Rock & Roll</column>
            </row>
            <row class="7">
                <column class="1">6</column>
                <column class="2">Natural Law</column>
            </row>
            <row class="8">
                <column class="1">7</column>
                <column class="2">Middle Class Pty</column>
            </row>
            <row class="9">
                <column class="1">8</column>
                <column class="2">Humanist</column>
            </row>
            <row class="10">
                <column class="1">9</column>
                <column class="2">Pragmatic</column>
            </row>
            <row class="11">
                <column class="1">10</column>
                <column class="2">Conscious American African Party</column>
            </row>
            <row class="12">
                <column class="1">11</column>
                <column class="2">Parliament Party</column>
            </row>
            <row class="13">
                <column class="1">12</column>
                <column class="2">United Conscious Builders of the Dream Party</column>
            </row>
            <row class="14">
                <column class="1">13</column>
                <column class="2">The Egalitarian Party</column>
            </row>
            <row class="15">
                <column class="1">14</column>
                <column class="2">The Humanitarian Party</column>
            </row>
            <row class="16">
                <column class="1">15</column>
                <column class="2">Scientifically Evolving University Party</column>
            </row>
            <row class="17">
                <column class="1">16</column>
                <column class="2">God, Truth & Love Party</column>
            </row>
            <row class="18">
                <column class="1">17</column>
                <column class="2">Superhappy Party</column>
            </row>
            <row class="19">
                <column class="1">18</column>
                <column class="2">Working Families Party</column>
            </row>
            <row class="20">
                <column class="1">A</column>
                <column class="2">Democratic</column>
            </row>
            <row class="21">
                <column class="1">B</column>
                <column class="2">Republican</column>
            </row>
            <row class="22">
                <column class="1">C</column>
                <column class="2">Decline to State</column>
            </row>
            <row class="23">
                <column class="1">D</column>
                <column class="2">American Independent</column>
            </row>
            <row class="24">
                <column class="1">E</column>
                <column class="2">Citizen Party</column>
            </row>
            <row class="25">
                <column class="1">F</column>
                <column class="2">Communist</column>
            </row>
            <row class="26">
                <column class="1">G</column>
                <column class="2">Conservative</column>
            </row>
            <row class="27">
                <column class="1">H</column>
                <column class="2">Environmentalist</column>
            </row>
            <row class="28">
                <column class="1">I</column>
                <column class="2">Ind. Progressive</column>
            </row>
            <row class="29">
                <column class="1">J</column>
                <column class="2">Liberal</column>
            </row>
            <row class="30">
                <column class="1">K</column>
                <column class="2">Peace & Freedom</column>
            </row>
            <row class="31">
                <column class="1">L</column>
                <column class="2">Prohibition</column>
            </row>
            <row class="32">
                <column class="1">M</column>
                <column class="2">New Economy</column>
            </row>
            <row class="33">
                <column class="1">N</column>
                <column class="2">Socialist</column>
            </row>
            <row class="34">
                <column class="1">O</column>
                <column class="2">Socialist Labor</column>
            </row>
            <row class="35">
                <column class="1">P</column>
                <column class="2">Pot Party</column>
            </row>
            <row class="36">
                <column class="1">Q</column>
                <column class="2">Libertarian</column>
            </row>
            <row class="37">
                <column class="1">R</column>
                <column class="2">Amer. Natl. Socialist</column>
            </row>
            <row class="38">
                <column class="1">S</column>
                <column class="2">Poor People’s Party</column>
            </row>
            <row class="39">
                <column class="1">T</column>
                <column class="2">Free</column>
            </row>
            <row class="40">
                <column class="1">U</column>
                <column class="2">National</column>
            </row>
            <row class="41">
                <column class="1">V</column>
                <column class="2">Constitution Party</column>
            </row>
            <row class="42">
                <column class="1">W</column>
                <column class="2">Vision</column>
            </row>
            <row class="43">
                <column class="1">X</column>
                <column class="2">Puritan</column>
            </row>
            <row class="44">
                <column class="1">Y</column>
                <column class="2">Federal</column>
            </row>
            <row class="45">
                <column class="1">Z</column>
                <column class="2">Misc.</column>
            </row>
            <row class="46">
                <column class="1"></column>
                <column class="2"></column>
            </row>
        </sheet>
    </doc>
</body>
</html>

As you can see, in this file we have 2 sheets, and the rest is the same structure as in the case of CSV. If we load XLSX, we get precisely the same result as with XLS, so we omit this test.

How can you use this functionality, except for the actual parsing of the final data? Alternatively, you can use spreadsheets as a feed with the resources your digger should scrape. For example, you add a list of links to products in the store to the sheet. Your scraper reads the sheet, picks up the list of URLs, puts them into the pool, and then the main logic of the scraper is used to collect the data about the goods. Alternatively, imagine that you have a spreadsheet with data that must be extended with the data from the web. Your scraper reads the sheet, go through it line by line and form a new dataset, for each line it can visit some page and extract some additional information to keep in the new dataset. This way you will have data from the spreadsheet and the product page merged in a single entry. There are other options for using the spreadsheets, but we can talk about it next time.

Mikhail Sisin Co-founder of cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.

One Reply to “Extract data from XLS, XLSX and CSV files”

Leave a Reply

Your email address will not be published. Required fields are marked *


The reCAPTCHA verification period has expired. Please reload the page.