A new version of the Surf library for Golang has been released. This version can bypass the latest version of CloudFlare protection. We use this library in our engine, so our users get the benefit right away. The library bypasses the protection automatically, so you don’t need to do anything extra. You simply load a page as usual, and if there is a CloudFlare challenge, the library solves it automatically and you get the content of the page you requested.
You are free to use the Surf library from our repo in your own projects; it is released under the MIT license and is forked from headzoo/surf. Keep in mind, however, that we maintain our own version, tailored to the needs of our web scraping engine.
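To picture what "automatic" means here, the following is a minimal sketch of the general pattern, not Surf's actual code: a wrapper around Go's `http.RoundTripper` that notices a challenge response, obtains a clearance cookie, and retries, so the caller only ever sees the final page. The names `challengeTransport`, `isChallenge`, `solve`, and the `clearance` cookie are hypothetical, the detection heuristic (a 503 response with `Server: cloudflare`) is an assumption, and the solver is a stub. A local test server stands in for the protected site.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// challengeTransport wraps another RoundTripper. When a response looks like a
// protection challenge, it asks solve for a cookie and retries the request once.
// Hypothetical illustration only; this is not how Surf is implemented.
type challengeTransport struct {
	next  http.RoundTripper
	solve func(resp *http.Response) (*http.Cookie, error)
}

// isChallenge is an assumed heuristic: a 503 response served by "cloudflare".
func isChallenge(resp *http.Response) bool {
	return resp.StatusCode == http.StatusServiceUnavailable &&
		resp.Header.Get("Server") == "cloudflare"
}

func (t *challengeTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.next.RoundTrip(req)
	if err != nil || !isChallenge(resp) {
		return resp, err
	}
	cookie, err := t.solve(resp)
	resp.Body.Close()
	if err != nil {
		return nil, err
	}
	// Clone the request instead of mutating the caller's copy.
	retry := req.Clone(req.Context())
	retry.AddCookie(cookie)
	return t.next.RoundTrip(retry)
}

// fetchThroughChallenge loads a page from a local server that imitates a
// protected site and returns the body the caller finally receives.
func fetchThroughChallenge() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if _, err := r.Cookie("clearance"); err != nil {
			// No clearance cookie yet: answer with a fake challenge.
			w.Header().Set("Server", "cloudflare")
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, "page content")
	}))
	defer srv.Close()

	client := &http.Client{Transport: &challengeTransport{
		next: http.DefaultTransport,
		solve: func(*http.Response) (*http.Cookie, error) {
			// Stub: a real solver would have to evaluate the challenge page.
			return &http.Cookie{Name: "clearance", Value: "ok"}, nil
		},
	}}

	resp, err := client.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetchThroughChallenge()
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

The calling code is an ordinary `client.Get`; the retry happens inside the transport, so from the caller's side the page is just loaded as usual.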
How can you check that it works? Try loading a page that is under protection. The page used in the config below is behind CloudFlare. Let’s use the following digger config to load this page and extract the vendor, description, and website URL:
---
config:
    debug: 2
    agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36
do:
- walk:
    to: https://www.g2crowd.com/products/essbase/details
    do:
    - find:
        path: div.company-info
        do:
        - object_new: item
        - find:
            path: dl > dt:contains("Vendor") + dd
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: item
                field: vendor
        - find:
            path: dl > dt:contains("Description") + dd
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: item
                field: description
        - find:
            path: dl > dt:contains("Company Website") + dd>a
            do:
            - parse:
                attr: href
            - space_dedupe
            - trim
            - object_field_set:
                object: item
                field: website
        - object_save:
            name: item
The data we get back looks like this:
{
    item : {
        website : "https://www.oracle.com/index.html",
        vendor : "Oracle",
        description : "Oracle Corporation develops, manufactures, markets, hosts, and supports database and middleware software, applications software, and hardware systems."
    }
}
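If you consume this result from Go, a record of this shape maps naturally onto a small struct. The sketch below assumes the data is exported as strict JSON with quoted keys (the listing above shows keys unquoted); the type names `Item` and `Record` and the helper `parseRecord` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Item mirrors the three fields set with object_field_set in the config.
type Item struct {
	Website     string `json:"website"`
	Vendor      string `json:"vendor"`
	Description string `json:"description"`
}

// Record is one saved object, keyed by the name given to object_save.
type Record struct {
	Item Item `json:"item"`
}

// parseRecord decodes one exported record.
// Assumption: the export is strict JSON with quoted keys.
func parseRecord(data []byte) (Record, error) {
	var rec Record
	err := json.Unmarshal(data, &rec)
	return rec, err
}

func main() {
	raw := []byte(`{
		"item": {
			"website": "https://www.oracle.com/index.html",
			"vendor": "Oracle",
			"description": "Oracle Corporation develops, manufactures, markets, hosts, and supports database and middleware software, applications software, and hardware systems."
		}
	}`)

	rec, err := parseRecord(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(rec.Item.Vendor, rec.Item.Website)
}
```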