Evgeniy Solomanidin in CodeProject, Diggernaut Engine, Go, Golang, Programming, Web scraping

JSON to XML, or “transform in 6 seconds.”

Hi folks. I want to share some details about our engine. As you know, it is written in Go. We use a lot of libraries there, and one of them is mxj, an outstanding library for working with XML.

Now I am going to briefly explain how our engine’s json2xml routine works. First, we convert the JSON to a map[string]interface{}, and then feed this object to mxj in the following way: xmlValue, err := mxj.AnyXmlIndent(data, "", "", "body"). After that, we fix the self-closed tags and pass the result on. We used this logic for 3 months, and everything was just fine, until we suddenly needed to parse larger volumes of JSON than usual, and that turned out to be a problem: one of the diggers ran for 8 hours instead of 15 minutes. So we did the necessary research. Processing a single page took 16 minutes, which, for obvious reasons, is unacceptable. It turned out that the page contained 2.5 MB of JSON. The conversion in the mxj library took about 3 minutes, and then some magic happened: the engine went crazy and spent another 13 minutes processing the resulting XML. Of course, we were not happy with that, and we decided to improve mxj first.
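To make the shape of the json2xml routine more concrete, here is a minimal sketch that uses only the Go standard library. It is not the mxj implementation and not our engine's code: writeXML and jsonToXML are hypothetical names, and the way arrays and scalars are rendered is an assumption made for illustration. It only shows the two steps described above: decode the JSON into a map[string]interface{}, then render that map as XML under a "body" root.

```go
package main

import (
	"bytes"
	"encoding/json"
	"encoding/xml"
	"fmt"
	"sort"
)

// writeXML is a hypothetical stand-in for the library call: it renders a
// decoded JSON value as XML elements named tag.
func writeXML(buf *bytes.Buffer, tag string, v interface{}) {
	switch val := v.(type) {
	case map[string]interface{}:
		keys := make([]string, 0, len(val))
		for k := range val {
			keys = append(keys, k)
		}
		sort.Strings(keys) // map iteration order is random; sort for stable output
		buf.WriteString("<" + tag + ">")
		for _, k := range keys {
			writeXML(buf, k, val[k])
		}
		buf.WriteString("</" + tag + ">")
	case []interface{}:
		// Assumption: an array becomes one element per item, all with the same tag.
		for _, item := range val {
			writeXML(buf, tag, item)
		}
	default:
		// Scalars are printed as text and escaped so the XML stays well-formed.
		buf.WriteString("<" + tag + ">")
		xml.EscapeText(buf, []byte(fmt.Sprint(val)))
		buf.WriteString("</" + tag + ">")
	}
}

// jsonToXML decodes raw JSON into a map and renders it under the root tag.
func jsonToXML(raw []byte, root string) (string, error) {
	var data map[string]interface{}
	if err := json.Unmarshal(raw, &data); err != nil {
		return "", err
	}
	var buf bytes.Buffer
	writeXML(&buf, root, data)
	return buf.String(), nil
}

func main() {
	out, err := jsonToXML([]byte(`{"title":"a & b","tags":["x","y"]}`), "body")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	// <body><tags>x</tags><tags>y</tags><title>a &amp; b</title></body>
}
```

In the real engine the second step is the mxj.AnyXmlIndent call shown above; the sketch replaces it only so that the example runs without external dependencies.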

The problem with the mxj library lay in the fact that it used string concatenation. Strings in Go are immutable, so each concatenation allocates a new string and copies both operands into it, and the cost grows with the size of the output built so far. We got around this by writing a few new functions that use bytes.Buffer instead of strings. With this simple change alone, we sped up XML processing in the mxj library by about 180 times. Now it takes less than 1 second to process the same set of data we used before, so we went from 3 minutes to about 1 second.
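The difference between the two approaches can be illustrated with a small sketch. This is not the code from mxj or from our fork; renderConcat and renderBuffer are hypothetical functions that build the same output in both ways, so you can see where the extra allocations come from.

```go
package main

import (
	"bytes"
	"fmt"
)

// renderConcat builds the output with +=. Because strings are immutable,
// every iteration allocates a new string and copies everything written so far.
func renderConcat(values []string) string {
	out := ""
	for _, v := range values {
		out += "<item>" + v + "</item>"
	}
	return out
}

// renderBuffer appends into a single growing bytes.Buffer instead, so
// earlier output is not copied again on every step.
func renderBuffer(values []string) string {
	var buf bytes.Buffer
	for _, v := range values {
		buf.WriteString("<item>")
		buf.WriteString(v)
		buf.WriteString("</item>")
	}
	return buf.String()
}

func main() {
	values := []string{"a", "b", "c"}
	fmt.Println(renderConcat(values) == renderBuffer(values)) // true: same output, different cost
	fmt.Println(renderBuffer(values))                         // <item>a</item><item>b</item><item>c</item>
}
```

With three values the difference is invisible; it only starts to matter when the output reaches megabytes, as it did with our 2.5 MB of JSON.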

During further research we found where we had made a mistake. Our engine expects HTML, and when we work with JSON, some tag names that HTML treats as self-closed (such as img or area) can appear in the XML as standard tags with content, which caused problems. So we made another change to the library that allows us to replace such tags with safe versions. It solved all the issues we had, and the page that previously took 15 minutes to process now takes just 6 seconds.
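The idea of replacing tags with safe versions can be sketched as follows. This is only an illustration, not the API of our modified library: the voidTags list is a subset of the HTML void elements, and both the safeTag function and the "_safe" suffix are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// voidTags is a subset of the HTML void elements, which an HTML parser
// treats as self-closed and therefore never gives any content.
var voidTags = map[string]bool{
	"area": true, "br": true, "hr": true, "img": true,
	"input": true, "link": true, "meta": true,
}

// safeTag returns a name that an HTML parser will handle as an ordinary
// element. The "_safe" suffix is a hypothetical renaming scheme.
func safeTag(name string) string {
	if voidTags[strings.ToLower(name)] {
		return name + "_safe"
	}
	return name
}

func main() {
	fmt.Println(safeTag("img"))   // img_safe
	fmt.Println(safeTag("title")) // title
}
```

A JSON key such as "img" would then be written as an img_safe element, so its nested content survives when the engine parses the result as HTML.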

The repository with the library we modified can be found here.

As a bonus, we wrote a simple converter that allows you to load data from MongoDB and convert it to XML. You can get it here.
