Geospatial Data
Working with addresses
When we collect information from websites, the addresses of companies or objects are often written in free form or placed in a single container. Splitting such an address into parts (house number, street, apartment, city, area, zip code, and country) can take serious effort. To simplify the job, we have implemented support for libpostal, a well-known address parsing library. It is written in C and uses statistical NLP together with open data sets from OSM and OpenAddresses to normalize and parse addresses around the globe.
To parse a postal address, use the address_parse command, which supports the following parameters:
Parameter | Description |
---|---|
address | Postal address you need to parse. |
Example of address parsing.
# SWITCHING TO THE GEO CONTEXT
- geo:
    do:
    # Parse address
    - address_parse:
        address: 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA
        do:
Time | Level | Message |
---|---|---|
2018-07-11 21:05:25:806 | info | Scrape is done |
2018-07-11 21:05:25:792 | debug | Page content: ... |
2018-07-11 21:05:24:760 | info | Retrieving page (POST/JSON): https://geo.diggernaut.net/parse |
2018-07-11 21:05:24:752 | debug | Address: 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA |
2018-07-11 21:05:24:739 | debug | Parsing address |
2018-07-11 21:05:24:728 | info | Starting scrape |
2018-07-11 21:05:24:690 | debug | Setting up surf |
2018-07-11 21:05:24:657 | info | Starting digger: OSM test [2794] |
<html>
<head></head>
<body>
<body_safe>
<body_safe>
<components>
<label>house_number</label>
<value>781</value>
</components>
<components>
<label>road</label>
<value>franklin ave</value>
</components>
<components>
<label>suburb</label>
<value>crown heights</value>
</components>
<components>
<label>city_district</label>
<value>brooklyn</value>
</components>
<components>
<label>city</label>
<value>nyc</value>
</components>
<components>
<label>state</label>
<value>ny</value>
</components>
<components>
<label>postcode</label>
<value>11216</value>
</components>
<components>
<label>country</label>
<value>usa</value>
</components>
<status>success</status>
</body_safe>
</body_safe>
</body>
</html>
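Outside the digger, a result shaped like the one above can be turned into a flat mapping of label to value. The following is a minimal Python sketch, assuming the result has been saved as an XML string with the same `<components>`, `<label>`, `<value>`, and `<status>` elements; the function name `components_to_dict` and the trimmed sample are illustrative and not part of the platform.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modeled on the parsing result shown above (assumption:
# the real output keeps the same element names).
SAMPLE = """
<body_safe>
  <components><label>house_number</label><value>781</value></components>
  <components><label>road</label><value>franklin ave</value></components>
  <components><label>postcode</label><value>11216</value></components>
  <status>success</status>
</body_safe>
"""


def components_to_dict(xml_text):
    """Map each <label> to its <value>; return {} unless status is 'success'."""
    root = ET.fromstring(xml_text.strip())
    if root.findtext(".//status") != "success":
        return {}
    result = {}
    for comp in root.iter("components"):
        label = comp.findtext("label")
        value = comp.findtext("value")
        if label is not None:
            result[label] = value
    return result


parsed = components_to_dict(SAMPLE)
print(parsed["road"])
```

Checking the status element first keeps a failed response from being mistaken for an address that simply has no components.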
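To normalize a postal address, use the address_expand command. It takes the same address parameter and returns the possible normalized forms of the address.
Example of address normalization.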
# SWITCHING TO THE GEO CONTEXT
- geo:
    provider: osm
    do:
    # Normalize address
    - address_expand:
        address: One-hundred twenty E 96th St
        do:
Time | Level | Message |
---|---|---|
2018-07-12 01:58:42:548 | info | Scrape is done |
2018-07-12 01:58:42:530 | debug | Page content: ... |
2018-07-12 01:58:41:317 | info | Retrieving page (POST/JSON): https://geo.diggernaut.net/expand |
2018-07-12 01:58:41:309 | debug | Address: One-hundred twenty E 96th St |
2018-07-12 01:58:41:301 | debug | Normalizing address |
2018-07-12 01:58:41:293 | info | Starting scrape |
2018-07-12 01:58:41:253 | debug | Setting up surf |
2018-07-12 01:58:41:221 | info | Starting digger: OSM test [2794] |
<html>
<head></head>
<body>
<body_safe>
<body_safe>
<expansions>120 east 96th saint</expansions>
<expansions>120 east 96th street</expansions>
<expansions>120 e 96th saint</expansions>
<expansions>120 e 96th street</expansions>
<expansions>120 east 96 saint</expansions>
<expansions>120 east 96 street</expansions>
<expansions>120 e 96 saint</expansions>
<expansions>120 e 96 street</expansions>
<status>success</status>
</body_safe>
</body_safe>
</body>
</html>
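A common use for expansions like these is matching records: two differently written addresses can be treated as the same place if their sets of normalized forms overlap. The sketch below illustrates that idea in Python under stated assumptions. The helper names `expansion_set` and `same_address` are hypothetical, and the second set of expansions is made up for the illustration rather than taken from a real response.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modeled on the normalization result shown above.
SAMPLE = """
<body_safe>
  <expansions>120 east 96th saint</expansions>
  <expansions>120 east 96th street</expansions>
  <expansions>120 e 96th saint</expansions>
  <expansions>120 e 96th street</expansions>
  <status>success</status>
</body_safe>
"""


def expansion_set(xml_text):
    """Collect the text of every <expansions> element into a set."""
    root = ET.fromstring(xml_text.strip())
    return {e.text.strip() for e in root.iter("expansions") if e.text}


def same_address(expansions_a, expansions_b):
    """Treat two addresses as the same if they share any normalized form."""
    return not expansions_a.isdisjoint(expansions_b)


first = expansion_set(SAMPLE)
# Hypothetical expansions of a second record, written here by hand.
second = {"120 east 96th street", "120 east 96 street"}
print(same_address(first, second))  # prints True
```

Comparing whole sets rather than a single "best" expansion avoids having to decide which reading of an abbreviation such as "St" is the right one.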
Next, we will look at a number of complementary commands that can be useful in different situations.