If you need to collect data from Instagram business profiles for your work, you have probably used the mobile application to do it, because some business data was simply not available in the web version. In particular, it was impossible to tell whether you were looking at a business profile or a personal one. Now it is possible to process such profiles automatically with a web scraper that uses the mobile API. We found this solution on the Internet: one of our users wrote it and shared it with the community on a popular Internet marketing resource. Let's examine how the web scraper works.
Article updated on 11.17.2019 due to changes to the API
To use the web scraper, you must specify the login and password for your Instagram account and the list of accounts you want to collect business information about. Bear in mind that using this web scraper may violate Instagram's Terms of Service and Instagram can block your account, so use it at your own risk; we are publishing it for educational purposes only. Below is the actual web scraper code:
---
config:
    agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
    debug: 2
do:
- variable_set:
    field: username
    value: YOUR INSTAGRAM ACCOUNT USERNAME
- variable_set:
    field: password
    value: YOUR INSTAGRAM ACCOUNT PASSWORD
- variable_set:
    field: accounts
    value: LIST OF INSTAGRAM ACCOUNTS, COMMA SEPARATED, YOU WANT TO EXTRACT BUSINESS DATA ABOUT
- walk:
    to: https://www.instagram.com/
    do:
    - find:
        path: body
        do:
        - parse:
            filter: window\._sharedData\s+\=\s+([^;]+);
        - normalize:
            routine: json2xml
        - to_block
        - find:
            path: config>csrf_token
            do:
            - parse
            - variable_set: token
        - walk:
            to:
                post: https://www.instagram.com/accounts/login/ajax/
                headers:
                    x-csrftoken: <%token%>
                    x-instagram-ajax: 1
                    x-requested-with: XMLHttpRequest
                data:
                    username: <%username%>
                    password: <%password%>
            do:
            - find:
                path: status
                do:
                - parse
                - if:
                    match: "fail"
                    do:
                    - cannot_login_probably_checkpoint_is_required
                    - exit
            - find:
                path: authenticated
                do:
                - parse
                - if:
                    match: "true"
                    else:
                    - wrong_login_or_password
                    - exit
            - cookie_get: mid
            - variable_set: mid
            - cookie_get: rur
            - variable_set: rur
            - cookie_get: ds_user_id
            - variable_set: dsuserid
            - cookie_get: sessionid
            - variable_set: sessionid
            - variable_get: accounts
            - to_block
            - split:
                context: text
                delimiter: ','
            - find:
                path: div.splitted
                do:
                - parse
                - space_dedupe
                - trim
                - variable_set: account
                - agent_set: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
                - walk:
                    to: https://www.instagram.com/<%account%>/
                    do:
                    - find:
                        path: script:contains("window._sharedData =")
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - filter:
                            args:
                            - window\._sharedData\s+\=\s+(.+)\s*;\s*$
                        - normalize:
                            routine: json2xml
                        - to_block
                        - find:
                            path: body_safe
                            do:
                            - find:
                                path: entry_data > profilepage > graphql > user > id
                                do:
                                - parse
                                - variable_set: id
                                - agent_set: Mozilla/5.0 (iPhone; CPU iPhone OS 12_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 Instagram 105.0.0.11.118 (iPhone11,8; iOS 12_3_1; en_US; en-US; scale=2.00; 828x1792; 165586599)
                                - walk:
                                    to: https://i.instagram.com/api/v1/users/<%id%>/info/
                                    headers:
                                        X-IG-App-ID: 567067343352427
                                        X-IG-Capabilities: 3brDAw==
                                        X-IG-Connection-Type: WIFI
                                        X-IG-Connection-Speed: 3400
                                        X-IG-Bandwidth-Speed-KBPS: -1.000
                                        X-IG-Bandwidth-TotalBytes-B: 0
                                        X-IG-Bandwidth-TotalTime-MS: 0
                                        Cookie: mid=<%mid%>; csrftoken=<%token%>; rur=<%rur%>; ds_user_id=<%dsuserid%>; sessionid=<%sessionid%>; ig_or=;
                                        X-FB-HTTP-Engine: Liger
                                        Accept: '*/*'
                                        Accept-Language: en-US
                                    do:
                                    - find:
                                        path: body_safe > user
                                        do:
                                        - object_new: item
                                        - find:
                                            path: address_street
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: address_street
                                        - find:
                                            path: category
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: category
                                        - find:
                                            path: city_name
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: city_name
                                        - find:
                                            path: contact_phone_number
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: contact_phone_number
                                        - find:
                                            path: external_url
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: external_url
                                        - find:
                                            path: full_name
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: full_name
                                        - find:
                                            path: is_business
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: is_business
                                        - find:
                                            path: latitude
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: latitude
                                        - find:
                                            path: longitude
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: longitude
                                        - find:
                                            path: pk
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: id
                                        - find:
                                            path: public_email
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: public_email
                                        - find:
                                            path: public_phone_country_code
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: public_phone_country_code
                                        - find:
                                            path: public_phone_number
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: public_phone_number
                                        - find:
                                            path: username
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: username
                                        - find:
                                            path: zip
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: zip
                                        - object_save:
                                            name: item
                - sleep: 5
As you probably already know, the config section is intended for presetting the scraper: here it sets the debug level (which is only required during development and can be omitted) and the User-Agent on whose behalf the web scraper sends requests to the server. The author originally used the Firefox preset, but since Instagram started checking the User-Agent, the scraper now sends a complete User-Agent string. Keep in mind that a server can sometimes return different data depending on the browser name. Complete User-Agent strings to use instead of a preset can be found here.
config:
    agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
    debug: 2
The main logic of the scraper is located in the do section. At the very beginning, variables are initialized with your Instagram login and password and the list of accounts you want to extract:
- variable_set:
    field: username
    value: YOUR INSTAGRAM ACCOUNT USERNAME
- variable_set:
    field: password
    value: YOUR INSTAGRAM ACCOUNT PASSWORD
- variable_set:
    field: accounts
    value: LIST OF INSTAGRAM ACCOUNTS, COMMA SEPARATED, YOU WANT TO EXTRACT BUSINESS DATA ABOUT
Next, the web scraper loads the Instagram homepage and goes into the body tag.
- walk:
    to: https://www.instagram.com/
    do:
    - find:
        path: body
        do:
It parses the entire text, extracts the JavaScript object (JSON), converts it to XML, turns it into a DOM block, and then switches the context to that block.
- parse:
    filter: window\._sharedData\s+\=\s+([^;]+);
- normalize:
    routine: json2xml
- to_block
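If you are more comfortable with a general-purpose language, here is a rough Python equivalent of these steps: a minimal sketch using the requests library, not part of the Diggernaut scraper. Diggernaut converts the JSON to XML and works with it as a DOM; in Python we simply keep it as a dictionary. The credential and account placeholders are assumptions you would fill in yourself.

import json
import re

import requests

# Placeholders (assumptions): fill in your own credentials and account list.
USERNAME = "YOUR INSTAGRAM ACCOUNT USERNAME"
PASSWORD = "YOUR INSTAGRAM ACCOUNT PASSWORD"
ACCOUNTS = "adidas,nike"

session = requests.Session()
session.headers["User-Agent"] = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
)

# Load the homepage and extract the window._sharedData object with
# essentially the same regular expression the scraper uses.
html = session.get("https://www.instagram.com/").text
match = re.search(r"window\._sharedData\s*=\s*([^;]+);", html)
shared_data = json.loads(match.group(1)) if match else {}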
Now our context holds the extracted JavaScript object (JSON) as a DOM, and we can walk through its elements as if it were a standard HTML page. So we find the config node and, inside it, the csrf_token node, parse its content, and extract the token we need to log in to Instagram, saving it to the token variable. Then we log in to Instagram using the token, username, and password we already keep in variables:
- find:
    path: config>csrf_token
    do:
    - parse
    - variable_set: token
- walk:
    to:
        post: https://www.instagram.com/accounts/login/ajax/
        headers:
            x-csrftoken: <%token%>
            x-instagram-ajax: 1
            x-requested-with: XMLHttpRequest
        data:
            username: <%username%>
            password: <%password%>
Next, the scraper checks whether Instagram has authorized us.
- find:
    path: status
    do:
    - parse
    - if:
        match: "fail"
        do:
        - cannot_login_probably_checkpoint_is_required
        - exit
So if not, you will see an error and the scraper stops. If you see this error in the log, try logging in through your browser and resolving the challenge manually; after that, you will be able to sign in to your account from the web scraper. If authorization succeeds, the scraper continues and copies the necessary cookies into variables so they can be used in subsequent requests:
- find:
    path: authenticated
    do:
    - parse
    - if:
        match: "true"
        else:
        - wrong_login_or_password
        - exit
- cookie_get: mid
- variable_set: mid
- cookie_get: rur
- variable_set: rur
- cookie_get: ds_user_id
- variable_set: dsuserid
- cookie_get: sessionid
- variable_set: sessionid
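Continuing the Python sketch from above, the token lookup, login request, response checks, and cookie capture might look like this. Again, this is an illustrative sketch: requests keeps cookies on the session automatically, while Diggernaut stores them in variables because each request there is built explicitly.

# The csrf token sits under config > csrf_token in the shared data.
csrf_token = shared_data.get("config", {}).get("csrf_token")

resp = session.post(
    "https://www.instagram.com/accounts/login/ajax/",
    headers={
        "x-csrftoken": csrf_token,
        "x-instagram-ajax": "1",
        "x-requested-with": "XMLHttpRequest",
    },
    data={"username": USERNAME, "password": PASSWORD},
)
result = resp.json()

# Mirror the scraper's status and authenticated checks.
if result.get("status") == "fail":
    raise SystemExit("cannot login, probably checkpoint is required")
if not result.get("authenticated"):
    raise SystemExit("wrong login or password")

# mid, csrftoken, rur, ds_user_id and sessionid now sit in the cookie jar.
cookies = session.cookies.get_dict()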
Then the scraper reads the variable with the account list into the register, converts the register's text into a block, and switches to that context. This is necessary because the split command works with the contents of a block, not the register. After splitting, the scraper iterates over each account and executes the commands in its do block:
- variable_get: accounts
- to_block
- split:
    context: text
    delimiter: ','
- find:
    path: div.splitted
    do:
Everything that happens next applies to each account in the comma-separated list you passed in. The scraper parses the block containing the account name, clears it of extra spaces, and writes it to a variable so it can be used in requests.
- parse
- space_dedupe
- trim
- variable_set: account
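In the Python sketch, the split-and-trim step is a one-liner:

# Split the comma-separated list and strip stray whitespace, like the
# scraper's split + space_dedupe + trim sequence.
account_names = [name.strip() for name in ACCOUNTS.split(",")]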
Next, the scraper switches back to the desktop User-Agent, loads the account's page, and extracts the account ID, which is needed to call the mobile API. The ID is stored in a variable.
- agent_set: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
- walk:
    to: https://www.instagram.com/<%account%>/
    do:
    - find:
        path: script:contains("window._sharedData =")
        do:
        - parse
        - space_dedupe
        - trim
        - filter:
            args:
            - window\._sharedData\s+\=\s+(.+)\s*;\s*$
        - normalize:
            routine: json2xml
        - to_block
        - find:
            path: body_safe
            do:
            - find:
                path: entry_data > profilepage > graphql > user > id
                do:
                - parse
                - variable_set: id
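A hedged Python equivalent of this step follows. In the raw JSON, the scraper's entry_data > profilepage > graphql > user > id path corresponds to entry_data.ProfilePage[0].graphql.user.id (ProfilePage is a one-element list), and the regular expression here is a simplified approximation of the one above.

# Load the profile page (the desktop User-Agent is still set on the session)
# and dig the numeric user id out of window._sharedData.
account = account_names[0]
html = session.get(f"https://www.instagram.com/{account}/").text
match = re.search(r"window\._sharedData\s*=\s*(.+?);</script>", html)
profile_data = json.loads(match.group(1)) if match else {}
pages = profile_data.get("entry_data", {}).get("ProfilePage") or [{}]
user_id = pages[0].get("graphql", {}).get("user", {}).get("id")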
After that, a request is made to the Instagram mobile API. As you can see, the web scraper masquerades as the mobile application by switching the User-Agent to the official application string and sending specific request headers.
- agent_set: Mozilla/5.0 (iPhone; CPU iPhone OS 12_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 Instagram 105.0.0.11.118 (iPhone11,8; iOS 12_3_1; en_US; en-US; scale=2.00; 828x1792; 165586599)
- walk:
    to: https://i.instagram.com/api/v1/users/<%id%>/info/
    headers:
        X-IG-App-ID: 567067343352427
        X-IG-Capabilities: 3brDAw==
        X-IG-Connection-Type: WIFI
        X-IG-Connection-Speed: 3400
        X-IG-Bandwidth-Speed-KBPS: -1.000
        X-IG-Bandwidth-TotalBytes-B: 0
        X-IG-Bandwidth-TotalTime-MS: 0
        Cookie: mid=<%mid%>; csrftoken=<%token%>; rur=<%rur%>; ds_user_id=<%dsuserid%>; sessionid=<%sessionid%>; ig_or=;
        X-FB-HTTP-Engine: Liger
        Accept: '*/*'
        Accept-Language: en-US
    do:
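In the Python sketch, the same masquerade amounts to sending the application User-Agent and the X-IG-* header values taken from the scraper above; the session's cookie jar already carries the cookies that the scraper passes explicitly in its Cookie header.

MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 12_3_1 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 "
    "Instagram 105.0.0.11.118 (iPhone11,8; iOS 12_3_1; en_US; en-US; "
    "scale=2.00; 828x1792; 165586599)"
)

resp = session.get(
    f"https://i.instagram.com/api/v1/users/{user_id}/info/",
    headers={
        "User-Agent": MOBILE_UA,
        "X-IG-App-ID": "567067343352427",
        "X-IG-Capabilities": "3brDAw==",
        "X-IG-Connection-Type": "WIFI",
        "X-FB-HTTP-Engine": "Liger",
        "Accept": "*/*",
        "Accept-Language": "en-US",
    },
)
# The object the scraper addresses as body_safe > user.
user = resp.json().get("user", {})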
The mobile API returns a response in JSON format. Diggernaut automatically converts it to XML and lets you work with the DOM structure using the standard find command, so all further code picks up the data fields with CSS-like selectors and saves them to the item object.
- object_new: item
- find:
    path: address_street
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: address_street
- find:
    path: category
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: category
- find:
    path: city_name
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: city_name
- find:
    path: contact_phone_number
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: contact_phone_number
- find:
    path: external_url
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: external_url
- find:
    path: full_name
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: full_name
- find:
    path: is_business
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: is_business
- find:
    path: latitude
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: latitude
- find:
    path: longitude
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: longitude
- find:
    path: pk
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: id
- find:
    path: public_email
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: public_email
- find:
    path: public_phone_country_code
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: public_phone_country_code
- find:
    path: public_phone_number
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: public_phone_number
- find:
    path: username
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: username
- find:
    path: zip
    do:
    - parse
    - space_dedupe
    - trim
    - object_field_set:
        object: item
        field: zip
- object_save:
    name: item
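The long chain of find / parse / object_field_set commands boils down to copying plain fields from the returned user object. In the Python sketch it collapses to a dictionary comprehension:

# Fields the scraper copies verbatim; pk is saved under the name id.
FIELDS = [
    "address_street", "category", "city_name", "contact_phone_number",
    "external_url", "full_name", "is_business", "latitude", "longitude",
    "public_email", "public_phone_country_code", "public_phone_number",
    "username", "zip",
]

item = {field: user.get(field) for field in FIELDS}
item["id"] = user.get("pk")
print(item)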
Overall, we think the logic of the web scraper is simple; the only complicated point is masquerading as the mobile application. An example of the data obtained is given below:
{
    item : {
        category : "Product/Service",
        username : "adidas",
        is_business : "true",
        contact_phone_number : "",
        zip : "91074",
        public_phone_number : "",
        longitude : "10.9094251",
        latitude : "49.5831932",
        public_phone_country_code : "",
        full_name : "adidas",
        city_name : "Herzogenaurach",
        address_street : "Adi-Dassler-Str. 1",
        id : "20269764",
        public_email : "",
        external_url : "http://a.did.as/BuiltToDefy"
    }
}
Love this post! But you should also try https://github.com/LevPasha/Instagram-API-python: a kind of API made by LevPasha for non-approved sandbox users, and it is better than the endpoints available to approved sandbox users. Check it out!
Awesome! I'm programming a Chrome extension that enhances the web version of Instagram, and the only thing I was missing was the business account info; I didn't know how to extract it. Thank you very much: thanks to this code I could find out the web endpoint that serves that data. You saved my life =D
Thanks for sharing this! However, Instagram recently made a modification to their API which prevents us from scraping bios now. The issue is that a request to https://i.instagram.com/api/v1/users//info/ now returns: {"message": "useragent mismatch", "status": "fail"}. Would you know how to get around that?
Instagram requires all API calls to originate from the official user agent. We have modified the scraper code, and it should work again now.