Mikhail Sisin in Learning meta-language, Web scraping

How to collect data from Instagram business profiles

If your work requires collecting data from Instagram business profiles, you have probably used the mobile application for it. You had to, because some business data was missing from the web version. In particular, it was impossible to tell whether you were looking at a business profile or a personal one. Now such profiles can be processed automatically with a web scraper that uses the mobile API. We found this solution on the Internet: one of our users wrote it and shared it with the community on a popular Internet marketing resource. Let’s examine how the web scraper works.

Article updated on 11.17.2019 due to changes to the API

To use the web scraper, you must specify the login and password for your Instagram account and the list of accounts you want to collect business information about. Bear in mind that using this web scraper may violate the TOS and Instagram can block your account, so use it at your own risk; we are publishing it for educational purposes only. Below is the actual web scraper code:

---
config:
    agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
    debug: 2
do:
- variable_set:
    field: username
    value: YOUR INSTAGRAM ACCOUNT USERNAME
- variable_set:
    field: password
    value: YOUR INSTAGRAM ACCOUNT PASSWORD
- variable_set:
    field: accounts
    value: LIST OF INSTAGRAM ACCOUNTS, COMMA SEPARATED, YOU WANT TO EXTRACT BUSINESS DATA ABOUT
- walk:
    to: https://www.instagram.com/
    do:
    - find:
        path: body
        do:
        - parse:
            filter: window\._sharedData\s+\=\s+([^;]+);
        - normalize:
            routine: json2xml
        - to_block
        - find:
            path: config>csrf_token
            do:
            - parse
            - variable_set: token
        - walk:
            to:
                post: https://www.instagram.com/accounts/login/ajax/
                headers:
                    x-csrftoken: 
                    x-instagram-ajax: 1
                    x-requested-with: XMLHttpRequest
                data:
                    username: 
                    password: 
            do:
            - find:
                path: status
                do:
                - parse
                - if:
                    match: "fail"
                    do:
                    - cannot_login_probably_checkpoint_is_required
                    - exit
            - find:
                path: authenticated
                do:
                - parse
                - if:
                    match: "true"
                    else:
                    - wrong_login_or_password
                    - exit
                - cookie_get: mid
                - variable_set: mid
                - cookie_get: rur
                - variable_set: rur
                - cookie_get: ds_user_id
                - variable_set: dsuserid
                - cookie_get: sessionid
                - variable_set: sessionid
                - variable_get: accounts
                - to_block
                - split:
                    context: text
                    delimiter: ','
                - find:
                    path: div.splitted
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - variable_set: account
                    - agent_set: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
                    - walk:
                        to: https://www.instagram.com//
                        do:
                        - find:
                            path: script:contains("window._sharedData =")
                            do:
                            - parse
                            - space_dedupe
                            - trim
                            - filter: 
                                args:
                                    - window\._sharedData\s+\=\s+(.+)\s*;\s*$
                            - normalize:
                                routine: json2xml
                            - to_block
                            - find: 
                                path: body_safe 
                                do: 
                            - find:
                                path: entry_data > profilepage > graphql > user > id
                                do:
                                - parse
                                - variable_set: id
                                - agent_set: Mozilla/5.0 (iPhone; CPU iPhone OS 12_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 Instagram 105.0.0.11.118 (iPhone11,8; iOS 12_3_1; en_US; en-US; scale=2.00; 828x1792; 165586599)
                                - walk:
                                    to: https://i.instagram.com/api/v1/users//info/
                                    headers:
                                        X-IG-App-ID: 567067343352427
                                        X-IG-Capabilities: 3brDAw==
                                        X-IG-Connection-Type: WIFI
                                        X-IG-Connection-Speed: 3400
                                        X-IG-Bandwidth-Speed-KBPS: -1.000
                                        X-IG-Bandwidth-TotalBytes-B: 0
                                        X-IG-Bandwidth-TotalTime-MS: 0
                                        Cookie: mid=; csrftoken=; rur=; ds_user_id=; sessionid=; ig_or=;
                                        X-FB-HTTP-Engine: Liger
                                        Accept: '*/*'
                                        Accept-Language: en-US
                                    do:
                                    - find:
                                        path: body_safe > user
                                        do:
                                        - object_new: item
                                        - find:
                                            path: address_street
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: address_street
                                        - find:
                                            path: category
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: category
                                        - find:
                                            path: city_name
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: city_name
                                        - find:
                                            path: contact_phone_number
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: contact_phone_number
                                        - find:
                                            path: external_url
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: external_url
                                        - find:
                                            path: full_name
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: full_name
                                        - find:
                                            path: is_business
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: is_business
                                        - find:
                                            path: latitude
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: latitude
                                        - find:
                                            path: longitude
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: longitude
                                        - find:
                                            path: pk
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: id
                                        - find:
                                            path: public_email
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: public_email
                                        - find:
                                            path: public_phone_country_code
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: public_phone_country_code
                                        - find:
                                            path: public_phone_number
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: public_phone_number
                                        - find:
                                            path: username
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: username
                                        - find:
                                            path: zip
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: zip
                                        - object_save:
                                            name: item
                                        - sleep: 5

As you probably already know, the config section holds the scraper’s presets. Here it sets the debug level (needed only during development and safe to omit) and the browser identity under which the web scraper sends requests to the server. Technically, it could be Chrome or Safari, but the author decided on Firefox. By the way, a server can sometimes return different data depending on the browser name. It may also be necessary to use a complete User-Agent string instead of a preset, as the full listing above does with a Chrome string.

config:
    agent: Firefox
    debug: 2

The main logic of the scraper lives in the do section. At the very beginning, variables are initialized with your Instagram login and password and the list of accounts you want to extract:

- variable_set:
    field: username
    value: YOUR INSTAGRAM ACCOUNT USERNAME
- variable_set:
    field: password
    value: YOUR INSTAGRAM ACCOUNT PASSWORD
- variable_set:
    field: accounts
    value: LIST OF INSTAGRAM ACCOUNTS, COMMA SEPARATED, YOU WANT TO EXTRACT BUSINESS DATA ABOUT

Next, the web scraper loads the Instagram homepage and goes into the body tag.

- walk:
    to: https://www.instagram.com/
    do:
    - find:
        path: body
        do:

It parses all the text and extracts the JavaScript object, converts it to XML, turns it into a DOM block, and then switches to this context.

        - parse:
            filter: window\._sharedData\s+\=\s+([^;]+);
        - normalize:
            routine: json2xml
        - to_block
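
The same extraction step can be sketched outside Diggernaut. The Python below is only an illustration, not part of the scraper: the function name and the sample HTML are made up, and only the regular expression is taken from the listing above. Note that the pattern stops at the first semicolon, so it assumes the JSON itself contains none.

```python
import json
import re

# Same pattern the scraper applies to the page body: capture everything
# between "window._sharedData = " and the first semicolon.
SHARED_DATA = re.compile(r"window\._sharedData\s+=\s+([^;]+);")


def extract_shared_data(html: str) -> dict:
    """Return the embedded JavaScript object as a Python dict."""
    match = SHARED_DATA.search(html)
    if match is None:
        raise ValueError("window._sharedData not found in the page")
    return json.loads(match.group(1))


# Hypothetical, heavily shortened page body used only for this example.
sample_html = (
    '<body><script>window._sharedData = '
    '{"config": {"csrf_token": "abc123"}};</script></body>'
)
shared = extract_shared_data(sample_html)
token = shared["config"]["csrf_token"]
print(token)
```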

The context now holds the extracted JavaScript object (JSON) as a DOM, and we can walk through its elements as if it were a standard HTML page. We find the config node and, inside it, the csrf_token node, parse its content, and get the token needed to log in to Instagram. We save it to the token variable. Then we log in to Instagram using the token, username, and password that we already keep in variables:

        - find:
            path: config>csrf_token
            do:
            - parse
            - variable_set: token
        - walk:
            to:
                post: https://www.instagram.com/accounts/login/ajax/
                headers:
                    x-csrftoken: 
                    x-instagram-ajax: 1
                    x-requested-with: XMLHttpRequest
                data:
                    username: 
                    password:
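
For readers who want to see the shape of this login request in a general-purpose language, here is a minimal Python sketch. It only builds the request object and never sends it. The function name is hypothetical; the URL, header names, and form fields are the ones from the listing above, with the token, username, and password passed in as the text describes.

```python
from urllib.parse import urlencode
from urllib.request import Request

LOGIN_URL = "https://www.instagram.com/accounts/login/ajax/"


def build_login_request(token: str, username: str, password: str) -> Request:
    """Build (but do not send) the POST request used for logging in."""
    headers = {
        "x-csrftoken": token,  # CSRF token taken from the homepage
        "x-instagram-ajax": "1",
        "x-requested-with": "XMLHttpRequest",
    }
    # Form-encoded body with the two credential fields.
    body = urlencode({"username": username, "password": password}).encode()
    return Request(LOGIN_URL, data=body, headers=headers, method="POST")


request = build_login_request("abc123", "my_user", "my_password")
print(request.get_method(), request.full_url)
```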

Next, the scraper checks whether Instagram has authorized us.

            - find:
                path: status
                do:
                - parse
                - if:
                    match: "fail"
                    do:
                    - cannot_login_probably_checkpoint_is_required
                    - exit

If not, you will see an error and the scraper stops. If you see this error in the log, try logging in through your browser and resolving the challenge manually. After that, you’ll be able to sign in to your account from the web scraper. If the authorization succeeds, the scraper continues and copies the necessary cookies into variables so that they can be used in later requests:

            - find:
                path: authenticated
                do:
                - parse
                - if:
                    match: "true"
                    else:
                    - wrong_login_or_password
                    - exit
                - cookie_get: mid
                - variable_set: mid
                - cookie_get: rur
                - variable_set: rur
                - cookie_get: ds_user_id
                - variable_set: dsuserid
                - cookie_get: sessionid
                - variable_set: sessionid
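
The two checks above can be summarized in a few lines of Python. This is a sketch under the assumption that the login response is a JSON object with the status and authenticated fields the scraper looks for; the function name and the returned labels are made up for illustration.

```python
import json


def classify_login_reply(response_text: str) -> str:
    """Mirror the scraper's two checks on the login response."""
    reply = json.loads(response_text)
    # First check: a "fail" status usually means a checkpoint is required.
    if reply.get("status") == "fail":
        return "checkpoint_required"
    # Second check: anything other than authenticated == true is a bad login.
    if reply.get("authenticated") is not True:
        return "wrong_login_or_password"
    return "ok"


print(classify_login_reply('{"status": "ok", "authenticated": true}'))
```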

Then the scraper reads the variable with the list of accounts into the register, converts the text in the register to a block, and switches to this context. This is done so that the split command can be used, since it works on the contents of the block, not the register. After splitting, the scraper iterates over the accounts and executes the commands in the do block for each one:

                - split:
                    context: text
                    delimiter: ','
                - find:
                    path: div.splitted
                    do:

Everything that happens next applies to every account listed in the comma-separated string you passed. The scraper parses the block that contains the account name, strips extra spaces, and writes the name to a variable so it can be used in requests.

                    - parse
                    - space_dedupe
                    - trim
                    - variable_set: account
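
As a rough equivalent of the split, space_dedupe, and trim steps, the following Python sketch turns the comma-separated string into a clean list of account names. The function name and the sample input are illustrative only.

```python
def split_accounts(raw: str) -> list:
    """Split a comma-separated string into cleaned account names."""
    accounts = []
    for part in raw.split(","):
        # Collapse runs of whitespace and strip both ends,
        # like space_dedupe followed by trim.
        name = " ".join(part.split())
        if name:  # skip empty entries such as a trailing comma
            accounts.append(name)
    return accounts


print(split_accounts(" adidas , nike,  puma ,"))
```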

The scraper loads the account’s profile page to extract the account ID, because the ID is needed to call the mobile API. The ID is stored in a variable.

                    - walk:
                        to: https://www.instagram.com//
                        do:
                        - find:
                            path: script:contains("window._sharedData")
                            do:
                            - parse
                            - space_dedupe
                            - trim
                            - filter: 
                                args:
                                    - window\._sharedData\s+\=\s+(.+)\s*;\s*$
                            - normalize:
                                routine: json2xml
                            - to_block
                            - find: 
                                path: body_safe 
                                do: 
                            - find:
                                path: entry_data > profilepage > graphql > user > id
                                do:
                                - parse
                                - variable_set: id
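
A comparable lookup in Python might look like the sketch below. It reuses the regular expression from the listing, but the JSON shape is an assumption: it treats ProfilePage as a list whose first element holds graphql, user, and id, following the path the scraper walks after the json2xml conversion. The function name and the sample script text are hypothetical.

```python
import json
import re

# Pattern from the listing: capture everything after "window._sharedData ="
# up to the final semicolon of the script text.
PROFILE_DATA = re.compile(r"window\._sharedData\s+=\s+(.+)\s*;\s*$")


def extract_user_id(script_text: str) -> str:
    """Return the account ID embedded in the profile page script."""
    match = PROFILE_DATA.search(script_text.strip())
    if match is None:
        raise ValueError("window._sharedData not found in the script")
    data = json.loads(match.group(1))
    # Assumed structure: entry_data -> ProfilePage[0] -> graphql -> user -> id
    return data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["id"]


# Hypothetical, shortened script content used only for this example.
sample_script = (
    'window._sharedData = {"entry_data": {"ProfilePage": '
    '[{"graphql": {"user": {"id": "20269764"}}}]}};'
)
print(extract_user_id(sample_script))
```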

After that, a request is made to the Instagram mobile API. As we can see, the web scraper masquerades as the mobile application by sending specific request headers.

                                - walk:
                                    to: https://i.instagram.com/api/v1/users//info/
                                    headers:
                                        X-IG-App-ID: 567067343352427
                                        X-IG-Capabilities: 3brDAw==
                                        X-IG-Connection-Type: WIFI
                                        X-IG-Connection-Speed: 3400
                                        X-IG-Bandwidth-Speed-KBPS: -1.000
                                        X-IG-Bandwidth-TotalBytes-B: 0
                                        X-IG-Bandwidth-TotalTime-MS: 0
                                        Cookie: mid=; csrftoken=; rur=; ds_user_id=; sessionid=; ig_or=;
                                        X-FB-HTTP-Engine: Liger
                                        Accept: '*/*'
                                        Accept-Language: en-US
                                    do:
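
To make the structure of this request easier to see, here is a hedged Python sketch that assembles the URL and headers without sending anything. Two things are assumptions rather than facts from the listing: that the account ID goes between users/ and /info/ in the URL, and that the Cookie header is filled from the values saved after login. The helper names and sample values are made up.

```python
def mobile_info_url(user_id: str) -> str:
    """URL of the mobile API endpoint (ID placement is an assumption)."""
    return f"https://i.instagram.com/api/v1/users/{user_id}/info/"


def build_cookie_header(cookies: dict) -> str:
    """Join cookies as 'name=value;' pairs separated by spaces."""
    return " ".join(f"{name}={value};" for name, value in cookies.items())


def build_mobile_headers(cookies: dict) -> dict:
    """Headers from the listing, plus the assembled Cookie header."""
    return {
        "X-IG-App-ID": "567067343352427",
        "X-IG-Capabilities": "3brDAw==",
        "X-IG-Connection-Type": "WIFI",
        "X-FB-HTTP-Engine": "Liger",
        "Accept": "*/*",
        "Accept-Language": "en-US",
        "Cookie": build_cookie_header(cookies),
    }


# Placeholder cookie values; real ones come from the login step.
session = {"mid": "m1", "csrftoken": "abc123", "sessionid": "s1"}
print(mobile_info_url("20269764"))
print(build_mobile_headers(session)["Cookie"])
```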

The mobile API returns a response in JSON format. Diggernaut automatically converts it to XML and lets you work with the DOM structure using the standard find command. All the remaining code picks up the data using specific CSS selectors and saves it to the data object.

                                        - object_new: item
                                        - find:
                                            path: biography
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: biography
                                        - find:
                                            path: follower_count
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: follower_count
                                        - find:
                                            path: address_street
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: address_street
                                        - find:
                                            path: category
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: category
                                        - find:
                                            path: city_name
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: city_name
                                        - find:
                                            path: contact_phone_number
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: contact_phone_number
                                        - find:
                                            path: external_url
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: external_url
                                        - find:
                                            path: full_name
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: full_name
                                        - find:
                                            path: is_business
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: is_business
                                        - find:
                                            path: latitude
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: latitude
                                        - find:
                                            path: longitude
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: longitude
                                        - find:
                                            path: pk
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: id
                                        - find:
                                            path: public_email
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: public_email
                                        - find:
                                            path: public_phone_country_code
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: public_phone_country_code
                                        - find:
                                            path: public_phone_number
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: public_phone_number
                                        - find:
                                            path: username
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: username
                                        - find:
                                            path: zip
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: item
                                                field: zip
                                        - object_save:
                                            name: item
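
The repeated find, parse, and object_field_set pattern boils down to copying a fixed set of fields into one record. A compact Python sketch of that idea follows; it assumes the user object has already been parsed from the API response into a dict, and the function name and sample data are illustrative. As in the listing, pk is stored under the name id.

```python
FIELDS = [
    "address_street", "category", "city_name", "contact_phone_number",
    "external_url", "full_name", "is_business", "latitude", "longitude",
    "pk", "public_email", "public_phone_country_code",
    "public_phone_number", "username", "zip",
]


def build_item(user: dict) -> dict:
    """Copy the known fields from the API's user object into a flat item."""
    item = {}
    for field in FIELDS:
        value = user.get(field)
        if value is None:
            continue  # field absent in the response, like a find with no match
        if isinstance(value, bool):
            text = "true" if value else "false"
        else:
            # Collapse whitespace and strip, like space_dedupe + trim.
            text = " ".join(str(value).split())
        item["id" if field == "pk" else field] = text
    return item


# Hypothetical fragment of a response, used only for this example.
sample_user = {"pk": 20269764, "username": "adidas", "is_business": True,
               "city_name": "  Herzogenaurach ", "zip": None}
print(build_item(sample_user))
```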

Overall, we think the logic of the web scraper is simple; the only complicated part is masquerading as a mobile application. An example of the data obtained is given below:

{
  "item": {
    "category": "Product/Service",
    "username": "adidas",
    "is_business": "true",
    "contact_phone_number": "",
    "zip": "91074",
    "public_phone_number": "",
    "longitude": "10.9094251",
    "latitude": "49.5831932",
    "public_phone_country_code": "",
    "full_name": "adidas",
    "city_name": "Herzogenaurach",
    "address_street": "Adi-Dassler-Str. 1",
    "id": "20269764",
    "public_email": "",
    "external_url": "http://a.did.as/BuiltToDefy"
  }
}


Mikhail Sisin: Co-founder of cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.
