What is web scrapping?

Web scrapping is the automation of web browser.

Yes, it is that simple, Web scrapping is the automation of web browser, where we write some codes and navigate some pages, click buttons, download reports.

We can use any programming language and some web browsers to do this automation. In this book we will using Excel VBA and Internet explorer.

Further in this book, we will use the term web automation instead of Web scrapping. In VBA web automation is very simple.

What are the certain steps to do web scrapping?

If We think navigating a website manually, we will notice that there are only few operations that We do. Those are below.

  1. We navigate a link to open a website.
  2. We click some button to navigate some other pages.
  3. We access some controls like Textboxes, Dropdowns, List Boxes etc.
  4. We click on hyperlinks.
  5. We download some files.
  6. We save the pages and pdf files etc.

There are some functions and procedures that we can use of internet explorer by using VBA to automate all these things. Let see them all by example.

What is DOM?

First, we need to understand the object model of Internet explorer, which is extremely simple.

The Basics

We should have basic knowledge of VBA to understand the code showing in this article. If you haven’t yet read my Quick Guide for VBA, Download it for free from here.

https://www.itchat.in/2018/08/09/vba-tutorial/

As We can see that Application is the parent object of all other object. So first, we need to create object of Internet Explorer Application.

How to create IE Object?

There are two ways to create objects in VBA

  1. Early Binding – For early binding, we will add reference of two libraries in our VBA project.

For creating object by early binding, we will use the code showing below

Public objIE As InternetExplorer

Sub InitializeIE()
 Set objIE = Nothing
 Set objIE = New InternetExplorer
End Sub

  1. Late Binding – For using late binding, we don’t need to add any references in VBA project. we will use createObject method to create object of internet explorer application.

Sub InitializeIE()
 Set objIE = Nothing
  Set objIE = CreateObject(“Internetexplorer.Application”)
End Sub

Code showing above will create an object of internet explorer that we will use to navigate our website and do other operations. When we create object by any of the methods, it creates it in the memory, we will need to make the object visible to the windows. We will use code below to make it visible.

ObjIE.Visible=True

Understanding Website pattern.

Now we have your Explorer ready and visible. Let’s try to automate the process of scrapping all blog post from https://www.itchat.in

Navigating a page – First of all we need to identify the first page URL we navigate in our visible internet explorer (it is not necessary to make IE visible though). In this case we have this URL.

https://www.itchat.in

Every website has its own structure, So, prior to writing a program, we need to study the website by navigating through its pages and see what address it is showing in the IE address bar.

Usually, when you hit a button, it will either navigate to another page or will run a JavaScript. We need to identify the pattern of what is happening when we click any link or button on a web page.

For example, if we navigate itchat.in and click on the page number at the bottom, we see that address bar is showing addresses like below –

Click on page 2 and address will be – https://www.itchat.in/page/2/

Click on page 3 and address will be – https://www.itchat.in/page/3/

And So on….

Now, we know the pattern to follow, but we don’t know how many pages are there. That is what we will scrap first before following this pattern in our program. Let’s see how we can do that.

Navigating Web pages

See how to navigate a page in the code showing below.

Sub DoAutomation ()
  InitializeIE
  objIE.navigate “https://www.itchat.in/
  WaitUntilReady
End Sub

Waiting on IE

Notice that we are calling a program WaitUntilReady in about program, we are calling this because when we navigate a page, browse take little time to load the page.

See the program showing below. This will run an empty loop and do nothing until the browser is ready. See the constant READYSTATE_COMPLETE, we can type this keyword only when we are doing early binding else it will be treated as an empty variable.

Sub WaitUntilReady ()
  Do: Loop Until objIE.readyState <> READYSTATE_COMPLETE
End Sub

If we are using late binding, we will replace this constant with 4.

Now, the important part is that sometime above function for wait don’t really work, because some of the elements just doesn’t get loaded, reason can be anything. E.g browser doesn’t support the element or internet speed is slower.

In that case we need to use our custom wait program, which is as below, in this program we pass the number of seconds we want to wait after navigating a page.

Sub WaitTime (blnSeconds As Byte)
  Dim strWaitTime As String
  strWaitTime = CDate (blnSeconds / 86400)
  Application.Wait Now + TimeValue(strWaitTime)
End Sub

Next step is to extract page numbers to navigate in a loop. Now let’s try understanding the ways we can get elements loaded on the webpage into our VBA program.

Setting HTML Document

First, we need a HTML document object. So that we can load entire page into an object. Best way is to declare this object as public variable, like we declare IE. See the bold lines in program below.

Public objIE As InternetExplorer

Public objDoc As HTMLDocument

Sub DoAutomation ()
  InitializeIE
  objIE.navigate “https://www.itchat.in/”
  WaitUntilReady
  Set objDoc = objIE.document
End Sub

Getting Elements

When we do Late binding, we will declare every object as Object.

Next step is to get the element, there are few ways we can get the element object from document object.

  1. GetElementByID()
  2. GetElementsByClassName()
  3. GetElementsByName()
  4. GetElementsByTagName()

Notice that, only first method is saying GetElement and rest other are saying GetElements, because there can only be one element with one ID. But there can be multiple elements with same name and in the same name class or tag name.

So, when we use the methods except first, it will return an array of elements and we need to identify which one we will use. Most of the time we can get elements by Class name. if website is really programmed in good manner, we can get element by ID too.

Let’s see how we can know the ID, Class, Name or Tag Name of any element.

Go to the IE and hit F12 or right click anywhere on the webpage and click inspect element.

It will open IE console and will show the document HTML script. That looks like this.

Notice that, it has multiple tabs. We are concern with only DOM Explorer. Also see that three-little button at the top left. these buttons will help us select the element we want to access.

Click on the first button and then move the cursor on the element we want to access and then click there. It will take us to the HTML Line of that element.

Have a look at the screenshot below, I click on the first button (Select Element Button) and then I click on the text Page 1 to 5 and it took me to the HTML code.

Here you will notice that it is showing class name “pagination-meta” this is what we will access in our VBA program.

Have a look at the program below.

.

.

Set objEle = objDoc.getElementsByClassName(“pagination-meta”)
lngPageNumber = Split(objEle(0).innerHTML, ” “)(3)

.

.

Elements properties

Notice that, we are using GetElementsByClassName(), it means it will return an array with all element, even it there is only one element in that class.

Look at the bold syntax in above code, we are accessing the first element of that array. We can use InnerHTML property to get the text.

InnerHTML property is not always used to get the text. So, when we are not sure about getting the text. We will use Watch Window.

Watch Window is the feature of VBE that let us add any object into it and it shows all property and their current value.

To add any object in Watch window, just right click on the object variable and select Add Watch.

Once object is added to the watch windows, we can now explore the properties here.

Let’s move forward to achieve our goal of scrapping all posts from the site. Now we know how many pages there are to navigate, we will loop from1 to Nth page number.

For lngR = 1 To lngPageNumber
  objIE.navigate “https://www.itchat.in/page/” & lngR
  WaitTime 3
  Set objDoc = objIE.document
Next lngR

Every time we navigate a page, we need to set the document to load current page in VBA. See the bold line in above program.

We have successfully written the program to load the pages, Now, we need to go through each post entry and populate them in excel. Let’s try to find out the element for the post entry.

We will use the same method. Click on the first button (Select Element) in IE Console and then click on any of the post entry. It will show the screen showing below.

It is very easy to spot which method we will use to get element, as we see the class name here, we will again use getElementsByClassName and it will return all element that is added to this class into an array

It means that array will have elements of all post entries.

Set objEle = objDoc.getElementsByClassName(“post-title entry-title”)
For Each objEles In objEle
  strHyperLink = objEles.innerHTML
  strHyperLink = Split(strHyperLink, “href=”)(1)
  strHyperLink = Split(strHyperLink, “rel=”)(0)
  strHyperLink = Replace(strHyperLink, “”””, “”)
  Sheet1.Range(“A” & lngIndex + 1).Value = objEles.innerText
  Sheet1.Range(“B” & lngIndex + 1).Hyperlinks.Add Sheet1.Range(“B” & lngIndex + 1), Trim(strHyperLink), , “”, “Click here”
  lngIndex = lngIndex + 1
Next objEles

Above code will loop through each post entries and populate cells in excel with entry name and the it’s hyperlink.

What we did above was the basic example of web scrapping, most of the time trick above will work for you.

Now, lets see how we can control the form elements on websites by VBA. Let’s try to fill a form and hit submit button to subscribe for the newsletter.

When we work with form on web application. There are good chances that we will get ID of the elements. In this example we will get the element by ID.

Populating form on website

We already know the first couple steps which are as follow –

  1. Create IE object and make it visible
  2. Navigate the site and wait until it is loaded, in this example website is “https://www.itchat.in/”
  3. Get HTML Document in an object.
  4. Then get the element in object, see below.

Sub PopulateForm()

  InitializeIE
  objIE.navigate “https://www.itchat.in/”
  WaitTime 5
  Set objDoc = objIE.document
  objDoc.getElementById(“avia_0_1”).innerText = InputBox(“Email”)
  objDoc.getElementById(“avia_1_1”).innerText = InputBox(“First Name”)
  objDoc.getElementById(“avia_2_1”).innerText = InputBox(“Last Name”)
  objDoc.getElementById(“avia_5_1”).innerText = InputBox(“Birthday”)
  objDoc.getElementsByClassName(“button”)(2).Click

End Sub

See the above program, that is how you can access the text box and populate values and then click a button, hard work in web scrapping is only to find the element and the right properties to read and write values to those elements.

Once our work is done, let’s not forget to close the IE windows and flush that out of memory. Use the following at the end of program.

objIE.Quit
set objIE = Nothing

Hope you have a good idea of web scrapping now.

Now, we will see some tips and trick that can make your life easy when you do web scrapping.

Copy Paste from web browser

Use this code to copy data from web browser and then you can paste it excel, this code will copy the HTML elements.

Sub SelectAllAndCopyPaste()

  InitializeIE
  objIE.navigate “https://www.itchat.in/”
  WaitTime 5
  objIE.ExecWB OLECMDID_SELECTALL, OLECMDEXECOPT_DODEFAULT ‘– 17
  objIE.ExecWB OLECMDID_COPY, OLECMDEXECOPT_DODEFAULT ‘– 12

End Sub

SaveAs Document

Sometime happen when browser is showing a pdf or document and we want to save that in our computer, we can use this code to download it.

Sub OpenSaveAsDialogBox()

  Dim strPath As String
  Dim lngWindowId

  strPath = “C:\Data”
  InitializeIE
  objIE.navigate “https://www.itchat.in/”
  WaitTime 5
  objIE.ExecWB OLECMDID_SAVEAS, OLECMDEXECOPT_DODEFAULT
  WaitTime 1
  SendKeys strPath
  WaitTime 1
  SendKeys “{ENTER}”

End Sub

Running a JavaScript

Sometimes, we have a button that is running a javascript when we click on it. JavaScript take input from the form object element and process the information. It doesn’t always create a link so that we can navigate. So, we need to run this javascript by our VBA script. This is how we do it.

objIE.Document.parentWindow.execScript “JavaScriptGoeshere”

we can find this JavaScript function name in the Document object by same method we find class name etc.

we hope you get a good idea of web automation. We are open for any query regarding VBA. You can contact @ excel-consulting@itchat.in

We tried to make program as correct as possible. But in case you find anything not working. You are most welcome to connect with us.

If you still haven’t joined our awesome Excel Power User WhatsApp Group, you are most welcome, follow the link below to join

https://www.itchat.in/2017/10/23/excel-vba-whatsapp-group/

Join Our Premium Group

Because learning right is important 

Download Quick Guide for VBA Free!!