CasperJS for crawling and scraping the Thailand Yellow Pages

I was looking for a website that lists company contacts in Thailand for email marketing, and a friend pointed me to the Thai Yellow Pages website. When I searched for companies in a certain category, it returned a long list of companies with many duplicated results…

http://www.yellowpages.co.th/en/ypsearch?q=Hospitals&w=Bangkok

So my thinking was: how can I get all this data into CouchDB, so I can use views to search it and reduce the results…

CasperJS (a scripting utility that drives a headless browser such as PhantomJS) does a good job of crawling data and getting past JavaScript-heavy pages. The way CasperJS runs is similar to a test suite: you lay out the process in code and let it run to see the results. I think some people use it for E2E testing, but as far as I know, Protractor with Selenium WebDriver is more suitable for AngularJS development. For E2E testing, people usually go with whatever tool is popular for their framework, so my understanding and usage of CasperJS is basically crawling and scraping data like a real user.

Scraping the data from one page is not too difficult (the API is well documented on the CasperJS website), so if you like JavaScript you should not have much trouble getting along with it.
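As a rough illustration of a single-page scrape, here is a minimal sketch. The `.listing`, `.name` and `.phone` selectors and the `extractListings` helper are my own placeholders, not the real markup of the site, so they would need to be replaced after inspecting the page. The `typeof phantom` guard is only there so the helper can also be loaded outside CasperJS.

```javascript
// Runs inside the page via casper.evaluate(), so it can only use the
// page's own `document`, not variables from this script.
// ASSUMPTION: '.listing', '.name' and '.phone' are hypothetical selectors.
function extractListings() {
  var rows = document.querySelectorAll('.listing');
  return Array.prototype.map.call(rows, function (row) {
    var name = row.querySelector('.name');
    var phone = row.querySelector('.phone');
    return {
      name: name ? name.textContent.trim() : '',
      phone: phone ? phone.textContent.trim() : ''
    };
  });
}

// Only wire up CasperJS when the script is actually run by it.
if (typeof phantom !== 'undefined') {
  var casper = require('casper').create();

  casper.start('http://www.yellowpages.co.th/en/ypsearch?q=Hospitals&w=Bangkok', function () {
    // evaluate() returns the serialized result of extractListings().
    var listings = this.evaluate(extractListings);
    this.echo(JSON.stringify(listings, null, 2));
  });

  // With no callback, run() exits once all steps are done.
  casper.run();
}
```

Saved as, say, `scrape.js`, this would be started with `casperjs scrape.js` and print the extracted rows as JSON.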

OK, this is straightforward for what we need. Going into more detail, you will need to wait for certain DOM elements to show up and so on, but CasperJS has enough functions for this and you should be able to find them in the documentation.
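Waiting for an element can be sketched with `waitForSelector()`, which takes a selector, a success callback, an optional timeout callback and an optional timeout in milliseconds. The selector, the timeout value and the `whenRowsReady` and `countRows` helpers below are assumptions for illustration only.

```javascript
// ASSUMPTION: hypothetical selector and timeout; adjust to the real page.
var ROW_SELECTOR = '.listing';
var WAIT_TIMEOUT_MS = 10000;

// Runs inside the page via evaluate(); counts the matching rows.
function countRows(selector) {
  return document.querySelectorAll(selector).length;
}

// Must be invoked as whenRowsReady.call(casper, onReady) so that
// `this` is the Casper object. onReady receives the row count.
function whenRowsReady(onReady) {
  this.waitForSelector(ROW_SELECTOR, function () {
    onReady.call(this, this.evaluate(countRows, ROW_SELECTOR));
  }, function () {
    // Called instead of the success callback if the rows never show up.
    this.echo('Timed out waiting for ' + ROW_SELECTOR);
  }, WAIT_TIMEOUT_MS);
}

// Only wire up CasperJS when the script is actually run by it.
if (typeof phantom !== 'undefined') {
  var casper = require('casper').create();
  casper.start('http://www.yellowpages.co.th/en/ypsearch?q=Hospitals&w=Bangkok');
  whenRowsReady.call(casper, function (count) {
    this.echo(count + ' rows found');
  });
  casper.run();
}
```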

So my question was: how can I make it click through to the next page and run the same process again?

Building a list of the URLs for each page and looping over it is one way, but this time I wanted to click the next button, since the pagination is driven by JavaScript.

Here is the approach I found works well for crawling and repeating the same process.

In the real use case I do a little more, but the main point is that run() can be called multiple times to continue the same crawling process. The key is to wrap each process in a function and call it with .call(this), where 'this' is the CasperJS object itself. As far as I know, running CasperJS is like running a single browser, and you cannot have multiple CasperJS instances in one command. You can run several different CasperJS scripts with separate commands, but you can use only one CasperJS object inside a script.
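The idea can be sketched roughly as follows. This is not the code from the repository, only a minimal reconstruction of the pattern: each process is a plain function invoked with `.call(this)`, and the completion callback passed to `run()` queues the next round and calls `run()` again. The selectors, `MAX_PAGES` and all function names are hypothetical. It also assumes that calling `start()` without a URL re-arms the step queue while keeping the page that is already loaded, which is worth verifying against your CasperJS version.

```javascript
// ASSUMPTION: hypothetical selectors and limit; adjust to the real page.
var START_URL = 'http://www.yellowpages.co.th/en/ypsearch?q=Hospitals&w=Bangkok';
var ROW_SELECTOR = '.listing';
var NEXT_SELECTOR = 'a.next';
var MAX_PAGES = 50; // safety cap so the crawl always terminates

var results = [];
var pagesVisited = 0;
var hasNext = false;

// Runs inside the page via evaluate(); returns the text of each row.
function readRows(selector) {
  var rows = document.querySelectorAll(selector);
  return Array.prototype.map.call(rows, function (row) {
    return row.textContent.trim();
  });
}

// Queues the steps that scrape whatever page is currently loaded.
// Invoke as scrapePage.call(casper) so `this` is the Casper object.
function scrapePage() {
  this.waitForSelector(ROW_SELECTOR, function () {
    results = results.concat(this.evaluate(readRows, ROW_SELECTOR));
    pagesVisited++;
    // Remember whether a next button exists for the check() callback.
    hasNext = this.exists(NEXT_SELECTOR);
  });
}

// Queues a click on the next button, then a scrape of the new page.
function goToNextPage() {
  this.start(); // ASSUMPTION: no-URL start() resets the queue, keeps the page
  this.thenClick(NEXT_SELECTOR);
  scrapePage.call(this);
}

// Completion callback for run(): queue another round and run again,
// or print the results and stop.
function check() {
  if (hasNext && pagesVisited < MAX_PAGES) {
    goToNextPage.call(this);
    this.run(check);
  } else {
    this.echo(JSON.stringify(results, null, 2));
    this.exit();
  }
}

// Only wire up CasperJS when the script is actually run by it.
if (typeof phantom !== 'undefined') {
  var casper = require('casper').create();
  casper.start(START_URL);
  scrapePage.call(casper);
  casper.run(check);
}
```

One caveat with JavaScript pagination: the old rows may still match `ROW_SELECTOR` right after the click, so a real crawler may need to wait for something that changes between pages, such as the active page number.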

Below is the code for the Yellow Pages crawler I created.

https://github.com/h-nasu/yellowpages
