I was looking for a website that lists company contacts in Thailand for email marketing. A friend introduced me to the Thai Yellow Pages website for this. After I searched for companies in a certain category, it showed a long list of companies with duplicated results…
http://www.yellowpages.co.th/en/ypsearch?q=Hospitals&w=Bangkok
So my thinking was: how can I get all this data into CouchDB, so I can search it with views and reduce the results…
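To make the "view and reduce" idea concrete, here is a rough sketch of a CouchDB design document. It assumes each crawled company is stored as a document with a `name` field; that field, the `companies` design document name and the `by_name` view name are assumptions made up for illustration, not taken from the real crawler.

```javascript
// Map function: it runs inside CouchDB, where emit() is provided by the view server.
// Assumes documents look like { name: 'Some Hospital', ... } (hypothetical shape).
var byName = function (doc) {
    if (typeof doc.name === 'string') {
        // Normalise the key so "ABC Hospital " and "abc hospital" end up under one key
        emit(doc.name.trim().toLowerCase(), 1);
    }
};

// Design document to save in the database (PUT /<db>/_design/companies)
var designDoc = {
    _id: '_design/companies',
    views: {
        by_name: {
            map: byName.toString(), // CouchDB stores view functions as strings
            reduce: '_count'        // built-in reduce: number of rows per key
        }
    }
};

console.log(JSON.stringify(designDoc, null, 2));
```

Querying the view with `GET /<db>/_design/companies/_view/by_name?group=true` should then return each name once, with the number of times it was crawled as the value.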
CasperJS (a scripting utility that drives a headless browser, PhantomJS) does a good job of crawling data and getting past JavaScript challenges. A CasperJS script runs much like a test suite: you lay out the process (code) and let it run to see the results. I think some people use it for E2E (end-to-end) testing, but as far as I know Protractor with Selenium WebDriver is more suitable for AngularJS development. For E2E testing people usually go with whatever tool is popular for their framework, so my understanding and usage of CasperJS is basically crawling and scraping data like a real user.
Scraping data from a single page is not too difficult (the API is well documented on the CasperJS website), so if you like JavaScript you should have no trouble getting along with it.
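```javascript
var casper = require('casper').create();

// Placeholders: the page to crawl and the CouchDB database URL (adjust to your setup)
var url = 'http://www.yellowpages.co.th/en/ypsearch?q=Hospitals&w=Bangkok';
var couchdbUrl = 'http://localhost:5984/companies';
var data = '';

// Beginning of the CasperJS process for the URL to crawl
casper.start(url);

// First process, after the web page is open
casper.then(function () {
    data = this.getTitle();
});

// Second process
casper.then(function () {
    // Send data to CouchDB
    var options = {
        data: data
        // and some more options (e.g. method, headers)
    };
    this.open(couchdbUrl, options).then(function () {
        // Finish the CasperJS process
        this.exit();
    });
});

// This will trigger CasperJS to run the process above
casper.run();
```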
OK, this is straightforward for what we need. In more detail, you will need to wait for certain DOM elements to show up and so on, but CasperJS has enough functions for this and you should be able to find them in the documentation.
So my question was: how can I make it click through to the next page and run the same process again?
Building a list of URLs for each page and looping over it is one way, but this time I wanted to click the next button instead, because the pages are paginated by JavaScript.
Here is what I figured out works well for crawling and running the same process again.
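As an illustration of waiting for an element, here is a minimal sketch built on `waitForSelector()` from the CasperJS API. The helper name `titleWhenReady`, the `.result-list` selector and the 10-second timeout are placeholders I made up for the example; the real selector depends on the page markup.

```javascript
// Queue a step that waits until `selector` exists, then hands the page title to `onReady`.
// `casper` is expected to be an instance created with require('casper').create().
function titleWhenReady(casper, selector, onReady) {
    casper.waitForSelector(selector, function () {
        // Runs once the element is present; `this` is the casper instance
        onReady(this.getTitle());
    }, function () {
        // Runs if the element does not show up within the timeout
        this.echo('Timed out waiting for ' + selector);
    }, 10000); // timeout in milliseconds
}

// Usage inside a CasperJS script:
//   casper.start(url);
//   titleWhenReady(casper, '.result-list', function (title) { data = title; });
//   casper.run();
```

Keeping the wait in a small helper means the same step can be reused on every page of the results.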
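For comparison, the URL-list approach could look roughly like the sketch below. Note that the `page` query parameter and the helper name `buildPageUrls` are hypothetical; I have not checked how the site actually numbers its result pages.

```javascript
// Build the list of result-page URLs up front.
// NOTE: the `page` query parameter is hypothetical; check how the target site numbers its pages.
function buildPageUrls(baseUrl, pageCount) {
    var urls = [];
    for (var page = 1; page <= pageCount; page++) {
        urls.push(baseUrl + '&page=' + page);
    }
    return urls;
}

var urls = buildPageUrls('http://www.yellowpages.co.th/en/ypsearch?q=Hospitals&w=Bangkok', 3);
console.log(urls.join('\n'));

// In a CasperJS script each URL can then be queued as its own step:
//   casper.start();
//   urls.forEach(function (pageUrl) {
//       casper.thenOpen(pageUrl, function () { /* scrape this page */ });
//   });
//   casper.run();
```

This only works when every page has its own URL and the page count is known, which is why clicking the next button is the better fit for JavaScript pagination.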
```javascript
var casper = require('casper').create();

// Placeholders: the page to crawl and the CouchDB database URL (adjust to your setup)
var url = 'http://www.yellowpages.co.th/en/ypsearch?q=Hospitals&w=Bangkok';
var couchdbUrl = 'http://localhost:5984/companies';
var data = '';

// Beginning of the CasperJS process for the URL to crawl
casper.start(url).then(function () {
    first.call(this);
});

// First process, after the web page is open
function first() {
    data = this.getTitle();
    second.call(this);
}

// Second process
function second() {
    var options = {
        data: data
        // and some more options (e.g. method, headers)
    };
    this.open(couchdbUrl, options).then(function () {
        // Do not exit here; run the process again with check() as the callback to move on
        this.run(check);
    });
}

// Check whether to continue from the first process
function check() {
    if (this.exists('.next-page')) {
        this.click('.next-page');
        first.call(this);
    } else {
        this.exit();
    }
}

// This will trigger CasperJS to run the process above
casper.run(check);
```
In my real use case I do a little more, but this covers how run() can be called multiple times to continue the same process for crawling data. The key is to wrap each process in a function and call it with .call(this), where the 'this' variable is the CasperJS object itself. As far as I know, running CasperJS is like running a single browser, and you cannot have multiple CasperJS instances in one command. You can run several different CasperJS scripts with separate commands, but you can use only one CasperJS object inside a script.
Below is the code for the Yellow Pages crawler I created.
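To see why .call(this) matters, here is a small plain-JavaScript sketch that runs under Node without CasperJS. `fakeCasper` is a stand-in object I invented for the illustration; it only mimics `getTitle()`.

```javascript
// Stand-in for the CasperJS object (hypothetical; it only mimics getTitle())
var fakeCasper = {
    getTitle: function () { return 'Hospitals in Bangkok'; }
};

var data = '';

// Same shape as first() in the crawler: it relies on `this` being the casper object
function first() {
    data = this.getTitle();
}

// .call() style: explicitly pass the object that should become `this`
first.call(fakeCasper);
console.log(data); // prints "Hospitals in Bangkok"

// A plain call loses that binding, so this.getTitle is not available and it throws
var failed = false;
try {
    first();
} catch (e) {
    failed = true;
}
console.log(failed); // prints true
```

Inside a CasperJS step callback `this` is already the casper object, so passing it along with .call(this) keeps every wrapped function working on the same instance.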