阿布云


Resumable Mode — WebCollector Tutorial

Posted by 阿布云

What is resumable mode?

Resumable mode lets you resume a crawl that has stopped, whether it finished normally or was interrupted unexpectedly. The restarted crawler continues from the history data left by the previous crawl.

Resumable mode is disabled by default. When it is disabled, the history data, which records which URLs have been fetched successfully and which have not been fetched yet, is deleted at the beginning of the Crawler.start(int round) method, the method that starts the crawler. The restarted crawler therefore ignores the history left by the previous crawl and fetches pages that have already been downloaded.

For example, a BreadthCrawler instance stores its history data in a specified folder. In non-resumable mode, that folder is deleted every time you call the Crawler.start(int round) method, and a new folder containing empty history data is created in its place to serve as the crawler's history store. The BreadthCrawler instance then injects the seeds into the empty history data and starts the iterative crawling process. The history data produced by that crawl is cleared the next time Crawler.start(int round) is invoked. As a result, the BreadthCrawler instance starts a completely new crawling task every time you call Crawler.start(int round).
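The folder lifecycle described above can be modeled with a small, self-contained sketch. This is not WebCollector code: the class HistoryFolderSketch, its methods, and the marker-file layout are all hypothetical, and it only imitates the "delete the folder on start unless resumable" behavior using the standard library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical stand-in for a crawler's history folder; not WebCollector code. */
final class HistoryFolderSketch {
    private final Path dir;

    HistoryFolderSketch(Path dir) {
        this.dir = dir;
    }

    /** Imitates what happens to the history folder when a crawl starts. */
    void start(boolean resumable) {
        try {
            if (!resumable && Files.exists(dir)) {
                List<Path> paths;
                try (Stream<Path> walk = Files.walk(dir)) {
                    // Deepest paths first, so files are deleted before their parent folder.
                    paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
                }
                for (Path p : paths) {
                    Files.delete(p);
                }
            }
            // A folder always exists afterwards; it is empty if it was just deleted.
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Records one fetched page as an empty marker file (illustrative only). */
    void markFetched(String name) {
        try {
            Files.createFile(dir.resolve(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Number of recorded entries in the folder. */
    long size() {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"),
                "history-sketch-" + System.nanoTime());
        HistoryFolderSketch history = new HistoryFolderSketch(dir);

        history.start(false);               // first crawl: fresh, empty folder
        history.markFetched("page-a");

        history.start(true);                // resumable restart keeps the record
        System.out.println(history.size()); // prints 1

        history.start(false);               // non-resumable restart wipes it
        System.out.println(history.size()); // prints 0
    }
}
```

The point of the sketch is only the branch in start: with resumable set to false, whatever an earlier run recorded is gone before the new run begins.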

How to enable resumable mode?

To enable resumable mode, just add crawler.setResumable(true) before you start the crawling task:


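Crawler crawler;
...
// Enable resumable mode before starting, so the existing history data is kept.
crawler.setResumable(true);
// xxx stands for the number of rounds passed to start(int round).
crawler.start(xxx);

Notice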

There are a few things to mention about resumable mode:

If you invoke the Crawler.start(int round) method in non-resumable mode, all of your history data will be deleted. Keep your crawler in resumable mode whenever you do not want to lose that data.

Resumable mode is not applicable to RamCrawler.

Make sure your crawler uses the same crawl path as the previous crawling task, so that it finds the folder holding that task's history data.
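One way to catch a mistyped crawl path before restarting is a small pre-flight check. The sketch below is an assumption, not part of WebCollector: the class CrawlPathCheck and the method hasHistory are hypothetical, "crawl" is a placeholder path, and the check only tests whether the folder exists and is non-empty. It knows nothing about how the history data is actually laid out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Hypothetical pre-flight check; not part of WebCollector. */
final class CrawlPathCheck {

    /** True if crawlPath is an existing folder with at least one entry. */
    static boolean hasHistory(Path crawlPath) {
        if (!Files.isDirectory(crawlPath)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(crawlPath)) {
            return entries.findAny().isPresent();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "crawl" is a placeholder; pass the crawl path used by the previous run.
        Path crawlPath = Paths.get(args.length > 0 ? args[0] : "crawl");
        System.out.println(crawlPath + " has history: " + hasHistory(crawlPath));
    }
}
```

If the check returns false for the path you are about to reuse, the path is probably different from the one the previous task used, or its history has already been deleted.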