Web crawlers have become popular tools for collecting large portions of the web, which can be used for many tasks ranging from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of the tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper we describe Lumbricus webis (L. webis for short), a modular crawling infrastructure built to mine data from the .it domain and the portions of the web reachable from it. The purpose of our crawler is to support the gathering of advanced statistics and advanced analytic tools on the content of the Italian Web. This paper describes the architectural features of L. webis and its performance. L. webis can currently download a mid-sized ccTLD such as ".it" in about one week.

International Conference on Machine Learning and Cybernetics, Guilin, 2011

