Lumbricus webis: a parallel and distributed crawling architecture for the Italian web

Web crawlers have become popular tools for
gathering large portions of the web that can be used for many
tasks, from statistics to structural analysis of the web. Due to
the amount of data and the heterogeneity of tasks to manage,
it is essential for crawlers to have a modular and distributed
architecture. In this paper we describe Lumbricus webis (L.webis
for short), a modular crawling infrastructure built to mine data
from the .it ccTLD and from portions of the web
reachable from this domain. Its purpose is to support advanced
statistics gathering and advanced analytic tools on the content
of the Italian Web. We present the architectural
features of L.webis and its performance. L.webis can currently
download a mid-sized ccTLD such as “.it” in about one week.

