Medium sized crawling made fast and easy through Lumbricus webis

Web crawlers have become popular tools for collecting large portions of the web that can be used for many tasks, from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper we describe Lumbricus webis (L. webis for short), a modular crawling infrastructure built to mine data from the .it domain and portions of the web reachable from it. The purpose of our crawler is to support the gathering of advanced statistics and advanced analytic tools on the content of the Italian Web. This paper describes the architectural features of L. webis and its performance. L. webis can currently download a mid-sized ccTLD such as ".it" in about one week.

International Conference on Machine Learning and Cybernetics, Guilin, 2011

IIT authors:

Claudio Felicioli

Type: Article in refereed international conference proceedings
Discipline area: Computer Science & Engineering

Activity: Algorithmics for web technologies