WLI: a Web application for Language Identification and evaluation of available tools

Web Language Identifier (WLI) is a service that, starting from the URL of a Web page or from plain text, and exploiting a pool of language identification tools, returns a set of candidate languages, each with a confidence score. The currently embedded tools are Chromium Compact Language Detector, Lingua::Identify, and a simple identifier based on HTML attributes. The service can be used through a Web application or via an API. To evaluate the identifiers globally, we constructed a test set of Web pages extracted from 146 Wikipedia projects. This also allows WLI to be used as a service for comparing language identification tools in terms of supported languages and precision of the results. The charts summarizing the comparison can be viewed in the WLI Web application. We plan to extend the service so that users can add their own identifiers.
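The pooling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the WLI implementation: the function names (`html_attribute_identifier`, `identify`), the use of the `<html lang="...">` attribute as the only HTML signal, and the averaging of scores across tools are all assumptions made for illustration, since the abstract does not state how WLI combines the tools' outputs.

```python
from collections import defaultdict
from html.parser import HTMLParser


class _LangAttrParser(HTMLParser):
    """Collects the lang attribute of the <html> element, if present."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def html_attribute_identifier(page):
    """Hypothetical identifier that trusts the <html lang="..."> attribute."""
    parser = _LangAttrParser()
    parser.feed(page)
    if not parser.lang:
        return {}
    # Keep only the primary subtag, e.g. "it-IT" -> "it".
    return {parser.lang.split("-")[0].lower(): 1.0}


def identify(page, identifiers):
    """Assumed combination rule: average each language's score over the pool.

    Each identifier is a callable mapping a page to {language: confidence}.
    Returns (language, score) pairs sorted by decreasing score.
    """
    totals = defaultdict(float)
    for identifier in identifiers:
        for lang, score in identifier(page).items():
            totals[lang] += score
    ranked = [(lang, total / len(identifiers)) for lang, total in totals.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Under these assumptions, any other tool (for example a wrapper around a statistical detector) can join the pool as long as it exposes the same callable interface, which mirrors the stated plan of letting users add their own identifiers.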


IIT authors:

Type: TR Technical reports
Discipline area: Computer Science & Engineering
IIT TR-18/2012

File: TR-18-2012.pdf

Activity: Multilingual Web