|Autor:||M. Keller, M. Nussbaumer||Links:||Download des PDF aus der ACM Digital Library|
|Quelle:||1st International Workshop on Large Scale Network Analysis, Proceedings of the 21nd International World Wide Web Conference (WWW2012), pp. 1025-1034, Lyon, France, April 2012|
The foundation of almost all web sites' information architecture is a hierarchical content organization. Thus information architects put much effort in designing taxonomies that structure the content in a comprehensible and sound way. The taxonomies are obvious to human users from the site's system of main and sub menus. But current methods of web structure mining are not able to extract these central aspects of the information architecture. This is because they cannot interpret the visual encoding to recognize menus and their rank as humans do. In this paper we show that a web site's main navigation system can not only be distinguished by visual features but also by certain structural characteristics of the HTML tree and the web graph. We have developed a reliable and scalable solution that solves the problem of extracting menus for mining the information architecture. The novel MenuMiner-algorithm allows retrieving the original content organization of large-scale web sites. These data are very valuable for many applications, e.g. the presentation of search results. In an experiment we applied the method for finding site boundaries within a large domain. The evaluation showed that the method reliably delivers menus and site boundaries where other current approaches fail.