Copyright © 2008 The Institute of Electronics, Information and Communication Engineers
Special Section on Knowledge-Based Software Engineering -- Letter
An Informative DOM Subtree Identification Method from Web Pages in Unfamiliar Web Sites
The authors are with the Department of Knowledge-based Information Engineering, Toyohashi University of Technology, Toyohashi-shi, 441–8580 Japan. E-mail: tsuruta{at}smlab.tutkie.tut.ac.jp; sakai{at}smlab.tutkie.tut.ac.jp; masuyama{at}tutkie.tut.ac.jp
Abstract
We propose a method for identifying informative DOM subtrees in a Web page from an unfamiliar Web site. Our method uses the layout data of DOM nodes generated by a generic Web browser. The results show that our method outperforms a baseline method and identifies informative DOM subtrees from Web pages robustly.
Key Words: informative region identification, Web document, DOM, layout analysis
Manuscript received July 2, 2007. Manuscript revised October 17, 2007.