Copyright © 2008 The Institute of Electronics, Information and Communication Engineers
Special Section on Knowledge-Based Software Engineering -- Letter
An Informative DOM Subtree Identification Method from Web Pages in Unfamiliar Web Sites
The authors are with the Department of Knowledge-based Information Engineering, Toyohashi University of Technology, Toyohashi-shi, 441-8580 Japan. E-mail: tsuruta{at}smlab.tutkie.tut.ac.jp; sakai{at}smlab.tutkie.tut.ac.jp; masuyama{at}tutkie.tut.ac.jp
We propose a method for identifying informative DOM (Document Object Model) subtrees in a Web page from an unfamiliar Web site. The method uses the layout data of DOM nodes generated by a generic Web browser. The results show that our method outperforms a baseline method and identifies informative DOM subtrees from Web pages robustly.
Key Words: informative region identification, Web document, DOM, layout analysis
Manuscript received July 2, 2007. Manuscript revised October 17, 2007.