Copyright © 2008 The Institute of Electronics, Information and Communication Engineers
Special Section on Foundations of Computer Science -- Papers -- Scoring Algorithms |
Improvements of HITS Algorithms for Spam Links
1 The author is with the Department of Information Sciences, Faculty of Science and Engineering, Tokyo Denki University, Saitama-ken, 350–0394 Japan. E-mail: asano{at}j.dendai.ac.jp, 2 The author is with Mobile Entertainment Category, Victor Company of Japan, Limited, Maebashi-shi, 371–8543 Japan., 3 The author is with the Graduate School of Information Sciences, Tohoku University, Sendai-shi, 980–8579 Japan.
The HITS algorithm proposed by Kleinberg is one of the representative methods of scoring Web pages by using hyperlinks. In the days when the algorithm was proposed, most of the pages given high score by the algorithm were really related to a given topic, and hence the algorithm could be used to find related pages. However, the algorithm and the variants including Bharat's improved HITS, abbreviated to BHITS, proposed by Bharat and Henzinger cannot be used to find related pages any more on today's Web, due to an increase of spam links. In this paper, we first propose three methods to find "linkfarms," that is, sets of spam links forming a densely connected subgraph of a Web graph. We then present an algorithm, called a trust-score algorithm, to give high scores to pages which are not spam pages with a high probability. Combining the three methods and the trust-score algorithm with BHITS, we obtain several variants of the HITS algorithm. We ascertain by experiments that one of them, named TaN+BHITS using the trust-score algorithm and the method of finding linkfarms by employing name servers, is most suitable for finding related pages on today's Web. Our algorithms take time and memory no more than those required by the original HITS algorithm, and can be executed on a PC with a small amount of main memory.
Key Words: scoring algorithm, Web pages, HITS, BHITS, PageRank, search engine, Web graph, spam page, spam links
Manuscript received January 25, 2007. Manuscript revised June 3, 2007.
Reference
[1] Alta Vista, Digital Equipment Corporation, http://altavista.digital.com/ [2] Y. Asano, Y. Tezuka, and T. Nishizeki, "HITS on today's Web," IEICE Technical Report, COMP2005-62, 2006. [3] K. Bharat and M.R. Henzinger, "Improved algorithms for topic distillation in a hyperlinked environment," Proc. 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.104–111, 1998. [4] A. Borodin, G.O. Roberts, J.S. Rosenthal, and P. Tsaparas, "Finding authorities and hubs from link structures on the World Wide Web," Proc. 10th International World Wide Web Conference, pp.415–429, 2001. [5] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Proc. 7th International World Wide Web Conference, pp.14–18, 1998. [6] S. Chakrabarti, B. Dom, D. Gibson, J. Keinberg, P. Raghavan, and S. Rajagopalan, "Automatic resource compilation by analyzing hyperlink structure and associated text," Proc. 7th International World Wide Web Conference, pp.65–74, 1998. [7] A.C. Carvalho, P. Chirita, E. Moura, P. Calado, and W. Nejdl, "Site level noise removal for search engines," Proc. 15th International World Wide Web Conference, pp.73–82, 2006. [8] D. Fetterly, M. Manasse, and M. Najork, "Spam, damn spam, and statistics: Using statistical analysis to locate spam Web pages," Proc. 7th International Workshop on the Web and Databases, pp.1–6, 2004. [9] D. Fetterly, M. Manasse, M. Najork, and A. Ntoulas, "Detecting spam Web pages through content analysis," Proc. 15th International World Wide Web Conference, pp.83–92, 2006. [10] Google, Google Inc., http://www.google.com/ [11] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen, "Combating Web spam with TrustRank," Proc. 30th International Conference on Very Large Databases, pp.576–587, 2004. [12] G. Jeh and J. Widom, "SimRank: A measure of structual-context similarity," Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.538–543, 2002. [13] J. Kleinberg, "Authoritative sources in a hyperlinked environment," Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pp.668–677, 1998. [14] R. Lempel and S. Moran, "The stochastic approach for link-structure analysis (SALSA) and the tkc effect," Proc. 9th International World Wide Web Conference, pp.387–401, 2000. [15] L. Li, Y. Shang, and W. Zhang, "Improvement of HITS-based algorithms on Web documents," Proc. 11th International World Wide Web Conference, pp.527–535, 2002. [16] S. Nomura, S. Oyama, T. Hayamizu, and T. Ishida, "Analysis and improvement of HITS algorithm for detecting Web communities," Proc. 2002 Symposium on Applications and the Internet, pp.132–140, 2002. [17] M. Wang, "A significant improvement to Clever algorithm in hyperlinked environment," Poster Proc. 11th International World Wide Web Conference, 2002. [18] X. Wang, Z. Lu, and A. Zhou, "Topic exploration and distillation for web search by a similarity-based analysis," Proc. 3rd International Conference on Web-Age Information Management, pp.316–327, 2002. [19] B. Wu and B.D. Davison, "Identifying link farm spam pages," Proc. 14th International World Wide Web Conference, pp.820–829, 2005. [20] Yahoo!, Yahoo! Corporation, http://www.yahoo.com/ [21] Y. Zhang, J.X. Yu, and J. Hou, Web Communities: Analysis and Construction, Springer, Berlin, 2006.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This Article ![]()
![]()
Abstract
![]()
Full Text (PDF)
![]()
Alert me when this article is cited
![]()
Alert me if a correction is posted
![]()
Services ![]()
![]()
Email this article to a friend
![]()
Similar articles in this journal
![]()
Alert me to new issues of the journal
![]()
Add to My Personal Archive
![]()
Download to citation manager
![]()
Request Permissions
![]()
Google Scholar ![]()
![]()
Articles by ASANO, Y.
![]()
Articles by NISHIZEKI, T.
![]()
Search for Related Content
![]()
Social Bookmarking ![]()
![]()
What's this?