Skip Navigation

IEICE Transactions on Information and Systems 2008 E91-D(2):200-208; doi:10.1093/ietisy/e91-d.2.200
This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Request Permissions
Google Scholar
Right arrow Articles by ASANO, Y.
Right arrow Articles by NISHIZEKI, T.
Right arrow Search for Related Content
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Copyright © 2008 The Institute of Electronics, Information and Communication Engineers

Special Section on Foundations of Computer Science -- Papers -- Scoring Algorithms

Improvements of HITS Algorithms for Spam Links

Yasuhito ASANO1, Yu TEZUKA2 and Takao NISHIZEKI3

1 The author is with the Department of Information Sciences, Faculty of Science and Engineering, Tokyo Denki University, Saitama-ken, 350–0394 Japan. E-mail: asano{at}j.dendai.ac.jp, 2 The author is with Mobile Entertainment Category, Victor Company of Japan, Limited, Maebashi-shi, 371–8543 Japan., 3 The author is with the Graduate School of Information Sciences, Tohoku University, Sendai-shi, 980–8579 Japan.

The HITS algorithm proposed by Kleinberg is one of the representative methods of scoring Web pages by using hyperlinks. In the days when the algorithm was proposed, most of the pages given high score by the algorithm were really related to a given topic, and hence the algorithm could be used to find related pages. However, the algorithm and the variants including Bharat's improved HITS, abbreviated to BHITS, proposed by Bharat and Henzinger cannot be used to find related pages any more on today's Web, due to an increase of spam links. In this paper, we first propose three methods to find "linkfarms," that is, sets of spam links forming a densely connected subgraph of a Web graph. We then present an algorithm, called a trust-score algorithm, to give high scores to pages which are not spam pages with a high probability. Combining the three methods and the trust-score algorithm with BHITS, we obtain several variants of the HITS algorithm. We ascertain by experiments that one of them, named TaN+BHITS using the trust-score algorithm and the method of finding linkfarms by employing name servers, is most suitable for finding related pages on today's Web. Our algorithms take time and memory no more than those required by the original HITS algorithm, and can be executed on a PC with a small amount of main memory.

Key Words: scoring algorithm, Web pages, HITS, BHITS, PageRank, search engine, Web graph, spam page, spam links


Manuscript received January 25, 2007. Manuscript revised June 3, 2007.

Reference

[1] Alta Vista, Digital Equipment Corporation, http://altavista.digital.com/

[2] Y. Asano, Y. Tezuka, and T. Nishizeki, "HITS on today's Web," IEICE Technical Report, COMP2005-62, 2006.

[3] K. Bharat and M.R. Henzinger, "Improved algorithms for topic distillation in a hyperlinked environment," Proc. 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.104–111, 1998.

[4] A. Borodin, G.O. Roberts, J.S. Rosenthal, and P. Tsaparas, "Finding authorities and hubs from link structures on the World Wide Web," Proc. 10th International World Wide Web Conference, pp.415–429, 2001.

[5] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Proc. 7th International World Wide Web Conference, pp.14–18, 1998.

[6] S. Chakrabarti, B. Dom, D. Gibson, J. Keinberg, P. Raghavan, and S. Rajagopalan, "Automatic resource compilation by analyzing hyperlink structure and associated text," Proc. 7th International World Wide Web Conference, pp.65–74, 1998.

[7] A.C. Carvalho, P. Chirita, E. Moura, P. Calado, and W. Nejdl, "Site level noise removal for search engines," Proc. 15th International World Wide Web Conference, pp.73–82, 2006.

[8] D. Fetterly, M. Manasse, and M. Najork, "Spam, damn spam, and statistics: Using statistical analysis to locate spam Web pages," Proc. 7th International Workshop on the Web and Databases, pp.1–6, 2004.

[9] D. Fetterly, M. Manasse, M. Najork, and A. Ntoulas, "Detecting spam Web pages through content analysis," Proc. 15th International World Wide Web Conference, pp.83–92, 2006.

[10] Google, Google Inc., http://www.google.com/

[11] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen, "Combating Web spam with TrustRank," Proc. 30th International Conference on Very Large Databases, pp.576–587, 2004.

[12] G. Jeh and J. Widom, "SimRank: A measure of structual-context similarity," Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.538–543, 2002.

[13] J. Kleinberg, "Authoritative sources in a hyperlinked environment," Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pp.668–677, 1998.

[14] R. Lempel and S. Moran, "The stochastic approach for link-structure analysis (SALSA) and the tkc effect," Proc. 9th International World Wide Web Conference, pp.387–401, 2000.

[15] L. Li, Y. Shang, and W. Zhang, "Improvement of HITS-based algorithms on Web documents," Proc. 11th International World Wide Web Conference, pp.527–535, 2002.

[16] S. Nomura, S. Oyama, T. Hayamizu, and T. Ishida, "Analysis and improvement of HITS algorithm for detecting Web communities," Proc. 2002 Symposium on Applications and the Internet, pp.132–140, 2002.

[17] M. Wang, "A significant improvement to Clever algorithm in hyperlinked environment," Poster Proc. 11th International World Wide Web Conference, 2002.

[18] X. Wang, Z. Lu, and A. Zhou, "Topic exploration and distillation for web search by a similarity-based analysis," Proc. 3rd International Conference on Web-Age Information Management, pp.316–327, 2002.

[19] B. Wu and B.D. Davison, "Identifying link farm spam pages," Proc. 14th International World Wide Web Conference, pp.820–829, 2005.

[20] Yahoo!, Yahoo! Corporation, http://www.yahoo.com/

[21] Y. Zhang, J.X. Yu, and J. Hou, Web Communities: Analysis and Construction, Springer, Berlin, 2006.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Request Permissions
Google Scholar
Right arrow Articles by ASANO, Y.
Right arrow Articles by NISHIZEKI, T.
Right arrow Search for Related Content
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?