Please use this identifier to cite or link to this item: http://hdl.handle.net/11452/33097
Title: Character n-gram application for automatic new topic identification
Authors: Uludağ University/Faculty of Engineering/Department of Industrial Engineering.
0000-0003-0159-8529
Çağlar, Burcu Gençosman
Özmutlu, Hüseyin Cenk
Özmutlu, Seda
AAH-4480-2021
ABH-5209-2020
AAG-8600-2021
56263661900
6603061328
6603660605
Keywords: Content-ignorant algorithms
The Levenshtein edit-distance
New topic identification
The character n-gram method
Pre-processed spelling correction methods
Neural-network applications
Web
Categorization
Computer science
Information science & library science
Behavioral research
Search engines
Errors
Internet
Edit distance
Topic identification
Internet-based applications
Spelling correction
Minimizing the number of
Search engine performance
N-gram methods
Network methodologies
Algorithms
Issue Date: 26-Jun-2014
Publisher: Elsevier
Citation: Çağlar, B. G. et al. (2014). "Character n-gram application for automatic new topic identification". Information Processing and Management, 50(6), 821-856.
Abstract: The widespread availability of the Internet and the variety of Internet-based applications have resulted in a significant increase in the number of web pages. Determining the behaviors of search engine users has become a critical step in enhancing search engine performance. Search engine user behaviors can be determined by content-based or content-ignorant algorithms. Although many content-ignorant studies have been performed to automatically identify new topics, previous results have demonstrated that spelling errors can cause significant errors in topic shift estimates. In this study, we focused on minimizing the number of wrong estimates caused by spelling errors. We developed a new hybrid algorithm combining character n-gram and neural network methodologies, and compared the experimental results with results from previous studies. For the FAST and Excite datasets, the proposed algorithm improved topic shift estimates by 6.987% and 2.639%, respectively. Moreover, we analyzed the performance of the character n-gram method from several perspectives, including a comparison with the Levenshtein edit-distance method. The experimental results demonstrated that the character n-gram method outperformed the Levenshtein edit-distance method in terms of topic identification.
URI: https://doi.org/10.1016/j.ipm.2014.06.005
https://www.sciencedirect.com/science/article/pii/S0306457314000521
http://hdl.handle.net/11452/33097
ISSN: 0306-4573
1873-5371
Appears in Collections: Scopus
Web of Science
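
The abstract compares character n-gram matching with Levenshtein edit distance as ways to keep a misspelled query from being mistaken for a topic shift. The record does not give the authors' algorithm, so the following is only a minimal sketch of the two generic measures. It assumes trigrams, space padding, and a Dice coefficient over n-gram sets; the function names are hypothetical and the neural network component of the hybrid method is not shown.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string.

    Padding and n=3 are illustrative assumptions, not taken from the paper.
    """
    padded = " " + text.lower().strip() + " "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over character n-gram sets (1.0 means identical sets)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty queries are treated as identical
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                  # delete ch_a
                curr[j - 1] + 1,              # insert ch_b
                prev[j - 1] + (ch_a != ch_b)  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # A misspelled repeat of the same query versus an unrelated query.
    for previous, current in [("levenshtein", "levensthein"),
                              ("weather", "python")]:
        print(previous, current,
              round(ngram_similarity(previous, current), 3),
              levenshtein(previous, current))
```

In a sketch like this, a high n-gram similarity between consecutive queries would be read as a hint that the second query is a respelling of the first and not a new topic. Any decision threshold would have to be tuned on real query logs.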

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.