An immense volume of digital documents exists online and offline with content that can offer useful information and insights. Utilizing topic modeling enhances the analysis and understanding of digital documents. Topic modeling discovers latent semantic structures, or topics, within a set of digital textual documents. Internet of Things, blockchain, recommender system, and search engine optimization applications use topic modeling to handle data mining tasks such as classification and clustering. The usefulness of a topic model depends on the quality of the term patterns and topics it produces. Topic coherence is the standard metric for measuring the quality of topic models. Previous studies build topic models for conventional documents; these models underperform on web content data because conventional and HTML documents differ in structure. Neglecting the unique structure of web content leads to missing otherwise coherent topics and, therefore, to low topic quality. This study aims to propose an innovative topic model that learns coherent topics in web content data. We present the HTML Topic Model (HTM), a web content topic model that takes HTML tags into consideration to understand the structure of web pages.

Cite this article: Altarturi HHM, Saadoon M, Anuar NB. Web content topic modeling using LDA and HTML tags. PeerJ Computer Science 9:e1459.

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
2 Department of Software Engineering, Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Kuala Lumpur, Malaysia
DOI: 10.7717/peerj-cs.1459
Academic Editor: Daniel de Oliveira
Subject Areas: Data Mining and Machine Learning, World Wide Web and Web Science, Text Mining
Keywords: HTML topic model, HTM, Topic modeling, Topic models comparison, LDA, HTML tags, Web content mining, Web topic modeling, Generative model
Copyright © 2023 Altarturi et al.