Performance of TF-IDF for Text Classification Reviews on Google Play Store: Shopee

Najwa Umaira Che Mohd Safawi; Nur Amalina Shafie

doi:10.24191/jcrinn.v9i2.410

Authors

Najwa Umaira Che Mohd Safawi, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA Negeri Sembilan, Seremban Campus, Negeri Sembilan, Malaysia
Nur Amalina Shafie, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA Negeri Sembilan, Seremban Campus, Negeri Sembilan, Malaysia

Keywords:

TF-IDF, Text Classification, Shopee, feature extraction, text normalization

Abstract

TF-IDF (term frequency-inverse document frequency) is a feature extraction technique used in text classification. It extracts features by weighting each term by its frequency within a document and by its inverse document frequency across the collection. Feature extraction methods differ in performance, so it is necessary to determine the most appropriate approach for accurately classifying user reviews of the Shopee application in order to enhance the user experience in Malaysia. This study aims to assess the efficacy of TF-IDF in text classification tasks, analyze its advantages and disadvantages, and identify the specific conditions under which it performs well. The study uses a dataset of Shopee customer reviews collected from the Google Play Store as its main data source. The methodology pre-processes the text data with a text normalization procedure that includes removing stop words, removing Unicode characters, and lemmatizing. Logistic Regression, Support Vector Machine, and Decision Tree classifiers are trained and evaluated using both feature extraction approaches. The research notes that the efficacy of feature extraction approaches may differ depending on the specific dataset and task being considered. Subsequent studies might examine alternative feature extraction methods and assess their efficacy across various domains and datasets.
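To make the weighting described in the abstract concrete, the following is a minimal sketch of TF-IDF computed over a few normalized reviews. It is not the study's implementation: the stop-word list, the sample reviews, the function names `normalize` and `tfidf`, and the particular weighting variant (relative term frequency times ln(N / df)) are all assumptions chosen for illustration. Lemmatization and the classifier step are omitted.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; the study's actual list is not given.
STOP_WORDS = {"the", "is", "a", "and", "to", "this", "it", "very"}


def normalize(text):
    """Lowercase, drop non-ASCII characters, tokenize, and remove stop words."""
    ascii_text = text.lower().encode("ascii", "ignore").decode("ascii")
    tokens = re.findall(r"[a-z]+", ascii_text)
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = ln(N / df(t)), where df(t) is the number of documents containing t
    This is one common variant; libraries often apply extra smoothing.
    """
    tokenized = [normalize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Made-up sample reviews, not taken from the study's dataset.
reviews = [
    "The app is fast and easy to use",
    "Delivery is slow and the app crashes",
    "Easy checkout, fast delivery \u2764",
]
vectors = tfidf(reviews)
```

In this sketch a term that appears in only one review (such as "use") receives a higher weight than a term shared by several reviews (such as "app"), which is the behaviour the inverse document frequency is meant to produce. The resulting weight dictionaries would then be turned into fixed-length vectors and passed to a classifier.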
