Deep Learning Approaches for Classifying Informal and Formal English Texts Using Linguistic Features
Keywords:
ANN; CNN; Document Classification; Formal Documents; Informal Documents; LSTM
Abstract
Effective techniques for automatically classifying texts are increasingly necessary because of the exponential growth of digital material. Differentiating between formal and informal documents can help students identify appropriate resources for their assignments and can improve the effectiveness of information retrieval systems. Although machine learning is extensively used in text classification, little research has focused on effectively differentiating formal and informal writing through linguistic features. This gap highlights the need for advanced methodologies that improve classification accuracy and enhance the value of digital content in academic and retrieval systems. Our research addresses the problem by using deep learning methods and a set of 13 linguistic attributes to improve text classification performance. Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) networks were considered. A dataset including both formal (news articles, formal documents) and informal (personal letters, personal blogs) texts was gathered from several web sources. We considered linguistic markers such as colloquialisms, contractions, modal verbs, slang, acronyms, pronouns, phrasal verbs, grammar complexity, vocabulary complexity, voice, and language type to generate the feature vector. The feature vectors were used to train and assess the classification models under cross-validation with 3, 5, 7, and 10 folds. The models were evaluated using accuracy, precision, recall, and F-measure. The LSTM model outperformed the others, achieving the highest accuracy of 99.8% and robustly differentiating between formal and informal texts. Future research will examine larger datasets, more linguistic characteristics, more sophisticated deep learning models, and real-time and multilingual classification systems.
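The pipeline summarized in the abstract (linguistic markers turned into a feature vector, k-fold evaluation, and the four reported metrics) can be illustrated with a minimal sketch. It is not the authors' implementation: the word lists, the regular expression, and the function names `extract_features`, `k_fold_indices`, and `binary_metrics` are hypothetical, only four of the markers are covered, and the deep learning models themselves are omitted.

```python
import re

# Hypothetical, deliberately tiny lexicons; the paper's actual lists are not given.
# The contraction pattern only handles the ASCII apostrophe.
CONTRACTION_RE = re.compile(r"\b\w+'(?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE)
MODAL_VERBS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
PERSONAL_PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your"}
SLANG = {"gonna", "wanna", "kinda", "lol", "yeah"}


def extract_features(text):
    """Return per-token rates of four markers: contractions, modals, pronouns, slang."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return [
        len(CONTRACTION_RE.findall(text)) / n,
        sum(t in MODAL_VERBS for t in tokens) / n,
        sum(t in PERSONAL_PRONOUNS for t in tokens) / n,
        sum(t in SLANG for t in tokens) / n,
    ]


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

In use, each document would be mapped to a vector with `extract_features`, the vectors would be split with `k_fold_indices` for k in (3, 5, 7, 10), and the predictions of whichever model is trained on each split would be scored with `binary_metrics`. Shuffling or stratifying the samples before splitting is usually advisable and is left out here for brevity.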