A study on Chinese register characteristics based on regression analysis and text clustering

Hou, R., Huang, C. R., & Liu, H. (2019). A study on Chinese register characteristics based on regression analysis and text clustering. Corpus Linguistics and Linguistic Theory, 15(1), 1-37. https://doi.org/10.1515/cllt-2016-0062

Abstract

This paper reports an innovative Chinese register study based on regression analysis for sentence length distribution and text clustering. Although end of sentence is not conventionally marked in Chinese, we resolve this issue by assuming that segments between periods, question marks, and exclamation marks are sentences, which can be further divided into simple sentences and compound sentences. We also assume that segments between punctuation marks that express pauses in utterances form sentences (i.e., clauses). Using regression analysis, we find that the frequency distribution of sentence and clause lengths in Chinese can be fitted by the formula F = aL^b c^L, where L is sentence/clause length. Texts from different registers give rise to different fitted values of the parameters, and hence can serve to differentiate these registers. Finally, we use these parameters to represent and cluster texts from different registers. The successful text clustering results further prove that the parameters of the fitted results are reliable linguistic characteristics for different registers. In terms of linguistic theories, our study shows that it is just as effective to model sentence length in Chinese using sociological words (i.e., characters) as it is using linguistic words.
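The fitting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the regression was carried out, so the sketch assumes an ordinary least-squares fit on the log-transformed formula, ln F = ln a + b ln L + L ln c, which is linear in ln a, b, and ln c. The function names (`sentence_lengths`, `fit_curve`, `fit_length_distribution`), the exact set of sentence-final marks, and the choice to count every character between those marks (including clause-internal punctuation) are all assumptions made for illustration.

```python
import re
from collections import Counter

import numpy as np

# Sentence-final marks, full-width and ASCII. This exact set is an assumption.
SENTENCE_END = "。？！.?!"


def sentence_lengths(text):
    """Split text at sentence-final marks; return segment lengths in characters."""
    segments = re.split("[" + re.escape(SENTENCE_END) + "]", text)
    # Every remaining character counts toward the length, including commas.
    return [len(s.strip()) for s in segments if s.strip()]


def fit_curve(L, F):
    """Fit F = a * L**b * c**L by least squares on the log-transformed formula.

    ln F = ln a + b * ln L + L * ln c is linear in (ln a, b, ln c).
    L and F must be positive.
    """
    L = np.asarray(L, dtype=float)
    F = np.asarray(F, dtype=float)
    X = np.column_stack([np.ones_like(L), np.log(L), L])
    coef, *_ = np.linalg.lstsq(X, np.log(F), rcond=None)
    return float(np.exp(coef[0])), float(coef[1]), float(np.exp(coef[2]))


def fit_length_distribution(lengths):
    """Count how often each length occurs, then fit the curve to those counts."""
    counts = Counter(lengths)
    L = sorted(counts)
    F = [counts[length] for length in L]
    return fit_curve(L, F)
```

With a sufficiently large text, `fit_length_distribution(sentence_lengths(text))` would return one (a, b, c) triple per text; triples of this kind are what the abstract describes as the representation used for clustering. Fitting in log space weights rare lengths differently from a direct nonlinear fit, so the fitted values may differ from those obtained with other regression procedures.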
