Tales of a Coronavirus Pandemic: Topic Modelling with Short-Text Data


Shen, Adam




With more than 13 million tweets relating to the COVID-19 global pandemic, collected between March 2020 and November 2020, the topics of discussion are investigated using topic models: statistical models that learn the latent topics present in a collection of documents. Topic modelling is first conducted using Latent Dirichlet Allocation (LDA), a method that has seen great success when applied to formal texts. Because LDA learns latent topics by analysing term co-occurrences within individual documents, it can struggle with short documents such as tweets, each of which contains only a handful of co-occurring terms. To address the inadequacies of LDA on short text, a second topic modelling technique is considered, the Biterm Topic Model (BTM), which instead analyses term co-occurrences over the entire collection of documents. Comparing the performance of the two models, the topic quality of BTM was found to be superior to that of LDA.
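The contrast between document-level and corpus-level co-occurrence can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `extract_biterms` and `corpus_biterm_counts` and the toy tweets are hypothetical, and the sketch assumes that a biterm is any unordered pair of tokens appearing in the same short document. It covers only the step of pooling biterms over the whole collection, not the fitting of a topic model.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered token pair (biterm) from one short document.

    Assumption: the whole tweet is treated as a single context window.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Pool biterms over the entire collection rather than per document."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Toy, already-tokenised tweets (invented for illustration only).
docs = [
    ["covid", "vaccine", "trial"],
    ["vaccine", "trial", "results"],
    ["lockdown", "covid"],
]

counts = corpus_biterm_counts(docs)

# Each tweet yields only a few pairs, but pooling across the collection
# lets repeated pairs such as ("trial", "vaccine") accumulate evidence.
for biterm, n in counts.most_common(3):
    print(biterm, n)
```

In this toy collection no single tweet offers more than three word pairs, which is the sparsity that hampers a per-document model; the pooled counts are what a corpus-level model would be fitted to.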






Carleton University

Thesis Degree Name: Master of Science

Thesis Degree Level: 


Thesis Degree Discipline: 


Parent Collection: Theses and Dissertations

Items in CURVE are protected by copyright, with all rights reserved, unless otherwise indicated. They are made available with permission from the author(s).