I know this is a little late, but I hope this helps. First, understand that LDA operates on a document-term matrix (DTM), so you need to build one before you can fit the model. I suggest the following steps:
- Download the CSV file
- Extract the required tweets from the file
- Clean the data (remove URLs, punctuation, etc.)
- Build a corpus mapping each document index to its cleaned text
- Create the DTM structure (e.g. with `CountVectorizer`)
- Fit the vectorizer to the corpus
- Get the vocabulary (feature names) back from the vectorizer
- Continue from there with LDA
Here is some code to get you started:
```python
from sklearn.feature_extraction.text import CountVectorizer

# txt1 is your list of cleaned tweet strings
token_dict = {}
for i in range(len(txt1)):
    token_dict[i] = txt1[i]
print(len(token_dict))

print("\nBuild DTM")
tf = CountVectorizer(stop_words='english')

print("\nFit DTM")
tfs1 = tf.fit_transform(token_dict.values())
```

(The original used the IPython `%time` magic before the two vectorizer calls; that only works in a notebook or IPython shell, so I dropped it here.)