I have many gigabytes of Facebook / Twitter / RSS data.
I am using it to track, across the general population, the progression of hyperparathyroidism: from the point someone is diagnosed, through the drugs they took and the treatment methods used, to the end results.
I am new to NLTK, but I have extensive Python / SQL experience.
All of my data is about the parathyroid; however, as you can see below (a sample from the Twitter data), it is linguistically terrible:
omg i think my parathyroid is screwed up!!! Have been stuck at parathyroid hormone. STOP GETTING ON TWITTER JASMINE. Cryopreservation of Parathyroid Tissue after Parathyroid Surgery for Renal Hyperparathyroidism The Parathyroid as a Target for Radiation Damage it for the parathyroid hormone la
All this data is stored in a database. We also have fields like poster, zip code, message text, etc.
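A minimal sketch of what a first pass over such rows might look like, assuming each row arrives as a `(poster, zip_code, message_text)` tuple. The poster names, zip codes, stopword list, and function names here are hypothetical, and the message texts are fragments of the sample above. It uses only the standard library to lowercase and tokenize each message, then counts which terms occur together in the same message:

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical rows as they might come back from the database.
# Posters and zip codes are made up; texts are fragments of the sample above.
ROWS = [
    ("user_a", "10001", "omg i think my parathyroid is screwed up!!!"),
    ("user_b", "94105", "Cryopreservation of Parathyroid Tissue after Parathyroid "
                        "Surgery for Renal Hyperparathyroidism"),
    ("user_c", "60601", "The Parathyroid as a Target for Radiation Damage"),
]

# Tiny illustrative stopword list (an assumption, not a real lexicon).
STOPWORDS = {"a", "after", "as", "for", "i", "is", "my", "of", "the", "think", "up"}

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stopwords."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]


def cooccurrence(rows):
    """Count how often two distinct terms appear in the same message."""
    pairs = Counter()
    for _poster, _zip_code, text in rows:
        terms = sorted(set(tokenize(text)))  # unique terms, stable pair order
        pairs.update(combinations(terms, 2))
    return pairs


if __name__ == "__main__":
    for pair, count in cooccurrence(ROWS).most_common(5):
        print(pair, count)
```

The regex tokenizer is only a stand-in; NLTK's `nltk.tokenize.TweetTokenizer` could probably replace `tokenize` to handle hashtags, mentions, and emoticons better. Grouping the counts by zip code or poster would be a natural next step, but that is not shown here.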
I was wondering if anyone could point me in the right direction for the following:
- Are there effective algorithms to help me do what I need?
- Linguistically, how can we find correlations in the data? We are trying to track patterns.
- Is there some kind of "mesh" form that I need to put the data into to help with the analysis?