This module introduces the main methods of text analysis using natural language processing (NLP) techniques, from a computer science and data science perspective. The methods are introduced in relation to concrete applications, with the goal of extracting meaningful, structured knowledge along several dimensions from large amounts of unstructured text. The knowledge and applications are complementary to those of information retrieval (IR), with which they share several commonalities (e.g. document representation), and advanced IR topics will be included as well.


This module is divided into three parts, each starting with the description of one or more text analysis problems. The main methods needed to address these problems are then defined, with an emphasis on their generality and reusability. Finally, in each part, the methods are instantiated and combined to enable concrete applications.

The three parts are organized by increasing sophistication of the analysis of language in texts:

● Text analysis using bags-of-words (i.e. texts are considered as unordered collections of independent words)
● Text analysis using sequences of words
● Text analysis using sentence structure (i.e. considering also the dependencies between words)
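
As a rough illustration of the difference between the first two levels, the following minimal Python sketch compares a bag-of-words view with a simple sequence view of two sentences. The function names `bag_of_words` and `bigrams`, the whitespace tokenization, and the example sentences are assumptions made for this illustration only and are not taken from the module material. The third level (sentence structure) would additionally require a syntactic parser and is not shown.

```python
from collections import Counter


def bag_of_words(text):
    """Unordered word counts: word order is discarded."""
    return Counter(text.lower().split())


def bigrams(text):
    """Adjacent word pairs: a simple sequence view that keeps local order."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


# Two sentences with the same words but opposite meanings (illustrative only).
a = "the dog bites the man"
b = "the man bites the dog"

# The bag-of-words view cannot tell the sentences apart...
same_bag = bag_of_words(a) == bag_of_words(b)

# ...whereas the sequence view can, because it preserves word order.
same_bigrams = bigrams(a) == bigrams(b)

print("identical bags of words:", same_bag)
print("identical bigram sequences:", same_bigrams)
```

Under these assumptions, both sentences yield the same bag of words, while their bigram sequences differ. This is the kind of information gained when moving from the first level of analysis to the second.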