Detecting Trends in Small and Large Document Datasets

Abstract: Processing text datasets is an ever-increasing challenge for small and large projects across corporations. Because the amount of text data produced every day keeps growing, we need to innovate constantly in both algorithm design and the underlying engineering tools. The challenges are in fact twofold. For small amounts of data in niche fields, where data quality is also an issue (e.g. the presence of biases), algorithmic innovation matters more than supporting engineering tools. Conversely, when data is abundant, the engineering challenges dominate: how to gather, process, and store considerable amounts of information. Detecting trends in small datasets requires algorithms that converge fast and make use of prior knowledge about the specific field (e.g. inflection points in financing). Engineering support tools have to scale automatically and save costs depending on the computational load. The talk will summarize algorithmic approaches to trend detection in text datasets, as well as the underlying engineering support tools and how they need to be architected.
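
As one concrete illustration of the kind of prior-informed, fast-converging approach mentioned above, the minimal sketch below estimates a trend slope from a short keyword-frequency series using a Gaussian prior on the slope. All function names, parameters, and the toy data are hypothetical assumptions for illustration; they do not reproduce the talk's actual algorithms.

```python
"""Illustrative sketch only: a MAP slope estimate for a keyword-frequency
time series, shrinking toward a field-specific prior slope."""

import numpy as np


def map_trend_slope(counts, prior_slope=0.0, prior_var=1.0, noise_var=1.0):
    """Return the MAP estimate of a linear trend slope for a short series.

    counts      : per-period keyword frequencies (small dataset).
    prior_slope : prior belief about the slope (domain knowledge).
    prior_var   : variance of that prior; smaller means a stronger prior.
    noise_var   : assumed observation noise variance.
    """
    t = np.arange(len(counts), dtype=float)
    y = np.asarray(counts, dtype=float)
    # Center both axes so the intercept drops out of the slope estimate.
    t_c, y_c = t - t.mean(), y - y.mean()
    # Conjugate Gaussian update: weigh data evidence against the prior.
    precision = t_c @ t_c / noise_var + 1.0 / prior_var
    return (t_c @ y_c / noise_var + prior_slope / prior_var) / precision


if __name__ == "__main__":
    # Hypothetical monthly mentions of a niche financing term.
    monthly_mentions = [3, 4, 4, 6, 7, 9, 12, 15]
    slope = map_trend_slope(monthly_mentions, prior_slope=0.5, prior_var=0.25)
    print(f"Estimated trend slope: {slope:.2f} mentions/month")
```

With only a handful of observations, the prior keeps the estimate stable; as more data arrives, the data term dominates and the prior's influence fades, which is one simple way to encode domain knowledge for small, noisy datasets.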