At the PyData conference in Seattle, Rutu Mulkar-Mehta, PhD, walks her audience of programmers through a hypothetical case: an innovative Seattle start-up has hired them to analyze its linguistic data.
Of course, the major focus of this article is the concept of the stopword, which interferes with the data extracted through Python code. To put it simply, stopwords are the most common words in a language. Function words such as and, of, a, and an count, and in this hypothetical case so do domain-specific terms such as python and learning. When trying to extract the most commonly used words from a corpus, these words tend to crowd out the meaningful ones.
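A minimal sketch of this filtering step, using only the standard library and a small hand-picked stopword set (real projects would typically pull a fuller list from a library such as NLTK or spaCy; the sample sentence and word list here are my own invention):

```python
from collections import Counter

# A tiny, hand-picked stopword list for illustration only.
STOPWORDS = {"and", "of", "a", "an", "the", "to", "in", "is",
             # domain-specific stopwords, as in the talk's example
             "python", "learning"}

def top_words(text, n=3):
    """Return the n most common non-stopword tokens in text."""
    tokens = [w for w in text.lower().split() if w.isalpha()]
    counts = Counter(w for w in tokens if w not in STOPWORDS)
    return counts.most_common(n)

words = top_words(
    "the syntax of python and the grammar of python "
    "make learning grammar a joy"
)
```

Without the stopword filter, *the*, *of*, and *python* would dominate the list; with it, the content word *grammar* rises to the top.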
An important detail is that the data arrives as JSON, a format widely used for data interchange. Word frequency then becomes the focus of Mulkar-Mehta's talk; specifically, this kind of data is referred to as Term Frequency (TF). The major problem in this hypothetical case is that raw counts make all common terms look equally important. This is addressed with the Inverse Document Frequency (IDF): the logarithm of the total number of documents (N) divided by the Document Frequency (DF), the number of documents containing a given term (t). This results in the formula:

IDF(t) = log(N / DF(t))
As for how the Term Frequency applies, certain words and tokens are excluded first so the list of frequencies stays manageable. TF(t) is then multiplied by IDF(t) to give the TF-IDF score, which down-weights terms that appear in most documents.
It is quite interesting to see the interplay between programming and linguistics, which have more in common than one might think. What I noticed during my undergraduate and graduate years as an English major is that the digital humanities were becoming increasingly relevant. A major component of this, as Mulkar-Mehta indicated in one of her slides, is Part-of-Speech (POS) tagging.
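POS tagging assigns a grammatical category to every token. Real taggers, such as NLTK's pos_tag or spaCy's pipeline, use trained statistical models; purely to illustrate the idea, here is a toy lookup-based tagger with a tiny made-up lexicon:

```python
# Toy POS tagger with a hand-made lexicon; this is a sketch of the concept,
# not how production taggers (NLTK, spaCy) actually work.
LEXICON = {
    "the": "DET",
    "a": "DET",
    "cat": "NOUN",
    "data": "NOUN",
    "sat": "VERB",
    "analyzes": "VERB",
    "quickly": "ADV",
}

def tag(sentence):
    """Tag each lowercased token; unknown words default to NOUN."""
    return [(w, LEXICON.get(w, "NOUN")) for w in sentence.lower().split()]

tags = tag("The cat analyzes data quickly")
```

The limits of lookup tagging are exactly what makes the linguistics interesting: a word like "sat" can be a verb or a noun depending on context, which is why real taggers consider the surrounding words.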
- PyData. “Rutu Mulkar-Mehta: Using Python For Linguistic Data Analysis.” YouTube. 2015.
- Lubanovic, Bill. “Introducing Python: Modern Computing in Simple Packages.” O’Reilly. 2015.