The original impetus for this project was curiosity about whether topic modelling the major, general journals in Jewish studies could lead to a new understanding of the “state of the field.” It was inspired by Andrew Goldstone’s modelling of the PMLA, which can be found here. Our original thought was that we could simply download the data from JSTOR’s Data for Research service, run it through Goldstone’s code, and voilà, we would get a great, informative topic model.
For better and for worse, this was not to be. JSTOR changed the format in which they supplied their data; Goldstone’s code is written in R, while we wanted to use Python; and it was difficult for us to implement the actual server software. In rethinking both our goals and our implementation, though, we learned a great deal. The project is ongoing, and rather than wrap it all up neatly, we decided that it would be an interesting experiment to present the data in a more interactive way.
First, though, what is a topic model? Topic modelling is a technique for finding larger patterns in a (usually large) corpus of documents. It is technically an “unsupervised” method, in that the computer analyzes the words as individual tokens, looking for patterns with no regard for their semantic meanings. In reality, quite a bit of human tweaking goes into making these models. We have to set from the beginning the number of topics we want to look at and the number of words to be considered for each topic. We also have to determine which words should remain as significant and which should be eliminated (or “stopworded”) from the analysis as “noise.” For AJS Review, for example, we eliminated “Jew” and “Jewish” because they were so prevalent they became uninformative.
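For the curious, here is a minimal sketch of what that kind of stopword filtering looks like in Python with gensim (the package we used; more on it below). The sample documents and variable names are placeholders, not our actual pipeline; only the extra stopwords come from the discussion above.

```python
# A sketch of stopword filtering, not our production pipeline.
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

# Domain-specific words too prevalent to be informative (see above).
extra_stopwords = {"jew", "jewish"}
stopwords = STOPWORDS.union(extra_stopwords)

def tokenize(document):
    """Lowercase, strip punctuation and short tokens, and drop stopwords."""
    return [token for token in simple_preprocess(document)
            if token not in stopwords]

# Placeholder corpus: in practice, one string per article.
documents = ["The Jewish communities of the medieval Mediterranean ...",
             "Rabbinic literature and its modern interpreters ..."]
texts = [tokenize(doc) for doc in documents]
```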
For our models, we used a technique known as Latent Dirichlet Allocation (LDA). There is an element of probability in these models, so even runs on the same data will often produce somewhat different (but widely overlapping) lists of topics and their components. We experimented (as one must) with the various parameters that go into making such a model until we got a set of topics that individually looked relatively coherent and collectively were diverse. Since AJS Review does not publish many articles each year, we divided our analyses into (approximately) five-year blocks, ending with 2014, the last year of data available to us. We also created more holistic visualizations that analyzed the whole corpus over time.
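Concretely, training such a model with gensim looks roughly like the sketch below, which continues from the filtering sketch above. The parameter values shown (twenty topics, ten passes, a fixed random seed) are illustrative guesses, not the settings we finally arrived at for AJS Review.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# texts: tokenized documents, as produced by the filtering sketch above.
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=20,    # chosen up front, then adjusted by inspecting the output
    passes=10,
    random_state=42,  # LDA is probabilistic; fixing a seed makes runs repeatable
)

# Inspect the top words in each topic.
for topic_id, words in lda.print_topics(num_words=10):
    print(topic_id, words)
```

The random_state line is what tames the probabilistic element mentioned above: without it, two runs on the same corpus yield slightly different (if widely overlapping) topics.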
There are different ways of visualizing topics, but one of the more interesting and interactive ones comes out of a package known as pyLDAvis (cf. this description of topic model visualizations), which works with the topic modelling package that we used (gensim).
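Continuing from the training sketch above, handing the model off to pyLDAvis takes only a few lines. One caveat: the name of the gensim helper module depends on the pyLDAvis version (pyLDAvis.gensim in older releases, pyLDAvis.gensim_models in newer ones), and the output filename here is just an example.

```python
import pyLDAvis
import pyLDAvis.gensim_models  # pyLDAvis.gensim in older releases

# Build the interactive visualization from the trained model,
# the bag-of-words corpus, and the dictionary.
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)

# Write a standalone HTML page (or use pyLDAvis.display(vis) in a notebook).
pyLDAvis.save_html(vis, "ajs_topics.html")  # example output path
```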
The important thing is not what we did with the data, but what you make of it. Please tell us what you think – and how you would interpret this data – on the Participate! page.
To access our interactive visualizations of topics, chunked by type of article and year, go here.
To access our visualizations tracing the topics over time, go here.