tln devlog [2.0/3] – Preparing for a MongoDB media database

This series of articles loosely documents design choices and implementation issues I met during the making of the tln website. I learnt a few lessons over my improvised 9-week schedule, and I thought I could share them with fellow first-timers. I guess some technical points might also be of interest to more experienced developers.

Before what I consider to be real development, I took a few days to set up and feed the database which would eventually be used by the Critique application. This article reflects my thought process rather than a clean chronology, which does not really exist: some parts happened simultaneously, and a few others required me to rework previous steps. I actually believe there is no need to design things thoroughly beforehand, especially when one is in charge of everything and still has to learn some of the tools involved. Anyway, even the largest projects usually go through significant reworks as developers run into unexpected issues.

For what it's worth, here is a sorted list of the main tasks. I had to:

  • decide what to show out of the numerous texts and lists I published on SensCritique;

  • retrieve this personal data from SC (no public API, and I kept local copies of a few texts only);

  • gather some more public data, like movie posters;

  • format all this data so that it would fit into a workable database.

Choosing a database solution

Out of several DBMSs, I chose MongoDB. This arose mainly from my projected model for the oeuvre objects, which were to be treated as a single collection across different pages (in order to filter movies to watch, or books I read, or any piece of art I'd written a comment on, etc.), yet could be quite different from one another. Indeed I did not intend for every oeuvre to be illustrated, or to have an IMDb link, or to hold a comment from me, etc.

So I went for a NoSQL solution, which seemed more appropriate for handling, within one collection, objects with somewhat different sets of fields. As to why MongoDB over other NoSQL programs: it seemed flexible enough, I didn't have specific needs like handling huge amounts of data or syncing up multiple machines, and it provided an easy way to import data from JSON documents.
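To give an idea of what I mean (these field names are illustrative, not the actual tln schema), two rather different documents can sit side by side in the same collection:

```python
# Illustrative documents only: field names and values are made up for this example.
film = {
    "title": "Ghost in the Shell",
    "kind": "movie",
    "year": 1995,
    "imdb_id": "tt0000000",        # placeholder identifier
    "poster": "ghost_in_the_shell_1995.jpg",
    "comment": "…",
}
book = {
    "title": "Some Novel",         # no poster, no IMDb link, no comment
    "kind": "book",
    "year": 1936,
    "lists": ["to-read"],
}
# Both can live in the same collection despite holding different fields.
```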

Crawling SensCritique

For retrieving data, I used Scrapy, a Python library. Browsing through the source of the web pages holding the data and tweaking the example scripts from the docs quickly gave fine results. I was only able to collect public information: daily comments I wrote on movies, year-end lists, and so on. I had to sacrifice secondary info like watching dates, because it was only displayed when I was logged in to my account, and I was not able to set up authenticated sessions over HTTPS with Scrapy. The library seems able to handle simple login forms, but a JavaScript widget like the one SensCritique used required trickier moves I decided not to try to pull off.
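For reference, a spider along those lines looks roughly like the sketch below; the URL and CSS selectors are placeholders, not SensCritique's real markup, which has most likely changed since then anyway.

```python
import scrapy

class RatingsSpider(scrapy.Spider):
    """Minimal sketch of a profile crawler; selectors and paths are hypothetical."""
    name = "senscritique_ratings"
    start_urls = ["https://www.senscritique.com/<user>/collection"]  # placeholder path

    def parse(self, response):
        for item in response.css("div.product"):            # placeholder selector
            yield {
                "title": item.css("a.title::text").get(),    # placeholder selector
                "rating": item.css("span.rating::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()  # placeholder selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```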

Crawling IMDb

I turned to Scrapy again to add certain pieces of information I wanted to display; mostly this was about showing movie posters. I found out there was no public database for these, so I resolved to query IMDb for them. Further research showed that this might be contrary to their terms of use, but other people already seem to run commercial services on top of such scraping, making image queries for five minutes or so won't put a strain on their servers, and I'm not making money out of it, so screw this.

I settled on 300px-tall images. As it happens, IMDb supports a media query system where the image dimensions and crop limits can be passed through the URL itself. Note the last part of the URL here: the height can be changed to UY300 and the crop part can be removed. I can't see the benefit of developing such a system over serving images of a few fixed sizes (like HD vs. desktop vs. mobile) and letting the browser resize them if need be, but, uh, okay.
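Rewriting that suffix boils down to a one-line substitution; the suffix format assumed below is what I observed at the time, and may no longer match current IMDb image URLs.

```python
import re

def resize_imdb_poster_url(url, height=300):
    """Rewrite the sizing suffix of an IMDb poster URL to a height-only directive.

    Assumes a suffix of the form '._V1_<size and crop directives>.jpg'
    (e.g. '._V1_UX182_CR0,0,182,268_AL_.jpg'); treat this as a sketch, not a
    guaranteed description of IMDb's current URL scheme.
    """
    return re.sub(r"\._V1_[^.]*\.jpg$", f"._V1_UY{height}_.jpg", url)
```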

Crawling Google

Obviously, I had to retrieve each IMDb identifier to get to the web page referring to the appropriate poster. I tried to use the DuckDuckGo API through another Python library, but unfortunately the project had not been updated in several years and produced very inconsistent results. So I went on to make 2000+ automated movie title queries to Google and store the first result of each HTML results page.

Once more I did this against terms of service, but screw this again, and for so many reasons. I just had to randomize a 1-2 second delay between queries so that my IP would not be flagged as a potential DoS source for an hour or so. Before I understood this, access to Google's search became conditional on solving a few captchas. :')
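The throttling itself is trivial; in the sketch below the actual querying and parsing are stubbed out, since Google's result markup is volatile (and scraping it is, again, against their terms). The helper name and the title list are made up for the example.

```python
import random
import time

movie_titles = ["Ghost in the Shell", "Le fils unique"]  # in reality, 2000+ titles from SensCritique

def first_imdb_hit(title):
    """Hypothetical helper: query 'IMDb <title>' and return the first result URL.
    Fetching and HTML parsing are deliberately left out of this sketch."""
    return None  # placeholder

results = {}
for title in movie_titles:
    results[title] = first_imdb_hit(title)
    time.sleep(random.uniform(1.0, 2.0))  # randomized 1-2 s pause between queries
```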

I still had around 5% of the results to correct by hand, because near-homonyms frequently outranked the film I was asking about. For instance, I would search for the 1995 Mamoru Oshii film by querying 'IMDb Ghost in the Shell', but Google would first return the IMDb link to the 2017 reboot (yuck). Automated queries were further hindered by the fact that I had retrieved French titles from SensCritique, which often differ from both the original title and the international title best known to IMDb. Hence searching for Yasujirō Ozu's 'Le fils unique' (Hitori Musuko, or The Only Son) linked to an obscure 2010 autobiography by Belgian director Miel Van Hoogenbemt...

Building the MongoDB database

At any rate, I was able to gather all the data I wanted. Depending on the source, I created or updated JSON documents (more scripting) so that they could be imported with MongoDB's built-in import tool (mongoimport). Then, for automated overall modifications, I used MongoEngine but also PyMongo, mostly because I had to make complex low-level queries. Subsequent one-time modifications were carried out either through the web interface (once I'd set it up) or the mongo shell.
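As a rough idea of that workflow (database, collection and field names below are illustrative, not the real schema), the import plus one low-level fix-up pass could look like this:

```python
from pymongo import MongoClient

# The JSON documents were first imported with MongoDB's bundled tool, e.g.:
#   mongoimport --db tln --collection oeuvres --file oeuvres.json --jsonArray
# (names are illustrative).

client = MongoClient("mongodb://localhost:27017")
oeuvres = client["tln"]["oeuvres"]

# Example of a low-level fix-up pass: coerce ratings stored as strings into
# integers (the 'rating' field is an assumption, not the actual schema).
for doc in oeuvres.find({"rating": {"$type": "string"}}):
    oeuvres.update_one({"_id": doc["_id"]}, {"$set": {"rating": int(doc["rating"])}})
```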

I would advise against storing pictures directly inside the database. MongoDB allows this (through GridFS), but it seemed like a clumsy thing to do, and filename references to pictures stored in a regular folder work fine. As I found out a few months afterwards, it was also the easiest way to manipulate media files as the admin, and probably the only sane solution for serving media files properly (e.g. relying on client browser caching so as not to resend unmodified jpg files).
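In practice this just means a plain string field holding the filename; something along these lines, with field names that are mine rather than the real schema:

```python
from mongoengine import Document, StringField, IntField

class Oeuvre(Document):
    # Illustrative fields only, not the actual tln model.
    title = StringField(required=True)
    year = IntField()
    # Only a reference to the poster file is stored; the file itself lives in a
    # media folder served directly by the web server.
    poster_filename = StringField()
```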