This series of articles loosely documents design choices and implementation issues I met while making the tln website. I learnt a few lessons over my improvised 9-week schedule, and I thought I could share them with fellow first-timers. I guess some technical points might also interest more experienced developers.
Before what I consider the real development, I took a few days to set up and populate the database that the Critique application would eventually use. This article reflects my thought process rather than a clean chronology, which does not really exist: some parts happened simultaneously, and a few others required me to rework previous steps. I actually believe there is no need to design everything thoroughly beforehand, especially when you are in charge of everything and still have to learn some of the tools. Even the largest projects usually go through significant reworks as developers meet unexpected issues.
For what it's worth, here is a sorted list of the main tasks. I had to:
- decide what to show out of the numerous texts and lists I published on SensCritique;
- retrieve this personal data from SC (no public API, and I kept local copies of a few texts only);
- gather some more public data, like movie posters;
- format all this data so that it would fit into a workable database.
Choosing a database solution
Out of several DBMSs, I chose MongoDB. This arose mainly from my projected model for the oeuvre objects, which were to be treated as one and the same collection across different pages (in order to filter movies to watch, or books I read, or any piece of art I had written a comment on, etc.), yet could be quite different from one another. Indeed, I did not intend for every oeuvre to be illustrated, or to have an IMDb link, or to hold a comment from me, etc.
So I went for a NoSQL solution, which seemed more appropriate for handling objects of somewhat different shapes in a single collection. As to why MongoDB over other NoSQL options: it seemed flexible enough, I had no specific needs like handling huge amounts of data or syncing multiple machines, and it provided an easy way to import data from JSON documents.
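To illustrate the point about heterogeneous objects, here is a small sketch with two invented oeuvre documents (titles borrowed from later in this article, the IMDb id and file name are placeholders). They share a few core fields but differ elsewhere, yet can live in the same collection and be filtered uniformly:

```python
# Two hypothetical oeuvre documents: same collection, different fields.
oeuvres = [
    {"title": "Ghost in the Shell", "kind": "movie", "year": 1995,
     "imdb_id": "tt0000001", "poster": "gits.jpg"},   # placeholder id/file
    {"title": "Hitori Musuko", "kind": "movie", "year": 1936,
     "comment": "A quiet masterpiece."},              # no poster, no IMDb link
]

# Roughly what a MongoDB query like {"comment": {"$exists": True}} does:
commented = [o for o in oeuvres if "comment" in o]
```

A relational schema would have forced either a sparse table full of NULLs or a handful of joined tables; here each document simply carries the fields it actually has.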
I settled on 300px-tall images. As it happens, IMDb supports a media query system in which the image dimensions and crop limits are encoded in the URL itself. Note the last part of the URL here: the height can be changed to UY300 and the crop directive can be removed. I can't tell the benefit of developing this system over serving images in a few fixed sizes (HD vs. desktop vs. mobile, say) and letting the browser resize them if need be, but, uh, okay.
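Rewriting those URLs is a one-line regex job. The sketch below assumes (from my reading of such URLs) that the sizing directives sit in a `._V1_..._.` token just before the file extension; the example URL is invented:

```python
import re

# The '._V1_UX500_CR0,0,500,740_' part encodes resize and crop directives
# (an assumption based on the URL shape described above).
SIZE_TOKEN = re.compile(r'\._V1_[^.]*\.')

def poster_url(url, height=300):
    """Rewrite the directive token to request an uncropped, 300px-tall image."""
    return SIZE_TOKEN.sub(f'._V1_UY{height}_.', url)
```

Called on a hypothetical poster URL, it swaps the width/crop directives for a single `UY300` height directive.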
Obviously, I had to retrieve each IMDb identifier to reach the page referencing the appropriate poster. I first tried the DuckDuckGo API through another Python library; unfortunately the project had not been updated for several years and produced very inconsistent results. So I went on to make 2,000+ automated movie title queries to Google and to store the first result from each HTML page.
Once more this went against terms of service, but screw that again, for so many reasons. I just had to randomize a 1-2 second delay between queries so that my IP would not be tagged as a potential DoS source for an hour or so. Before I understood this, access to Google's search pages became conditioned on solving a few captchas. :')
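The core of that scraping loop can be sketched as follows. This is a simplification, not my actual script: it pulls the first IMDb title id out of a results page with a plain regex (a real scraper might use an HTML parser), and the randomized pause is shown as a comment so the sketch stays runnable:

```python
import random
import re
import time

IMDB_LINK = re.compile(r'https?://www\.imdb\.com/title/(tt\d+)')

def first_imdb_id(result_html):
    """Return the first IMDb title id found in a search-results HTML page."""
    m = IMDB_LINK.search(result_html)
    return m.group(1) if m else None

# In the real loop, each fetch would be preceded by a randomized pause so
# the traffic does not look like a DoS attempt:
#     time.sleep(random.uniform(1, 2))
```

Any HTTP client works for the fetching itself; the delay between requests is the part that matters.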
I still had around 5% of the results to correct by hand, because near-homonyms frequently outranked the film I was asking about. For instance, I would search for the 1995 Mamoru Oshii film by querying 'IMDb Ghost in the Shell', but Google would first return the IMDb link to the 2017 reboot (yuck). Automated queries were further hindered by the fact that I had retrieved French titles from SensCritique, which often differ from both the original title and the international title best known to IMDb. Hence searching for Yasujirō Ozu's 'Le fils unique' (Hitori Musuko, or The Only Son) linked to an obscure 2010 autobiography by Belgian director Miel Van Hoogenbemt...
Building the MongoDB database
At any rate, I was able to gather all the data I wanted. Depending on the source, I created or updated JSON documents (more scripting) so that they could be imported with MongoDB's built-in mongoimport tool. For automated overall modifications I then used MongoEngine, but also PyMongo when I had to make complex low-level queries. Subsequent one-time modifications were carried out either through the web interface (once I'd set it up) or the mongo shell.
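The JSON-preparation step can be sketched like this (records, file name, and database/collection names are all invented for the example). mongoimport reads one JSON document per line by default, so it is enough to dump each record on its own line:

```python
import json

# Hypothetical records to load into the database.
records = [
    {"title": "Le fils unique", "year": 1936},
    {"title": "Ghost in the Shell", "year": 1995, "poster": "gits.jpg"},
]

# One JSON document per line, the default format mongoimport expects
# (a single JSON array also works, with the --jsonArray flag).
with open("oeuvres.json", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Then, on the shell (assuming a local mongod instance):
#     mongoimport --db tln --collection oeuvres --file oeuvres.json
```

Everything after that point (MongoEngine, PyMongo, the mongo shell) operates on the imported collection.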
I would advise against storing pictures directly inside the database. MongoDB allows it (through GridFS), but it seemed like a clumsy thing to do, and filename references to pictures stored in a system folder work fine. As I found out a few months later, this was the easiest way to manipulate media files as the admin, and probably the only sane way to serve media files properly (e.g. relying on client browser caching so as not to resend unmodified jpg files).
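The filename-reference approach amounts to very little code. A minimal sketch, with a hypothetical media folder and field name:

```python
from pathlib import PurePosixPath

# Hypothetical media root on the server; documents store only the file name,
# and the application resolves it to a path (or URL) at serving time.
MEDIA_ROOT = PurePosixPath("/var/www/tln/media/posters")

def poster_path(doc):
    """Resolve a document's poster reference to a filesystem path, if any."""
    name = doc.get("poster")
    return MEDIA_ROOT / name if name else None
```

The web server (rather than the application) can then serve that folder directly, which is what makes browser caching of unmodified files essentially free.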