A Ruby script to download PDFs from issuu.com when the download option has been disabled.
Problem: I wanted to read the Sheffield DocFest “Decision Makers Guide 2014” but can’t stand reading on issuu.com. I just wanted a nice downloadable PDF, but the PDF download option that is generally under the share button was disabled.
Solution: I decided to see if I could extract the images that make up the Flash object on the site and combine them into a PDF.
First things first: identifying the element id. Clicking on the white part of the page and inspecting it in Chrome, I found the id=readerreader
and under the object type="application/x-shockwave-flash" I found the documentId=140601160255-3a4c0f75ec731801ef369f5000f03104
Looking on Stack Overflow I worked out the following URL: http://image.issuu.com/140601160255-3a4c0f75ec731801ef369f5000f03104/jpg/page_5.jpg , where 140601160255-3a4c0f75ec731801ef369f5000f03104 is the id of the document. Each page of the online magazine on issuu.com is available this way as a jpg file, named with an incremental count.
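To make the URL pattern concrete, here is a minimal sketch of building the address for a given page. The method name page_url and the constant DOCUMENT_ID are my own illustrative names, and the sketch assumes the URL scheme described above still holds.

```ruby
# Document id taken from the documentId attribute found in the page source.
DOCUMENT_ID = "140601160255-3a4c0f75ec731801ef369f5000f03104"

# Hypothetical helper: builds the image URL for one page of an issuu document.
def page_url(document_id, page_number)
  "http://image.issuu.com/#{document_id}/jpg/page_#{page_number}.jpg"
end

puts page_url(DOCUMENT_ID, 5)
# prints http://image.issuu.com/140601160255-3a4c0f75ec731801ef369f5000f03104/jpg/page_5.jpg
```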
What follows is the code I wrote in Ruby and ran from the terminal.
I used the Mechanize gem to download the images, and the Prawn gem to combine them into a PDF.
Let’s break it down in pseudocode for a second. As we can see, it splits into two parts:
Loop through the page numbers, substituting them into the URL, and save each image locally.
(After all, without this script this is what I would have done manually.)
Combine all the downloaded images into a PDF.
(Again, I would probably have done this “manually” with Preview or Automator on OS X.)
1. Downloading the Images
Let’s look at the first part first, downloading the images.
print "images from 1 to 104 downloaded as jpg\n"
I knew there were 104 pages, so I thought a loop would be a good fit.
I then added the Mechanize gem at the top with require 'mechanize'.
Within the loop I created a new Mechanize object and assigned it to a variable agent,
and also within the loop assigned the link to a variable link, where i is the number in the loop. I added a .to_s call to convert it to a string, to avoid any problem when building the URL.
With these two key elements in place, I went about using the Mechanize method get on the link, and then saving the result, using string interpolation of the loop number for the page name.
I then added a couple of print statements to get some feedback in the terminal while the program was running, and here it is: the first part of the problem solved.
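As a rough sketch of how the steps described above could fit together (a reconstruction from the description, not the original listing): the method names page_url, page_filename and download_pages are my own, and the local file naming scheme is an assumption. Mechanize is required inside the method only so that the small helpers can be used without the gem installed.

```ruby
# Hypothetical helper: image URL for one page of an issuu document.
def page_url(document_id, page_number)
  "http://image.issuu.com/#{document_id}/jpg/page_#{page_number.to_s}.jpg"
end

# Hypothetical helper: local file name for a downloaded page (assumed scheme).
def page_filename(page_number)
  "page_#{page_number}.jpg"
end

# Downloads pages 1..page_count of the document into the current directory.
def download_pages(document_id, page_count)
  require 'mechanize' # needs the mechanize gem installed

  (1..page_count).each do |i|
    agent = Mechanize.new            # new agent on each iteration, as described
    link = page_url(document_id, i)  # URL for page i
    agent.get(link).save(page_filename(i))
    print "downloaded #{page_filename(i)}\n"
  end
  print "images from 1 to #{page_count} downloaded as jpg\n"
end
```

Calling download_pages("140601160255-3a4c0f75ec731801ef369f5000f03104", 104) would then attempt to fetch all 104 pages. Creating the agent once, before the loop, would work just as well.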
2. Combining images into one PDF
Then for the second part of the problem, combining the images into a PDF, I decided to use the Prawn gem.
The idea here is that first you generate a PDF, setting the page layout to portrait.
Then I looped through the page number range, adding the jpeg files I saved in the first half of the problem to the PDF object I was creating.
After each page is created I call the method start_new_page on the PDF object.
And here is the second half of the solution all together.
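A minimal sketch of this second step, again reconstructed from the description rather than copied from the original listing. The names build_pdf and page_filename are illustrative. The fit: option and the check that skips the page break after the last image are my own additions, so that each image is scaled to the page and the PDF does not end with a blank page.

```ruby
# Hypothetical helper: local file name of a previously downloaded page.
def page_filename(page_number)
  "page_#{page_number}.jpg"
end

# Combines page_1.jpg .. page_N.jpg from the current directory into one PDF.
def build_pdf(output_name, page_count)
  require 'prawn' # needs the prawn gem installed

  Prawn::Document.generate(output_name, page_layout: :portrait) do |pdf|
    (1..page_count).each do |i|
      # Scale the image to fit inside the page margins (assumption, see above).
      pdf.image page_filename(i), fit: [pdf.bounds.width, pdf.bounds.height]
      # Start a fresh page for the next image, except after the last one.
      pdf.start_new_page unless i == page_count
    end
  end
  print "pdf generated as #{output_name}\n"
end
```

With the images from the first step in place, build_pdf("magazine.pdf", 104) would write the combined file, where magazine.pdf is just a placeholder name.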
Generalising the solution
Last but not least, I decided to change the code so that if I want to use it for another magazine on issuu.com, the terminal prompts me for the magazine name, page count, and document id, which generalises the solution to the initial problem.
The resulting code is as follows (also on GitHub):
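The prompting part could look something like the sketch below. The Settings struct and the methods ask and read_settings are hypothetical names of mine, and passing the input and output streams as arguments is just a choice that makes the method easy to try without typing at a terminal.

```ruby
# Holds the three values the script needs for any magazine.
Settings = Struct.new(:magazine_name, :page_count, :document_id)

# Prints a question and returns the answer typed by the user, without the newline.
def ask(question, input, output)
  output.print "#{question}: "
  input.gets.to_s.strip
end

# Prompts for magazine name, page count and document id, in that order.
def read_settings(input = $stdin, output = $stdout)
  name   = ask("Magazine name", input, output)
  count  = Integer(ask("Page count", input, output)) # raises if not a whole number
  doc_id = ask("Document id", input, output)
  Settings.new(name, count, doc_id)
end
```

Calling read_settings with no arguments reads from the terminal. The page count and document id it returns can then be passed on to the download and PDF steps, and the magazine name can be used for the output file name.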