Pietro Passarelli

issuu.com PDF downloader

Problem: I wanted to read the Sheffield DocFest “Decision Makers Guide 2014”, but I can’t stand reading on issuu.com. I just wanted a nice downloadable PDF, and the PDF download option that is usually under the share button was disabled.

Solution: I decided to see if I could extract the images that make up the Flash object on the site and combine them into a PDF. First things first: identifying the element id. Clicking on a white part of the page and inspecting it in Chrome, I found the id=readerreader

and under the object type="application/x-shockwave-flash" I found the documentId=140601160255-3a4c0f75ec731801ef369f5000f03104

Looking on Stack Overflow, I worked out the following URL: http://image.issuu.com/140601160255-3a4c0f75ec731801ef369f5000f03104/jpg/page_5.jpg , where 140601160255-3a4c0f75ec731801ef369f5000f03104 is the id of the document. Each page of the online magazine on issuu.com is served as a jpg file at this address, named with an incremental page number.
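To make the URL pattern concrete, here is a minimal sketch of a helper that builds the address for a given page. The method name page_url is my own invention for illustration; it only builds the string and does not download anything.

```ruby
# Hypothetical helper: builds the image URL for one page of an issuu document,
# following the URL pattern described above.
def page_url(document_id, page)
  "http://image.issuu.com/#{document_id}/jpg/page_#{page}.jpg"
end

puts page_url("140601160255-3a4c0f75ec731801ef369f5000f03104", 5)
# prints http://image.issuu.com/140601160255-3a4c0f75ec731801ef369f5000f03104/jpg/page_5.jpg
```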

What follows is the code I wrote in Ruby and ran from the terminal with

$ ruby issuu_downloader.rb

I used the Mechanize gem to download the images, and the Prawn gem to combine them into a PDF.

require 'mechanize'
require 'prawn'

######## 1. Loop through the pages to download them
for i in 1..104
  print "downloading\tpage n #{i}\n"
  agent = Mechanize.new
  link = "http://image.issuu.com/140601160255-3a4c0f75ec731801ef369f5000f03104/jpg/page_#{i.to_s}.jpg"
  agent.get(link).save "page_#{i.to_s}.jpg"
  print "downloaded\tpage n #{i}\n"
end

print "images from 1 to 104 downloaded as jpg\n"

######## 2. Combine all the images into a pdf

Prawn::Document.generate("DocFest_Decision Makers Guide 2014.pdf", :page_layout => :portrait) do |pdf|
  # pdf.text("Hello Prawn!")

  (1..104).each do |i|
    pdf.image "page_#{i.to_s}.jpg", :at => [0,750], :width => 530
    pdf.start_new_page
  end # end of loop
end # end of prawn

Let’s break it down in pseudocode for a second. As we can see, it splits into two parts:

  1. Loop through the page numbers, substituting each one into the URL, and save the image locally (after all, without this script this is what I would have done manually).

  2. Combine all the downloaded images into a PDF (again, I would probably have done this “manually” with Preview or Automator on OS X).

1. Downloading the Images

Let’s look at the first part first, downloading the images.

require 'mechanize'

######## Loop through the pages to download them
for i in 1..104
  print "downloading\tpage n #{i}\n"
  agent = Mechanize.new
  link = "http://image.issuu.com/140601160255-3a4c0f75ec731801ef369f5000f03104/jpg/page_#{i.to_s}.jpg"
  agent.get(link).save "page_#{i.to_s}.jpg"
  print "downloaded\tpage n #{i}\n"
end

print "images from 1 to 104 downloaded as jpg\n"

I knew there were 104 pages, so I thought a loop would be a good fit.

for i in 1..104
   #change the URL with string interpolation #{i} to change the page number
end 

I then added the Mechanize gem at the top with

require 'mechanize'

Within the loop I created a new Mechanize object and assigned it to a variable agent

agent = Mechanize.new

Also within the loop, I assigned the URL to a variable link, where i is the current number in the loop. I added a .to_s call to convert it to a string, to avoid any problem when building the URL.

link = "http://image.issuu.com/140601160255-3a4c0f75ec731801ef369f5000f03104/jpg/page_#{i.to_s}.jpg"
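As a side note, Ruby's string interpolation already calls to_s on whatever is inside #{}, so the explicit .to_s is harmless but optional. A minimal sketch to check this (the two method names are made up for the comparison):

```ruby
# Hypothetical helpers: the same file name built with and without an explicit to_s.
def name_with_to_s(i)
  "page_#{i.to_s}.jpg"
end

def name_without_to_s(i)
  "page_#{i}.jpg"
end

puts name_with_to_s(5) == name_without_to_s(5) # prints true
```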

With these two key elements in place, I called the Mechanize method get on the link and then saved the result, using string interpolation of the loop number to build the file name for the page.

"page_#{i.to_s}.jpg"
agent.get(link).save "page_#{i.to_s}.jpg"

I then added a couple of print statements to get some feedback in the terminal while the program was running, and here is the first part of the problem solved.

require 'mechanize'

######## Loop through the pages to download them
for i in 1..104
  print "downloading\tpage n #{i}\n"
  agent = Mechanize.new
  link = "http://image.issuu.com/140601160255-3a4c0f75ec731801ef369f5000f03104/jpg/page_#{i.to_s}.jpg"
  agent.get(link).save "page_#{i.to_s}.jpg"
  print "downloaded\tpage n #{i}\n"
end

print "images from 1 to 104 downloaded as jpg\n"
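The loop above creates a new Mechanize object on every iteration. One possible variation is to create the agent once and pass it in. This is only a sketch: the method name download_pages is hypothetical, and the agent argument is assumed to behave like a Mechanize instance (get returns something that responds to save).

```ruby
# Hypothetical helper: downloads pages 1..page_count using a single agent.
# `agent` is expected to respond to get(url) and return an object that
# responds to save(filename), as a Mechanize instance does in the code above.
def download_pages(agent, document_id, page_count)
  (1..page_count).map do |i|
    filename = "page_#{i}.jpg"
    link = "http://image.issuu.com/#{document_id}/jpg/page_#{i}.jpg"
    agent.get(link).save(filename)
    filename # collect the saved file names for the next step
  end
end

# Usage with the real gem (needs network access):
#   require 'mechanize'
#   download_pages(Mechanize.new, "140601160255-3a4c0f75ec731801ef369f5000f03104", 104)
```

Returning the list of file names means the PDF step could reuse it instead of rebuilding the names.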

2. Combining images into one pdf

Then, for the second part of the problem, combining the images into a PDF, I decided to use the Prawn gem (require 'prawn'). The idea here is that first you generate a PDF, setting the page layout to portrait,

Prawn::Document.generate("DocFest_Decision Makers Guide 2014.pdf", :page_layout => :portrait) do |pdf|

then I looped through the page number range, adding the jpeg files I saved in the first half of the problem to the pdf object I am creating.

(1..104).each do |i|
    pdf.image "page_#{i.to_s}.jpg", :at => [0,750], :width => 530

After each page is created, I call the method start_new_page on the pdf object (pdf.start_new_page). And here is the second half of the solution all together.

require 'prawn'

######## to combine all the images into a pdf

Prawn::Document.generate("DocFest_Decision Makers Guide 2014.pdf", :page_layout => :portrait) do |pdf|

  (1..104).each do |i|
    pdf.image "page_#{i.to_s}.jpg", :at => [0,750], :width => 530
    pdf.start_new_page
  end # end of loop
end # end of prawn
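One thing to be aware of: a Prawn document starts with a first page already created, so calling start_new_page after every image, including the last one, leaves an empty page at the end of the PDF. A possible variation, sketched below, starts a new page before every image except the first. The method name add_pages is hypothetical, and pdf is assumed to be the object yielded by Prawn::Document.generate.

```ruby
# Hypothetical helper: adds one image per page to `pdf`, which is expected to
# respond to image and start_new_page like the Prawn document used above.
def add_pages(pdf, filenames)
  filenames.each_with_index do |filename, index|
    # The document already has a first page, so only break before later images.
    pdf.start_new_page unless index.zero?
    pdf.image filename, :at => [0, 750], :width => 530
  end
end

# Usage with the real gem:
#   require 'prawn'
#   Prawn::Document.generate("out.pdf", :page_layout => :portrait) do |pdf|
#     add_pages(pdf, (1..104).map { |i| "page_#{i}.jpg" })
#   end
```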

Generalising the solution

Last but not least, I decided to change the code so that I can use it for another magazine on issuu.com: the script now prompts in the terminal for the magazine name, page count, and document id, which generalises the solution to the initial problem. The resulting code is as follows (also on GitHub):

require 'mechanize'
require 'prawn'
prompt = "> "

puts "What is the name of the magazine you'd like to download from issuu.com? ps: this will be the name of your pdf file\n"
print prompt
magazine_name = gets.chomp


puts "How many pages does it have?\n ie 104\n"
print prompt
page_number = gets.chomp

puts "document Id? \n to get the 'document-id' inspect page in chrome,\n search for document-id and paste here,\n ie 140601160255-3a4c0f75ec731801ef369f5000f03104\n"
print prompt
document_id = gets.chomp

for i in 1..page_number.to_i
  print "downloading\tpage n #{i}\n"
  agent = Mechanize.new
  link = "http://image.issuu.com/#{document_id.to_s}/jpg/page_#{i.to_s}.jpg"
  agent.get(link).save "page_#{i.to_s}.jpg"
  print "downloaded\tpage n #{i}\n"
end

print "images from 1 to #{page_number.to_s} downloaded as jpg\n"

######## to combine all images into a pdf

Prawn::Document.generate("#{magazine_name}.pdf", :page_layout => :portrait) do |pdf|

  for i in 1..page_number.to_i
    pdf.image "page_#{i.to_s}.jpg", :at => [0,750], :width => 530
    pdf.start_new_page
  end # end of loop
end # end of prawn

print "images from 1 to #{page_number.to_s} combined into pdf \n"

######## to delete all images, once the pdf has been created, to clean up a bit
for i in 1..page_number.to_i
  File.delete("page_#{i.to_s}.jpg")
end # end of loop

print "images from 1 to #{page_number.to_s} deleted \n"
# print "your pdf #{magazine_name}.pdf is in: \n #{Dir.pwd}"
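Since the answers now come from the terminal, it may be worth checking them before using them. In the script above, a non-numeric page count becomes 0 through to_i, so the loops simply do nothing without any warning. The sketch below shows one way to validate the input; both method names are hypothetical and not part of the original script.

```ruby
# Hypothetical helpers for checking the answers typed at the prompts.

# Returns the page count as an Integer, or raises ArgumentError
# if the input is not a whole number of at least 1.
def parse_page_count(input)
  count = Integer(input.strip, 10)
  raise ArgumentError, "page count must be at least 1" if count < 1
  count
end

# Replaces characters that are awkward in file names with underscores
# and appends the .pdf extension.
def pdf_filename(magazine_name)
  safe = magazine_name.strip.gsub(/[\/\\:*?"<>|]/, "_")
  "#{safe}.pdf"
end

puts parse_page_count("104\n")           # prints 104
puts pdf_filename("DocFest: Guide 2014") # prints DocFest_ Guide 2014.pdf
```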