Aug 04, 2016, by admin

Issues downloader for Bitbucket

This is an example of how to create a simple downloader for websites which are behind a login mechanism – for example Bitbucket issues. While the target of your crawling could differ, and could go to xxx, my case was finite, which makes it slightly easier, since I knew the base of the URIs before I even started. Long story short – here is code which can back up your pages while handling POST data & cookies, written purely in Node.js.

What I needed for the issues downloader:

  • Node.js runtime (of course)
  • some way to make HTTP requests easily
  • entry point & ending point – basically where to log in and when to stop
  • HTTPS certificates (explained later)
  • something for parsing HTML (the jQuery-like cheerio)
  • some output channel for the backup – preferably the filesystem module (fs)


From my point of view, Node.js is as suitable for this job as Python or Perl or whatever interpreted language. The downside of Node as a “callback-providing engine” is nested calls when you are lazy. Fortunately, this code won’t have more than 200 lines at most. The first thing I needed was to create an authorization request, so let’s check the login page, since I don’t know whether Bitbucket has an API for that.


[Screenshot: bitbucket-login – the login form fields as shown in developer tools]

So according to developer tools, I see only a next field, a csrf field, email and password. The first one is probably not important, but just in case, let’s include it. The second one is the token holder, which should be as important as the email or password. As is common, the csrf value is duplicated in the cookies.


[Screenshot: cookies – the csrf value duplicated in the cookies]


That means that to authorize, I first need to scrape the login page for the required data and then forge an HTTP POST request which looks the same as the one produced by the login form on the login page. Let’s write some code.

First – include the required modules (of course – npm install xxx) – all of them right now, since I know I will need them. Everything should be self-explanatory, except maybe rimraf. That is a module for recursive directory removal. Since I will be storing the HTML pages, I will probably need some directory, and by directory I mean a CLEAN directory.
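Something along these lines should do (listed here as a sketch – adjust to whatever you actually end up using):

    // npm install request cheerio rimraf
    var request = require('request'); // HTTP requests (cookies, POST data, ...)
    var cheerio = require('cheerio'); // jQuery-like HTML parsing
    var rimraf  = require('rimraf');  // recursive directory removal
    var fs      = require('fs');      // saving the downloaded pages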


Second step – retrieve the login page:

I started writing in functions, so it will be easier later when joining the code. As simple as it looks, the first function downloads the login page. Basically everything I need is the csrf token, which is duplicated – one instance is in the form, while the second is in the cookies. Since I don’t want to parse the HTML only for one token, I grab it from the second occurrence.
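A minimal sketch of that function could look like this – note that the login URL and the csrftoken cookie name are assumptions (the usual Django-style setup), so check them against what you see in your browser:

    var loginUrl = 'https://bitbucket.org/account/signin/'; // assumed login page URL

    function getLoginPage(callback) {
        request({ url: loginUrl, method: 'GET' }, function (err, res, body) {
            if (err) { return callback(err); }

            // Set-Cookie is an array of strings like "csrftoken=abc123; Path=/; ..."
            var cookies = res.headers['set-cookie'] || [];
            var csrfToken = null;
            cookies.forEach(function (cookie) {
                var match = cookie.match(/csrftoken=([^;]+)/);
                if (match) { csrfToken = match[1]; }
            });

            callback(null, csrfToken);
        });
    }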


Third step – attempt to log in. Since I have already checked the login page, I know that the URL which handles the POST request from the form is the same URL as the login page requested through a classic HTTP GET request (i.e. in the browser), so there is no need to change the URL from the previous step. Just create the same request, but change the method and add some data.
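The first attempt could look roughly like this – the field names mirror what the login form suggested in developer tools (csrfmiddlewaretoken and next follow the usual Django naming), so treat them as assumptions and check your own form:

    function login(csrfToken, callback) {
        request({
            url: loginUrl,   // same URL as the GET request above
            method: 'POST',
            form: {
                csrfmiddlewaretoken: csrfToken,  // the token scraped from the login page
                username: 'you@example.com',     // your e-mail / username
                password: 'your-password',
                next: '/'                        // the "next" field, just in case
            }
        }, function (err, res, body) {
            callback(err, body);
        });
    }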

If you run this code, you will see that something is not right. Even though the HTTP response status code is 200 (OK), the body doesn’t contain the page which you would normally get after logging in. When you debug the body variable, you will see the page, and thankfully the site shows you where exactly the mistake was. Well, more like mistakes.

That probably means that the request above didn’t send the csrf cookie, so the server couldn’t do the csrf checkup – from the request’s perspective. The request module doesn’t send cookies between requests automatically; it has a flag for it – named ‘jar’. It can be enabled globally – like this:
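One line is enough, using the defaults helper the request module provides:

    // Enable the cookie jar globally, so cookies from the GET request
    // (including the csrf cookie) are sent back with the following POST.
    var request = require('request').defaults({ jar: true });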

The second problem requires providing an additional header. The idea behind this is generally to stop CSRF attacks. No problem, just change the request to this:
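The likely candidate is the Referer header (a Django-style CSRF check over HTTPS refuses requests without it) – treat that as an assumption and check what your error page actually asks for. With the global cookie jar from the previous step enabled, the request becomes:

    function login(csrfToken, callback) {
        request({
            url: loginUrl,
            method: 'POST',
            headers: {
                // the additional header the CSRF check asks for (assumption: Referer)
                'Referer': loginUrl
            },
            form: {
                csrfmiddlewaretoken: csrfToken,
                username: 'you@example.com',
                password: 'your-password',
                next: '/'
            }
            // followAllRedirects: true  // see the note below
        }, function (err, res, body) {
            callback(err, body);
        });
    }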

Note: when you run this code and print the output, the HTML will still contain the CSRF verification problem text. That is because the request would normally redirect the visitor. For the complete response, you can add followAllRedirects: true (it will redirect the POST HTTP request to the dashboard – parameter next – ‘/’).


Fourth step – download one issue page and save it. For this we have to store the project URL and the issue number – the URL of the issue works even without the issue name in it (i.e. https://bitbucket.org/<user>/<repo>/issues/<issueNumber>).
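A sketch of that function, assuming the cookie jar already holds the session cookies from the login step (<user> and <repo> are placeholders for your own repository):

    var projectUrl = 'https://bitbucket.org/<user>/<repo>'; // fill in your repository
    var outputDir  = 'issues';

    function downloadIssue(issueNumber, callback) {
        var issueUrl = projectUrl + '/issues/' + issueNumber;

        request({ url: issueUrl, method: 'GET' }, function (err, res, body) {
            if (err) { return callback(err); }

            // store the page as e.g. issues/1/1.html
            var dir = outputDir + '/' + issueNumber;
            fs.mkdirSync(dir);
            fs.writeFileSync(dir + '/' + issueNumber + '.html', body);
            callback(null, issueUrl);
        });
    }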

And since I am going to save the downloaded pages into a directory, let’s delete it first and then create it anew.
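The directory preparation is two lines with rimraf, and then the functions from the previous steps can be joined together – roughly like this:

    // start with a clean output directory
    rimraf.sync(outputDir);
    fs.mkdirSync(outputDir);

    // join the pieces: get the csrf token, log in, download issue #1
    getLoginPage(function (err, csrfToken) {
        if (err) { throw err; }
        login(csrfToken, function (err) {
            if (err) { throw err; }
            downloadIssue(1, function (err, url) {
                if (err) { throw err; }
                console.log('saved ' + url);
            });
        });
    });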

OK, these are the basics. When you run the script, it will download the page into the defined directory, like /issues/1/1.html. This downloaded page is fortunately (thank you, Bitbucket) able to show everything you would normally see when you go through the login process in your browser and navigate to the URL of the issue (CSS + JS working). But you may notice a little problem – if you aren’t currently logged in while checking the downloaded page, you won’t see the images uploaded to the issue – they are behind another redirect and physically stored somewhere in the Amazon cloud.


So without an active auth session, you won’t see the uploaded images – and that will be covered in part two of this article.

Note: bitbucket.org offers an export of your issues, but that works only if you have admin privileges. Otherwise there is no API for you to casually download those issues as JSON or even raw, besides doing it manually. The Chrome store has some plugins, but seriously – you have to provide them your credentials, which I simply don’t want to do.