Issues downloader for Bitbucket

This is an example of how to create a simple downloader for websites that sit behind a login mechanism – in my case, Bitbucket issues. While the target of your crawling could differ – it could be any site – my case was finite, which makes it slightly easier, since I knew the base of the URIs before I even started. Long story short – here is code that can back up your pages while handling POST data & cookies, written purely in Node.js.

What I needed for the issues downloader:

  • Node.js runtime (of course)
  • some way to make HTTP requests easily
  • entry point & ending point – basically where to log in and when to stop
  • HTTPS certificates (explained later)
  • something for parsing HTML (jQuery-like cheerio)
  • some output channel for the backup – preferably the filesystem module (fs)


From my point of view, Node.js is as suitable for this job as Python, Perl or any other interpreted language. The downside of Node as a “callback-providing engine” is nested calls when you are lazy. Fortunately, this code won’t have more than 200 lines at most. The first thing I needed was to create an authorization request, so let’s check the login page, since I don’t know whether Bitbucket has an API for that.



So according to the developer tools, I see only a next field, a csrf field, email and password. The first one is probably not important, but just in case, let’s include it. The second one is the token holder, which should be as important as the email or password. As is common, the csrf value is normally duplicated in the cookies.




That means that to authorize, I first need to scrape the login page for the required data and then forge an HTTP POST request which should be the same as one submitted through the form on the login page. Let’s write some code.

First – include the required modules (of course – npm install xxx) – all of them right now, since I know I will need them. Everything should be self-explanatory, except maybe rimraf. That is a module for recursive directory removal. Since I will be storing the HTML pages, I will need a directory – and by directory I mean a CLEAN directory.


Second step – retrieve the login page:

I started writing in functions, so it will be easier later when joining the code. As simple as it seems, the first function downloads the login page. Basically all I need is the csrf token, which is duplicated – one instance is in the form, while the second is in the cookies. Since I don’t want to parse the HTML just for one token, I grab it from the second occurrence.
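A sketch of that step – the request call is in comments, and the cookie-scraping helper is real code (the cookie name csrftoken is an assumption based on Django’s defaults, which Bitbucket appeared to use):

```javascript
// Pure helper: pull a csrftoken=<value> pair out of an array of
// Set-Cookie header strings. "csrftoken" is Django's default cookie
// name -- treat it as an assumption and verify in developer tools.
function csrfFromSetCookie(setCookieHeaders) {
  for (const header of setCookieHeaders || []) {
    const match = /csrftoken=([^;]+)/.exec(header);
    if (match) return match[1];
  }
  return null;
}

// With the request module it would be used roughly like this:
// request.get(LOGIN_URL, (err, response, body) => {
//   const token = csrfFromSetCookie(response.headers['set-cookie']);
//   // token goes into the POST form in the next step
// });
```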


Third step – attempt to log in. Since I have already checked the login page, I know that the URL which handles the POST request from the form is the same as the login page requested through a classic HTTP GET request (i.e. in the browser), so there is no need to change the URL from the previous step. Just create the same request, but change the method and add some data.
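A sketch of that request – the field names (next, csrfmiddlewaretoken, email, password) are the ones visible in the inspected login form; the helper just builds the POST body:

```javascript
// Build the POST body for the login form. LOGIN_URL stays the same
// as in the GET step; only the method and the data change.
function buildLoginForm(csrfToken, email, password) {
  return {
    next: '/',                       // hidden field from the form
    csrfmiddlewaretoken: csrfToken,  // token scraped in the previous step
    email: email,
    password: password
  };
}

// request.post({ url: LOGIN_URL, form: buildLoginForm(token, email, pass) },
//              (err, response, body) => { /* inspect body here */ });
```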

If you run this code, you will see that something is not right. Even though the HTTP response status code is 200 (OK), the body doesn’t contain the page you would normally get after logging in. When you debug the body variable, you will see the page and, thankfully, the site shows you exactly where the mistake was. Well, more like mistakes.

That probably means that the request above didn’t send the csrf cookie, so the server couldn’t do the csrf check – from the request’s perspective. The request module doesn’t send cookies between requests automatically; it has a flag for this, named ‘jar’, which can be enabled globally with request.defaults({ jar: true }).

The second problem requires providing an additional header. The idea behind this is generally to stop CSRF attacks. No problem, just change the request to this:
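A sketch of the adjusted request. Which extra header the server expects depends on its CSRF protection; for Django-based sites a Referer header on HTTPS POSTs is the usual requirement, so that is assumed here:

```javascript
// Options for the login POST: cookie jar on, extra header set.
// The Referer header is an assumption (Django's CSRF check requires
// it for HTTPS POSTs); verify against the error page you get back.
function buildLoginRequest(loginUrl, form) {
  return {
    url: loginUrl,
    jar: true,                       // keep cookies between requests
    headers: { Referer: loginUrl },  // satisfy the CSRF referer check
    form: form
  };
}

// request.post(buildLoginRequest(LOGIN_URL, formData),
//              (err, res, body) => { /* should be past the CSRF error */ });
```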

Note: when you run this code and print the output, the HTML will still contain the CSRF verification problem text. That is because the request would normally redirect the visitor. For the complete response, you can add followAllRedirects: true (it will redirect the POST HTTP request to the dashboard – parameter next – ‘/’).


Fourth step – download one issue page and save it. For this we have to store the project URL and the issue number – the URL of the issue works even without the name in it (i.e. <user>/<repo>/issues/<issueNumber>).

And since I am going to save the downloaded page into a directory, let’s delete it first and then create it anew.

OK, these are the basics. When you run the script, it will download the page into the defined directory, like /issues/1/1.html. This downloaded page is fortunately (thank you, Bitbucket) able to show everything you would normally see when you go through the login process in your browser and navigate to the URL of the issue (CSS + JS working). But you may notice a little problem – if you aren’t logged in while checking the downloaded page, you won’t see the images uploaded for the issue – they are behind another redirect and physically stored somewhere in the Amazon cloud.


So without an active auth session, you won’t see the uploaded images – and that will be covered in part two of this article.

Note: Bitbucket offers an export of your issues, but that works only if you have admin privileges. Otherwise there is no API for you to casually download those issues as JSON or even raw, besides doing it manually. The Chrome store has some plugins, but seriously – you have to provide them your credentials, which I simply don’t want to do.


ES6 Proxy and catch-all method

As I talked about ECMAScript 2015/ES6 with a few people, I’ve realized that many features are still a mystery to the community. One of them is the Proxy object. Besides the classic language changes, there is a handful of goodies which I particularly like.


General Proxy

What is the Proxy object? If you haven’t heard of it yet, let’s look at this example schema of a classical proxy server:

Network Proxy in real life

There are many uses for proxy servers; just picture this one: the proxy lies between the target and the source of the request and masks the source’s identity, so the target does not know about the real computer behind the proxy. This concept is similar to NAT (although that works differently). As you can see from the schema above, there is a way for the proxy server to see and optionally alter the response or request. And for the goal of this article it’s enough to know that the JavaScript Proxy object masks calls to methods/attributes of another object – it basically wraps the target object. Since it is a wrapper and is “trapping” all calls, it can easily alter them.

In contrast to some other languages, JavaScript lacked this functionality until a few years ago. It is said that the Proxy is similar to Python’s attribute-access methods or PHP’s catch-all method, or that it enables meta-programming or is a basis for defensive programming. You can definitely use it to your benefit, that’s for sure.
(Note – this article considers support in Node.js rather than browsers.) The Proxy object has been around for several years now (as a planned feature of ECMAScript 2015/ES6) but wasn’t fully implemented even in Node.js until recently (6.0.0, it looks like). Even so, many enthusiasts have been using shims/transpilers/workarounds to be able to use Proxy objects before. Now they don’t have to complicate things. Back in the good ol’ days of Node.js 0.12.x, developers had another choice – to use the old Proxy API, which was somewhat less flexible than the new one. Even then, they had to enable the harmony feature(s) to be able to run their code.


You can find an archive of this old API in the Mozilla docs.

Let’s talk about the small differences between the old API and the new API – the basic usage:
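A sketch of both shapes – the old API lives only in comments, since Proxy.create is gone from modern engines; the standard API below runs as-is:

```javascript
// Old API (archived on MDN), usable only on old engines
// with --harmony-proxies:
//   var proxied   = Proxy.create(handler, proto);
//   var proxiedFn = Proxy.createFunction(handler, callTrap, constructTrap);

// New API: one constructor, any kind of target
// (object, array, function, even another proxy).
const target = { greeting: 'hello' };
const handler = {
  get(obj, prop) {
    // trap every property read, fall back to the real target
    return prop in obj ? obj[prop] : 'missing: ' + String(prop);
  }
};
const proxy = new Proxy(target, handler);
```

Reading proxy.greeting goes through the trap and returns the real value; reading an unknown property returns the 'missing: …' fallback.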


The new API unifies the previous two methods. The target argument in the new API accepts an array, a generic object, even a proxy or a function. The old API was more prone to mistakes, as can be seen below, and this was probably the reason why it was updated. The handler argument alone is basically the same and could be used without changes in both APIs – it is basically an object which holds functions, or better, traps (full list e.g. here). I will skip the basics and move on to slightly more advanced usage, since I don’t believe that anybody would use proxies with only the handler (i.e. without meaningful target/proto arguments).


This will basically be about implementing __noSuchMethod__ in objects, which could be described as a catch-all method. First – the old API:
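A sketch of the old-API approach (the bundle/Dummy names are illustrative); since Proxy.create no longer exists in modern engines, the call is guarded:

```javascript
// Old-API catch-all sketch: every property read on the proxy yields a
// function, emulating __noSuchMethod__. Proxy.create is gone from
// modern engines (it needed node --harmony-proxies), hence the guard.
function bundle() {
  function Dummy() {}
  if (typeof Proxy.create !== 'function') {
    throw new Error('old Proxy API (Proxy.create) is not available here');
  }
  return Proxy.create({
    get: function (proto, name) {
      return function () { return 'no method ' + name; };
    }
  }, Dummy.prototype);
}
```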


While this code works, it’s not exactly what I want – I don’t want to call the bundle function and create the instance of Dummy inside it… let’s put it somewhere else.


OK, that’s better. But now I am contradicting my initial statement that I want to use both parameters. Moreover, this code is not the best, since the function in the get trap would be created every time a new Dummy is created. One way of fixing this is to move the function out of the constructor.


And there is another problem – it uses a global variable, which would be overwritten every time a new Dummy is created in this scope. Somehow, the reference to the created object has to be passed into the get function. But that won’t be an issue with the new API, as the code below simply works, since we are not passing a non-instantiated Dummy, but a complete object.
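A sketch with the standard API, wrapping an already constructed instance so the trap closes over the real object and no global state is needed (the class and helper names are made up):

```javascript
class Dummy {
  constructor(name) { this.name = name; }
  real() { return 'real method on ' + this.name; }
}

// Wrap an instance; unknown properties become reporting functions,
// i.e. a catch-all method, without any global variable.
function withCatchAll(instance) {
  return new Proxy(instance, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      if (value !== undefined) return value;
      return (...args) =>
        'no method ' + String(prop) + ' (called with ' + args.length + ' args)';
    }
  });
}
```

Usage: const d = withCatchAll(new Dummy('d1')); d.real() hits the real method, while d.anything(1, 2) falls into the catch-all.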



The last code example is a crude yet effective way of implementing a catch-all method in your objects, which in turn could help you tremendously when creating a large project with many models/controllers while fully using inheritance. Personally, I see a big difference when writing code purely in ES6 (August 2016 – however, still using Babel). Then again – I would prefer an even better usage: if not a built-in method for each object, then at least directly extending the Proxy prototype (or “class”, if we are talking ES6) – which is an idea for part 2 of this article.