Thursday, March 31, 2011

Using curl to download password-protected pages

We are rebuilding one of our sites, which was used as a blog in the past. We decided to republish one of its posts here.

Sometimes our research requires downloading large amounts of content from password-protected websites. This can be done manually in a browser such as Internet Explorer, but it becomes tedious when many pages are involved. Fortunately, the task can be completely automated.

Automating the download process of massive content

For my work, I use two tools: Firefox and curl. Both are available for Windows and Linux. curl is a command-line tool for downloading web pages, similar to the well-known wget. Firefox needs no introduction; I will only say that it is a much better alternative to Internet Explorer.

The basic idea is to establish a session with the password-protected website using Firefox, and then pass the resulting session cookie to the standalone curl tool so that it can fetch the website's pages.

Step 1. (optional) Clear Cookies.

Clear all cookies so that fewer entries are listed when you look for the session cookie later: Firefox Menu -> Tools -> Clear Private Data. The Clear Private Data window appears. Select the Cookies checkbox.

Step 2. Log in to the website.

Go to the password-protected website, fill in your username and password, and log in.

Step 3. Getting the cookie.

Open the Firefox Preferences window and select the Privacy tab: Firefox Menu -> Edit -> Preferences -> Privacy.

Next, click the Show Cookies button. Select the required website and open the Cookie window.

Pay attention to the cookie's name and content fields; we will use them in the next step.

Step 4. Create the cookie container.

A cookie container is a file that stores cookie values. curl can read this file to reuse your established session (you have signed in to the website, right?).

Open your favorite text editor and create the cookie.txt file: FALSE / FALSE 0 cookie-name cookie-content

All fields must be separated by tabs.

For example: FALSE / FALSE 0 PHPSESSID a571d80ccf683df8da9031ada698336e
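Typing literal tabs in a text editor is easy to get wrong, so here is a minimal sketch that writes the file with printf instead. curl reads cookie files in the Netscape format, which has seven tab-separated fields per line and starts with the website's domain, followed by the fields shown above. The domain www.example.com is a hypothetical placeholder, and the cookie name and content are simply the sample values from the example line; replace all three with what the Firefox Cookie window shows for your site.

```shell
#!/bin/sh
# Hypothetical values: replace them with the domain, cookie name, and
# cookie content shown in the Firefox Cookie window.
COOKIE_DOMAIN="www.example.com"
COOKIE_NAME="PHPSESSID"
COOKIE_VALUE="a571d80ccf683df8da9031ada698336e"

# Netscape cookie file fields, in order:
#   domain, include-subdomains, path, secure-only, expiry, name, value
# An expiry of 0 marks a session cookie.
# printf turns each \t into a real tab character.
printf '%s\tFALSE\t/\tFALSE\t0\t%s\t%s\n' \
    "$COOKIE_DOMAIN" "$COOKIE_NAME" "$COOKIE_VALUE" > cookie.txt
```

Running the script once creates cookie.txt in the current directory, ready for the next step.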

Step 5. Download the website.

We are almost done! Download the required pages. Run the following command:
curl --cookie-jar cookie.txt -b cookie.txt

The cookie makes the website treat curl's requests as part of your logged-in session, so it accepts you as a legitimate user!
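To make the command easier to reuse, it can be wrapped in a small shell function. This is only a sketch: the function name fetch_page, the DRY_RUN switch, and the example URL are my own assumptions, not part of curl. The curl options themselves are standard: -b reads cookies from cookie.txt, --cookie-jar writes any updated cookies back to it when curl finishes, -o names the output file, and --fail makes curl return an error on HTTP failures instead of saving the error page.

```shell
#!/bin/sh
# fetch_page URL OUTPUT_FILE  (hypothetical helper)
# Sends the cookies stored in cookie.txt and saves the page to OUTPUT_FILE.
# With DRY_RUN=1 the command is printed instead of executed, which is
# handy for checking it before touching the network.
fetch_page() {
    url=$1
    out=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "curl --fail --silent --show-error --cookie-jar cookie.txt -b cookie.txt -o $out $url"
    else
        curl --fail --silent --show-error \
            --cookie-jar cookie.txt -b cookie.txt \
            -o "$out" "$url"
    fi
}

# Hypothetical URL; substitute a page from the site you logged in to.
DRY_RUN=1
fetch_page "http://www.example.com/members/page1.html" page1.html
```

Set DRY_RUN=0 (or remove the assignment) to perform the real download.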

Now you can download all the required content by running the same shell command for each page.
For example, these are the commands for downloading GreenSQL database names:

* curl --cookie-jar cookie.txt -b cookie.txt
* curl --cookie-jar cookie.txt -b cookie.txt
* curl --cookie-jar cookie.txt -b cookie.txt
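When there are many pages, repeating the command by hand gets tedious again, so a loop can do it. The sketch below assumes a hypothetical file urls.txt with one page URL per line; the function name download_all and the page-N.html output names are likewise my own choices. The one-second pause between requests is a courtesy to the server, not something curl requires.

```shell
#!/bin/sh
# download_all LIST_FILE  (hypothetical helper)
# Fetches every URL listed in LIST_FILE using the session cookie in
# cookie.txt and saves the pages as page-1.html, page-2.html, ...
# Prints the number of URLs processed.
download_all() {
    list=$1
    n=0
    while IFS= read -r url; do
        # Skip blank lines.
        [ -n "$url" ] || continue
        n=$((n + 1))
        curl --fail --silent --show-error \
            --cookie-jar cookie.txt -b cookie.txt \
            -o "page-$n.html" "$url" || echo "failed: $url" >&2
        # Pause briefly between requests.
        sleep 1
    done < "$list"
    echo "$n"
}

# Usage (urls.txt is an assumed file name):
#   download_all urls.txt
```

If the downloaded files turn out to contain the login page, the session has probably expired; log in again with Firefox and update cookie.txt.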
