Friday, December 28, 2012

Behind the browser

What actually happens behind the browser when we enter a URL? I would like to explain the entire process as it happens. Let's assume that we have just entered the URL 'google.com' in our browser and pressed Enter.


Step 1:
This step involves resolving the domain name to an IP address to which we can send our request for the web page. This mapping is done via an application-level protocol called DNS, the Domain Name System.
The lookup goes through the following layers in order, stopping at the first one that has the mapping:

Browser cache -
                  The browser itself caches DNS records for a while. However, the OS does not tell the browser the time to live of each record, so the browser keeps them for a duration of its own choosing.

OS Cache -  
                  If the mapping is not in the browser's cache, the browser asks the OS, which has its own cache. This is done through a library call such as 'gethostbyname', available on both Windows and Linux-based machines.

Router's Cache -
                  If the mapping is not in the OS cache, the request goes to the router we are connected to, which typically caches DNS mappings for a certain period of time as well.

ISP's DNS server - 
                  If the mapping is not in the router's cache, the request goes to the ISP's DNS server. ISPs typically run their own DNS servers to answer these requests, and those servers cache records too.

Recursive DNS Search - 
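                  If the mapping is not available in the ISP's DNS server either, that server performs a recursive DNS search: it starts at a root name server, is referred to the .com top-level-domain name servers, and from there to the name servers responsible for google.com.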



 This will provide us with the IP address we need to send our HTTP request to.
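
The lookup order above can be sketched as a chain of caches that are tried in turn. This is only an illustration and not how any browser implements it: the class name LayeredResolver, the layer names, and the addresses are made up for the example (the addresses come from the 192.0.2.0/24 range reserved for documentation).

```python
from typing import Callable, Dict, List, Optional, Tuple

# A lookup layer takes a hostname and returns an IP address, or None on a miss.
Lookup = Callable[[str], Optional[str]]


class LayeredResolver:
    """Try each layer in order and return the first answer (hypothetical helper)."""

    def __init__(self, layers: List[Tuple[str, Lookup]]) -> None:
        self.layers = layers

    def resolve(self, hostname: str) -> Tuple[str, str]:
        for name, lookup in self.layers:
            address = lookup(hostname)
            if address is not None:
                return name, address
        raise LookupError(f"could not resolve {hostname}")


# Illustrative data only: 192.0.2.x is reserved for documentation.
browser_cache: Dict[str, str] = {}
os_cache: Dict[str, str] = {}
router_cache: Dict[str, str] = {"example.com": "192.0.2.10"}

resolver = LayeredResolver([
    ("browser cache", browser_cache.get),
    ("OS cache", os_cache.get),
    ("router cache", router_cache.get),
])

layer, ip = resolver.resolve("example.com")
print(f"{layer} answered: {ip}")
```

In a real program the OS-level step is a single call such as socket.getaddrinfo("google.com", 80). It needs network access, so it is left out of the sketch.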

Step 2:
   We need to send the HTTP request to fetch the URL we need.
 [Note: I used Chrome's built-in developer tools to view the request that was sent.]
 The browser forms the HTTP request and sends it to the IP address:
GET http://google.com/ HTTP/1.1
Host: google.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
X-Chrome-Variations: CLu1yQEIhLbJAQigtskBCKW2yQEIqLbJAQiptskBCLq2yQEIu4PKAQ==
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cookie: NID=67=pJorFW7 [...]

The HTTP request contains headers that give the server the information it might need. Some of the important headers are User-Agent, which identifies the client initiating the request; Accept, which lists the content types the client can handle in the response; Accept-Encoding, which lists the content encodings (compression methods) the client can decode; and Cookie, which sends back the cookies the browser holds for this domain. To learn more about HTTP Headers [...]
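
To make the wire format concrete, here is a minimal sketch of how such a request could be serialized. The helper name build_get_request is made up for this example, and only a few of the headers from the capture are included. The request line uses the path form (/) that a client normally sends straight to a server, whereas the capture above shows the full URL.

```python
from typing import Dict


def build_get_request(host: str, path: str, headers: Dict[str, str]) -> bytes:
    """Serialize a minimal HTTP/1.1 GET request (hypothetical helper)."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Each line ends with CRLF, and an empty line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


request = build_get_request(
    "google.com",
    "/",
    {
        "Connection": "keep-alive",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Encoding": "gzip,deflate",
    },
)
print(request.decode("ascii"))
```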

Step 3: The response from the server for this request is as follows


HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/
Content-Type: text/html; charset=UTF-8
Date: Fri, 28 Dec 2012 20:07:34 GMT
Expires: Sun, 27 Jan 2013 20:07:34 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 219
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN


The HTTP 301 status indicates that the page has moved permanently. On receiving it, the browser issues a new request to the URL http://www.google.com/ instead of the old URL http://google.com/. One reason for redirecting is search ranking: we do not want two different URLs pointing to the same resource, because page hits could be split between them and each would rank lower. Search engines understand the redirect and credit the hits to the new URL.

The other reason is caching: if the same content is cached under different URLs, the cache holds multiple copies of it, which wastes space and lowers the hit rate.

 According to the Status Code Definitions section of RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1, Fielding, et al.), HTTP 301 is described as follows:
The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. This response is cacheable unless indicated otherwise. The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s). If the 301 status code is received in response to a request other than GET or HEAD, the user agent MUST NOT automatically redirect the request unless it can be confirmed by the user, since this might change the conditions under which the request was issued. Note: When automatically redirecting a POST request after receiving a 301 status code, some existing HTTP/1.0 user agents will erroneously change it into a GET request.
To learn more about HTTP 301 [...]
To learn more about HTTP 301 implementations in different platforms [...]
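
On the server side, a redirect to the preferred host name comes down to a small check. The sketch below is an assumption about how such a rule might look, not Google's actual implementation; the function name and the CANONICAL_HOST constant are made up for the example.

```python
from typing import Dict, Tuple

CANONICAL_HOST = "www.google.com"  # assumption: the host the site wants to be known by


def redirect_if_not_canonical(host: str, path: str) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers): 301 to the canonical host, or 200 if already there."""
    if host != CANONICAL_HOST:
        return 301, {"Location": f"http://{CANONICAL_HOST}{path}"}
    return 200, {}


print(redirect_if_not_canonical("google.com", "/"))
print(redirect_if_not_canonical("www.google.com", "/"))
```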

Step 4: The browser again issues a GET request to the new URL in the response


GET http://www.google.com/ HTTP/1.1
Host: www.google.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
X-Chrome-Variations: CLu1yQEIhLbJAQigtskBCKW2yQEIqLbJAQiptskBCLq2yQEIu4PKAQ==
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cookie: NID=67=pJorFW7Q [....]

Step 5: The response returned for this request is an HTTP 302 Found status code


HTTP/1.1 302 Found
Location: https://www.google.com/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=26fdae9d55c46b89:U=1ee158d5a202460c:FF=1:LD=en:NW=1:TM=1356672777:LM=1356725255:GM=1:SG=2:S=cDOs6qNFYgZICceY; expires=Sun, 28-Dec-2014 20:07:35 GMT; path=/; domain=.google.com
Date: Fri, 28 Dec 2012 20:07:35 GMT
Server: gws
Content-Length: 220
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN

The HTTP 302 status indicates that the resource temporarily resides under a different URI. Since the redirection might change, the browser should continue to use the original request URI for future requests. The URI to use for the current request is given in the Location field of the response.

The RFC states that the response should contain a short hypertext note with a hyperlink to the new URI.

It also states that if the 302 was received in response to a request other than GET or HEAD, the user agent must not redirect automatically unless the user confirms it.
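
The difference between the two redirects can be sketched from the client's point of view. The code below replays the chain seen so far against canned responses instead of the network. The function name follow_redirects and the table FAKE_RESPONSES are made up for this example, and real browsers apply more rules than this.

```python
from typing import Dict, List, Optional, Tuple

# Canned responses standing in for the network: URL -> (status, Location header).
FAKE_RESPONSES: Dict[str, Tuple[int, Optional[str]]] = {
    "http://google.com/": (301, "http://www.google.com/"),
    "http://www.google.com/": (302, "https://www.google.com/"),
    "https://www.google.com/": (200, None),
}


def follow_redirects(url: str, permanent: Dict[str, str], max_hops: int = 5) -> List[str]:
    """Return the chain of URLs visited, recording only 301 targets in `permanent`."""
    chain = [url]
    for _ in range(max_hops):
        status, location = FAKE_RESPONSES[url]
        if status not in (301, 302) or location is None:
            return chain
        if status == 301:
            # Permanent: future requests may go straight to the new URL.
            permanent[url] = location
        # 302 is temporary, so the original URL stays valid for future requests.
        url = location
        chain.append(url)
    raise RuntimeError("too many redirects")


remembered: Dict[str, str] = {}
print(follow_redirects("http://google.com/", remembered))
print(remembered)
```

Only the 301 hop ends up in remembered; the 302 hop is followed for this request but not stored.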



Step 6:
Now the browser knows the correct URI to send the GET request to, so it forms a new HTTP GET request and sends it to that URI

GET http://www.google.com/complete/search?sugexp=chrome,mod=11&client=chrome&hl=en-US&q=goo&sugkey=AIzaSyCLlKc60a3z7lo8deV-hAyDU7rHYgL4HZg HTTP/1.1
Host: www.google.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cookie: NID=67=pJorFW7Qdq [...]


Step 7:
This request now reaches the server, which processes it and returns a response.

The server side consists of two parts:
Web server - Software such as IIS or Apache, which handles all incoming requests and hosts the web application. It receives each request and forwards it to the appropriate web application. Incoming requests are queued, and they can be handled either in a separate thread per request or all in a single thread; each approach has its own advantages and disadvantages [..] .
Web application - The request handler. It reads the request, inspects its parameters and cookies, processes it, updates the database and server state if needed, and finally sends a response back to the client.
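
The split between web server and web application can be illustrated with Python's WSGI interface, used here as a stand-in for IIS or Apache and whatever application they host. The application below and the call_app helper are made up for this sketch; call_app plays the role of the web server by building the request environment and collecting the response, without opening a socket.

```python
from urllib.parse import parse_qs
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A tiny WSGI web application: read a query parameter and answer with text."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    name = params.get("name", ["world"])[0]
    body = f"Hello, {name}!".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(query: str):
    """Invoke the application the way a web server would (hypothetical helper)."""
    environ = {"QUERY_STRING": query}
    setup_testing_defaults(environ)  # fills in the remaining required WSGI keys
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(application(environ, start_response))
    return captured["status"], captured["headers"], body


print(call_app("name=browser"))
```

To serve the same application over real HTTP, the standard library's wsgiref.simple_server.make_server can take the place of call_app.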

Step 8:
 The response generated by the server is as follows,

HTTP/1.1 200 OK
Date: Fri, 28 Dec 2012 20:07:35 GMT
Expires: Fri, 28 Dec 2012 20:07:35 GMT
Cache-Control: private, max-age=3600
Content-Type: text/javascript; charset=UTF-8
Content-Disposition: attachment
Content-Encoding: gzip
Server: gws
Content-Length: 173
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN

[gzip-compressed response body, not human-readable]

The Content-Encoding header tells the browser that the body is compressed with the gzip algorithm, so the browser needs to decompress it. The Content-Type header indicates what type of content the response body carries; here it is JavaScript. Response headers can also specify whether cookies are to be set, how the response may be cached, and other information.
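
Decompressing a gzip-encoded body can be sketched with Python's gzip module. The payload below is made up for the example and is not the actual body of the captured response; decode_body is a hypothetical helper that handles only gzip.

```python
import gzip
from typing import Dict


def decode_body(headers: Dict[str, str], body: bytes) -> bytes:
    """Undo the Content-Encoding of a response body; only gzip is handled here."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    return body


# Simulate the server side: compress a small, made-up JavaScript payload.
original = b'["goo",["google","google maps"]]'
response_headers = {
    "Content-Type": "text/javascript; charset=UTF-8",
    "Content-Encoding": "gzip",
}
compressed = gzip.compress(original)

print(decode_body(response_headers, compressed).decode("utf-8"))
```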

Step 9:
Once the response is received, the browser starts to render the HTML page, even though all the necessary resources (such as images, Flash, video, etc.) may not be available yet. As it renders, the browser discovers that it needs other resources to build the page, each identified by its URI.
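
How a client discovers those extra resources can be sketched with Python's built-in HTML parser. The page below is made up for the example and the class name ResourceCollector is hypothetical; a real browser looks at many more tags and attributes than these three.

```python
from html.parser import HTMLParser
from typing import List


class ResourceCollector(HTMLParser):
    """Collect URLs of sub-resources referenced by img, script and link tags."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("img", "script") and attributes.get("src"):
            self.urls.append(attributes["src"])
        elif tag == "link" and attributes.get("href"):
            self.urls.append(attributes["href"])


# Made-up page used only for illustration.
page = """
<html><head><link rel="stylesheet" href="/style.css">
<script src="/app.js"></script></head>
<body><img src="/images/logo.png"><p>Hello</p></body></html>
"""

collector = ResourceCollector()
collector.feed(page)
print(collector.urls)
```

Each collected URL would then be fetched with its own GET request.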

When I fetched the Google homepage, resources such as the following were downloaded with individual GET requests:


Request URL: https://ssl.gstatic.com/gb/images/k1_a31af7ac.png
Request Method: GET
Status Code: 200 OK (from cache)

Request URL: https://www.google.com/images/nav_logo114.png
Request Method: GET
Status Code: 200 OK (from cache)

Each of these requests follows the same process as fetching the HTML page. Static resources such as images can be cached, so that later requests are served from the cache, as the "from cache" status above shows.

When such a resource was first requested, the response would have carried an Expires header, which specifies how long the resource may be cached. In addition, a response can carry an ETag, which serves as a version identifier for the resource. When a cached copy has expired, the browser sends the stored ETag in an If-None-Match header; if the resource has not changed, the server answers 304 Not Modified without a body and the browser reuses its copy. This saves a lot of bandwidth by not downloading files that are already available.
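
The ETag check can be sketched as a conditional GET. In the example below the server answers 304 with an empty body when the client's If-None-Match value matches the current ETag. The function names are made up, and deriving the ETag from a hash of the content is just one common choice assumed for the example.

```python
import hashlib
from typing import Dict, Tuple


def make_etag(content: bytes) -> str:
    """Derive a validator from the content (one common choice, assumed here)."""
    return '"' + hashlib.sha256(content).hexdigest()[:16] + '"'


def serve(content: bytes, request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], bytes]:
    """Answer 304 with no body when the client's cached copy is still current."""
    etag = make_etag(content)
    if request_headers.get("If-None-Match") == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, content


logo = b"pretend these are PNG bytes"

# First request: nothing cached, so the full body is returned.
status, headers, body = serve(logo, {})
print(status, len(body))

# Later request: the browser revalidates using the ETag it stored.
status, _, body = serve(logo, {"If-None-Match": headers["ETag"]})
print(status, len(body))
```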

Step 10:
Once the page is rendered, the browser may still communicate with the server. This is done with the help of 'Ajax', which stands for Asynchronous JavaScript and XML. With Ajax, the client and server can exchange data without having to reload and re-render the whole page. The response to an Ajax request could be XML, JSON, or JavaScript.
For example, the Google+ notification widget might constantly poll and check for updates. Since HTTP is a request-response protocol, the server cannot push messages to the client without an HTTP request, so the client needs to poll the server for updates.

This pattern could be used for chat, where we need chat notifications from the server. Long polling is one of the techniques used: the server holds the request open until it has something to send. [....]
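
The idea behind long polling can be sketched without any networking: the "server" holds the request open until a message arrives or a timeout expires. The class NotificationServer below is an in-memory stand-in made up for this example; a real implementation would sit behind an HTTP endpoint.

```python
import queue
import threading
from typing import Optional


class NotificationServer:
    """In-memory stand-in for a server endpoint that supports long polling."""

    def __init__(self) -> None:
        self._messages: "queue.Queue[str]" = queue.Queue()

    def publish(self, message: str) -> None:
        self._messages.put(message)

    def long_poll(self, timeout: float) -> Optional[str]:
        # Hold the "request" open until a message arrives or the timeout expires.
        try:
            return self._messages.get(timeout=timeout)
        except queue.Empty:
            return None


server = NotificationServer()

# Publish a notification shortly after the client starts waiting.
threading.Timer(0.1, server.publish, args=["1 new notification"]).start()

print(server.long_poll(timeout=2.0))  # returns once the message is published
print(server.long_poll(timeout=0.2))  # nothing pending, so this times out
```

Compared with polling at a fixed interval, the client here learns about the notification as soon as it is published, at the cost of keeping a request open.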


This is the way the browser interacts with the server in order to render your page.




NOTE: Thanks to Igor Ostrovsky and his wonderful blog which helped me write this article



References:
http://igoro.com/archive/what-really-happens-when-you-navigate-to-a-url/
http://en.wikipedia.org/wiki/Push_technology
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
http://en.wikipedia.org/wiki/Domain_Name_System



