1. HTTP content negotiation on AWS CloudFront Part 2

    My earlier post on HTTP content negotiation in AWS CloudFront covered support for negotiating the response encoding using the request’s Accept-Encoding header. This post builds on that by adding support for negotiating Content-Type using the request’s Accept header.

    As Ilya Grigorik laid out several years ago, content negotiation is tricky to get right, and yet particularly important when serving images. This is even more relevant today, as the browser ecosystem’s support for WebP has recently taken a big step forward -- Edge just shipped support and work on Firefox support has resumed after a long hiatus. Furthermore, there continue to be interesting new image formats on the horizon such as AVIF and HEIF.

    There are several reasons why Content-Type negotiation is more difficult than Content-Encoding. First, the media types being negotiated are hierarchical, supporting a type and subtype model, e.g. image/png. In addition, wildcards are supported, e.g. image/*. Finally, unlike with Accept-Encoding, where browsers explicitly send all of the encodings that they support, browsers tend not to do this with the Accept header. For example, Firefox sends Accept: */* when requesting images. This gives the HTTP server no indication of what the browser actually supports -- by the RFC the server would be allowed to return any content at all. As a result, servers are typically either conservative, returning only formats which are highly likely to be supported like image/jpeg, or fall back to heuristics like User-Agent sniffing to detect specific browser builds which support a server-preferred content type.

    What is an HTTP server implementer to do? Apache’s mod_negotiation has a fairly sophisticated set of heuristics for supporting content negotiation which covers some of this, including working around overly-permissive wildcards.

    With AWS CloudFront, we can implement something similar to drive this process on Lambda@Edge. The code below is being used to serve this article, and will cause the image of a dog to be returned as WebP if your browser supports it.

    How does this work?

    Similar to my previous post, this uses the zero-dependency, MIT-licensed http-content-negotiation-js library to run the content negotiation process. This library implements all of the requisite media range parsing and semantics, as well as some of the heuristics from mod_negotiation. For example, it treats matches against a subtype wildcard as having an implicit q-value of 0.02 if none of the media ranges in the request have an explicit q-value specified.

    First, the SERVER_IMAGE_TYPES list of ValueTuple objects is created to represent our content type preferences for images. Note that we indicate that we have a slight preference for image/jpeg (q-value 0.99) over image/webp (q-value 0.98). This handles user-agents who do not send an explicit indication of support for one of the image types (e.g. Firefox sends Accept: */*). In this case, we’ll prefer the more conservative image/jpeg. For user-agents that do send specific image types (e.g. Chrome sends Accept: image/webp,image/apng,image/*,*/*;q=0.8), we’ll honor their request for a more specific type.
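    The behavior described above can be sketched in plain JavaScript. Note that this is a simplified stand-in, not the actual http-content-negotiation-js API (the function names here are hypothetical), and it uses mod_negotiation-style per-range implicit q-values for wildcards:

```javascript
// Simplified media-type negotiation, in the spirit of mod_negotiation and
// http-content-negotiation-js. Names here are hypothetical, not the library's.

// Parse an Accept header into a list of {type, subtype, q, explicitQ}.
function parseAccept(header) {
  return header.split(',').map(function (part) {
    var pieces = part.trim().split(';');
    var ts = pieces[0].split('/');
    var q = 1.0, explicitQ = false;
    pieces.slice(1).forEach(function (param) {
      var kv = param.trim().split('=');
      if (kv[0] === 'q') { q = parseFloat(kv[1]); explicitQ = true; }
    });
    return { type: ts[0].trim(), subtype: (ts[1] || '').trim(), q: q, explicitQ: explicitQ };
  });
}

// Specificity of a range relative to a concrete type: 2 for an exact match,
// 1 for 'type/*', 0 for '*/*', -1 for no match.
function specificity(range, type, subtype) {
  if (range.type === type && range.subtype === subtype) return 2;
  if (range.type === type && range.subtype === '*') return 1;
  if (range.type === '*' && range.subtype === '*') return 0;
  return -1;
}

// Wildcard ranges with no explicit q-value get a small implicit q, so a
// permissive '*/*' does not trump the server's conservative default.
function effectiveQ(range) {
  if (range.explicitQ) return range.q;
  if (range.type === '*') return 0.01;
  if (range.subtype === '*') return 0.02;
  return range.q;
}

// serverPrefs: [{name: 'image/jpeg', q: 0.99}, ...]. Returns the best name,
// scoring each server type by the most specific matching request range.
function negotiateType(acceptHeader, serverPrefs) {
  var ranges = parseAccept(acceptHeader);
  var best = null, bestScore = -1;
  serverPrefs.forEach(function (pref) {
    var ts = pref.name.split('/');
    var match = null;
    ranges.forEach(function (r) {
      var s = specificity(r, ts[0], ts[1]);
      if (s >= 0 && (match === null || s > match.s)) match = { s: s, r: r };
    });
    if (match === null) return;
    var score = effectiveQ(match.r) * pref.q;
    if (score > bestScore) { bestScore = score; best = pref.name; }
  });
  return best;
}

var SERVER_IMAGE_TYPES = [
  { name: 'image/jpeg', q: 0.99 },
  { name: 'image/webp', q: 0.98 }
];
```

    With these preferences, Firefox's Accept: */* negotiates to image/jpeg, while Chrome's explicit header negotiates to image/webp.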

    Next, the request handler looks for request URLs ending in .jpg and interprets this as a request for an image, performing type negotiation. We then rewrite the URL for the upstream request with a new file extension based on the negotiated content type.

    Finally, it’s worth noting that while type and encoding negotiation are not mutually exclusive, it is generally not worthwhile spending CPU cycles to encode and decode images. Because of this, we only bother performing encoding negotiation if we’re not serving an image.
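    Putting the pieces together, an origin-request handler along these lines could look like the following. This is a hedged sketch, not the exact code serving the article; in particular it uses a naive substring check in place of the full media-range negotiation:

```javascript
// Sketch of a Lambda@Edge origin-request handler that rewrites *.jpg
// requests based on Accept negotiation. Simplified: a real handler would run
// the full media-range negotiation rather than a substring check.
'use strict';

function handleOriginRequest(event) {
  var request = event.Records[0].cf.request;
  if (/\.jpg$/.test(request.uri)) {
    var accept = (request.headers['accept'] || [])
      .map(function (h) { return h.value; })
      .join(',');
    // Prefer WebP only when the browser names it explicitly; otherwise fall
    // back to the conservative JPEG variant.
    if (accept.indexOf('image/webp') !== -1) {
      request.uri = request.uri.replace(/\.jpg$/, '.webp');
    }
    // Images are already compressed, so skip Content-Encoding negotiation.
  }
  return request;
}
```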

  2. HTTP content negotiation on AWS CloudFront

    HTTP content negotiation is a mechanism by which web servers consider request headers in addition to the URL when determining which content to include in the response. A common use case for this is response body compression, wherein a server may decide to gzip the content if the request arrived with an Accept-Encoding: gzip header.

    Support for content negotiation in HTTP servers is a mixed bag. Apache provides good built-in support for this. NGINX does not offer anything comparable although a rough approximation is possible via configuration directives. Unfortunately I can’t find documentation on IIS. AWS S3 static website hosting, which is used to serve this blog, provides no facility for this whatsoever.

    Over the past few years, CDNs have evolved to help address this problem in a few ways.

    Most CDNs can compress content on the fly, even if the origin only serves uncompressed content. Support for gzip is de rigueur, with CloudFront supporting Brotli as well. In practice, however, this can be limited. For example, AWS CloudFront won't compress anything under 1KB or over 10MB. In addition, compression is typically more effective the more CPU you spend on it, though this effect is non-linear. For example, running gzip at level 9 can produce content that is 10s of percent smaller than level 1, but requires several times the processing power. As a result, CDNs are typically configured to run at fairly low optimization levels.

    Recently CDNs have also begun to allow applications to run business logic at the edge. CloudFlare workers, AWS Lambda@Edge and Fastly VCL are all examples of this.

    Felice Geracitano had the clever idea to use Lambda@Edge on AWS CloudFront to implement a bare bones content negotiation scheme for the purpose of supporting Brotli. While there are some issues with his implementation, the concept of performing content negotiation in JavaScript on the CDN and using the result to drive fetching a different resource from the origin is a powerful one.

    What does a good solution for this look like?

    • The origin server does not need to support content negotiation. This is both cheaper to operate and allows for offline processing of assets (e.g. to compress using gzip level 9 or Zopfli).
    • The content negotiation process should respect quality factors in the HTTP request headers, e.g. Accept-Encoding: gzip, br;q=0.9 indicates that if content encodings for both gzip and br are available, it prefers gzip.
    • The CDN should serve content with a correct Vary header to ensure that downstream caches are not confused by the content negotiation process.
    • The results of JavaScript content negotiation logic should be cached by the CDN, and only re-executed on CDN cache misses.
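    The quality-factor requirement in the list above can be sketched as follows (a simplified stand-in for the real negotiation code; it ignores wildcards and treats a missing q as 1.0):

```javascript
// Pick the best encoding the origin can serve, honoring request q-values.
function negotiateEncoding(acceptEncoding, available) {
  var best = null, bestQ = 0;
  acceptEncoding.split(',').forEach(function (part) {
    var pieces = part.trim().split(';');
    var name = pieces[0].trim();
    var q = 1.0;
    pieces.slice(1).forEach(function (p) {
      var kv = p.trim().split('=');
      if (kv[0].trim() === 'q') q = parseFloat(kv[1]);
    });
    if (available.indexOf(name) !== -1 && q > bestQ) {
      best = name;
      bestQ = q;
    }
  });
  return best;
}
```

    For `Accept-Encoding: gzip, br;q=0.9` with both encodings available, gzip (implicit q of 1.0) wins over br (q of 0.9).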

    Below is an implementation of this for AWS CloudFront; it is being used to handle traffic to https://blog.std.in. The code is MIT licensed and is derived from pgriess/http-content-negotiation-js.

    How does this work?

    The origin for https://blog.std.in/ is an S3 bucket configured for static website hosting. There are 3 different versions of each piece of content -- one un-processed, one compressed with gzip, and another compressed with Brotli. The compressed content lives in a shadow directory hierarchy under /gzip and /br respectively, allowing the path for compressed content to be computed by prepending the requisite directory.

    There are two handlers -- an origin request handler and an origin response handler. There are no viewer handlers, allowing CloudFront to skip this logic entirely when serving a cache hit. The origin request handler performs the content negotiation, parsing the Accept-Encoding header and comparing its requirements with what’s provided by the S3 bucket serving as the origin. It selects the best match and updates the URI to fetch from the origin. The origin response handler sets a Vary: Accept-Encoding header on the response indicating that content was negotiated based on the value of the Accept-Encoding header. The resulting response is then cached in CloudFront.
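    A sketch of the two handlers is below. The negotiation here is deliberately naive (prefer br, then gzip, based on a token check); the real implementation uses http-content-negotiation-js:

```javascript
'use strict';

// Origin request handler: map the negotiated encoding onto the shadow
// directory hierarchy (/gzip/... or /br/...) in the S3 bucket.
function originRequest(event) {
  var request = event.Records[0].cf.request;
  var header = (request.headers['accept-encoding'] || [])
    .map(function (h) { return h.value; })
    .join(',');
  if (/\bbr\b/.test(header)) {
    request.uri = '/br' + request.uri;
  } else if (/\bgzip\b/.test(header)) {
    request.uri = '/gzip' + request.uri;
  }
  // No match: leave the URI alone and serve the uncompressed variant.
  return request;
}

// Origin response handler: mark the response as varying on Accept-Encoding
// so that downstream caches key on it too.
function originResponse(event) {
  var response = event.Records[0].cf.response;
  response.headers['vary'] = [{ key: 'Vary', value: 'Accept-Encoding' }];
  return response;
}
```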

    Finally, the CloudFront distribution is configured with the “Cache Based on Selected Request Headers” setting set to include Accept-Encoding. This has the effect of CloudFront incorporating the browser’s Accept-Encoding header in its cache key when looking up a response. In addition, this prevents CloudFront from stripping the browser’s Accept-Encoding header before the origin request handler has a chance to execute.

    A bug in CloudFront?

    It is surprising to me that the Vary header is not added automatically by CloudFront: enabling “Cache Based on Selected Request Headers” and adding Accept-Encoding explicitly indicates that the content for the given URL may vary by the value of this header. This seems like a pretty clear signal that CloudFront should be adding a Vary: Accept-Encoding header to the response automatically.

  3. Simple AWS Request Signing

    Amazon Web Services has been online for more than a decade, and now supports a dizzying array of services backed by a relatively easy-to-use REST API. Unfortunately, while the ecosystem of tools and libraries has expanded exponentially, working with these services tends to require a deep scaffolding of dependencies. Lately I've been playing with OpenWRT on my home network and wanted to make some AWS REST API requests. The device has a fairly generous 128MB of storage, but even so, pulling in the official awscli Python package with its dependencies weighs in at over 30MB, not including Python itself.

    Introducing aws4sign, a zero-dependency, MIT-licensed, single-file Python2 library and CLI tool that computes the AWS v4 signature.

    There are two ways to use this.

    Invoke as an executable

    The aws4sign.py file itself contains a simple __main__ that makes it easy to drive signature generation from anywhere that can invoke an executable.

    The following bash snippet calls the AWS Route53 hostedzone API using curl:

    The only thing of note here is that we end up calling aws4sign.py once for each header that we need to pass to curl, selecting the header to display using the -p option. If we omitted this option, all headers would be emitted (one per line), but parsing these is a bit more involved than desirable for such a simple example. Instead, we just emit a single header each time and use cut to grab its value. Note, however, that because the AWS signature algorithm uses a timestamp, we need to ensure that aws4sign.py has a constant notion of time across invocations. We do this by computing the current time up-front and passing it using the -t option.

    Integrated with Python code

    Copy and paste the single 100-line aws4_signature_parts() function into your code. Or integrate it into a module of your own. Whatever. No dependencies. No mucking with PIP. No incompatible licenses.

    The following code invokes the Route53 hostedzone API using urllib2:

    The first two return values from aws4_signature_parts() should probably be ignored by most users -- they are mostly in place to provide visibility into the signing process for validation and testing purposes.

    That's it! Happy signing.

  4. HTTP Response sizes and TCP

    It's no secret that reducing the size of HTTP responses can lead to performance improvements. Surprisingly, this is not a linear relationship; decreasing response size only slightly can dramatically reduce the time required to transfer the data.

    This document explains the throughput characteristics of an established TCP connection and how they can shape performance, often in surprising ways.

    Note: I'm making some simplifying assumptions here so that things are easier to model: a pre-existing, idle TCP connection, and no packet loss. This effectively shows the best case scenario for how TCP can handle a response.

    A (brief) refresher on TCP

    TCP has several mechanisms that govern how fast the sender can send data.

    While a comprehensive understanding of TCP is way, way beyond the scope of this note (and not something the author would claim to possess anyway), the basic flow control mechanisms are not horribly complicated.

    First, a bit of vocabulary

    • sender -- the party sending data, e.g. an HTTP client when sending a request, or an HTTP server when sending a response; both parties in a TCP connection are senders and receivers
    • receiver -- the party receiving data, e.g. an HTTP client when receiving a response, or an HTTP server when receiving a request; both parties in a TCP connection are senders and receivers
    • data segment -- a single IP packet containing a TCP header and at least one byte of application data
    • congestion window (cwnd) -- the number of un-acknowledged data segments issued by the sender that can be in-flight at once; changes over time as the sender observes congestion on the connection
    • initial congestion window (IW or initcwnd) -- the initial value of cwnd for new connections; 10 is the standard value and what Facebook uses
    • maximum segment size (MSS) -- the largest possible size of a single data segment; negotiated during TCP handshaking
    • receive window (rwnd) -- the number of bytes that the receiver is willing to buffer for the application
    • round-trip time (RTT) -- the amount of time it takes a packet to travel from the sender to the receiver, and back again; colloquially known as "ping time"

    The maximum amount of data in-flight from the sender to the receiver is defined to be min(MSS * cwnd, rwnd).

    Each ACK for a data segment that arrives back at the sender frees up a slot in the cwnd. If the sender is unable to send additional data segments because there are already cwnd un-acknowledged segments in-flight, it can send out new data each time an ACK arrives. In addition, the cwnd is incremented by 1 each time an ACK is received, effectively doubling the cwnd value each time a flight of ACKs arrives for the outstanding data segments.

    Data flights and bandwidth

    The sender can have up to cwnd segments in-flight at a given time. Beyond that, the sender is stuck waiting for ACKs before it can emit additional segments. For large responses, this means that we typically see a pattern where cwnd segments are emitted all at once, an RTT passes, and cwnd ACKs arrive all at once. At this point, the sender can then send out another cwnd worth of segments. As a result, output tends to be bursty, with periodicity equal to the RTT.

    Recall that each ACK received increments the cwnd by 1. For a large response (i.e. the sender wants to send as much as possible at every opportunity), every data flight is twice as large as the one before it.

    How long does it take to send a response?

    If we're able to fit our response into the first data flight, we will require only a single round-trip to receive the response. The inverse is also true: if our response is only a single byte too large, the full response will not be available to the receiver for an additional RTT.

    This illustrates an interesting property of TCP's congestion control algorithm: when investigating latency it's useful to think of transmission size in terms of the number of data flights that are required to transmit it, rather than the absolute byte counts. That is, a single-byte response will take just as much time to receive as a cwnd * MSS response.

    Here is the amount of time required to transmit various data payloads on typical cell networks around the world. Assumptions: MSS of 1300, cwnd of 10 (the IETF recommended IW), and RTTs as shown for various countries.

    • 1xRTT (USA 150ms; India 1200ms; Brazil 600ms): 1 byte - 13,000 bytes
    • 2xRTT (USA 300ms; India 2400ms; Brazil 1200ms): 13,001 bytes - 39,000 bytes
    • 3xRTT (USA 450ms; India 3600ms; Brazil 1800ms): 39,001 bytes - 91,000 bytes
    • 4xRTT (USA 600ms; India 4800ms; Brazil 2400ms): 91,001 bytes - 195,000 bytes

    RTT values are hypothetical but realistic RTTs for cell network users in the respective countries.
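    The thresholds in the table fall out of a simple geometric series: flight n carries IW * 2^(n-1) segments, so n flights deliver MSS * IW * (2^n - 1) bytes in total. A small calculator:

```javascript
// Number of round-trips needed to deliver `bytes`, assuming no loss, an idle
// connection, and a cwnd that doubles on each flight (slow start).
function roundTripsNeeded(bytes, mss, initcwnd) {
  var delivered = 0, cwnd = initcwnd, flights = 0;
  while (delivered < bytes) {
    delivered += cwnd * mss;
    cwnd *= 2;
    flights += 1;
  }
  return flights;
}

// With MSS 1300 and IW 10, the thresholds from the table above:
console.log(roundTripsNeeded(13000, 1300, 10)); // 1
console.log(roundTripsNeeded(13001, 1300, 10)); // 2
console.log(roundTripsNeeded(91000, 1300, 10)); // 3
```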

    Ok, how can we speed things up?

    Using the above table, we can see that if we have a response that tends to be around 40k, the effort to reduce it below the 39k threshold eliminates a full round-trip -- a 33% decrease in time to receive the data! Given that network time often dominates performance, this can be a significant win.

    If you are running your own server, you could also increase the IW value directly, though you really want to be sure you know what you're doing; it's easy to cause performance problems by introducing congestion into the network that would have otherwise been avoided. For kicks, here's a link showing the IW values for major CDN providers.

  5. How to stream MP3 audio from Rdio

    Recently, I was working on a personal project for which I wanted to stream audio from my Rdio (which, incidentally, is a great service) account. Unfortunately, the documented Rdio APIs don't provide a way to do this, instead providing streaming through a Flash player for web apps or compiled libraries for iOS and Android. I spent a bit of time reverse-engineering pieces of the Rdio ecosystem to figure out how to do this and thought I'd post the resulting recipe in Python in case anyone is interested.

    First of all, apply for an Rdio API key.

    Next, you'll need to install a bit of software:

    • rdio-python package, which gives us access to the official Rdio REST API
    • PyAMF, which we use to make calls to the un-documented Flash API
    • rtmpdump, which we use to stream the FLV content from the RTMP server. Note that versions after 2.1d don't work, as they refuse to talk to servers that they deem not to be genuine Adobe
    • ffmpeg, which we will use to transcode the FLV audio into MP3

    Once you've got all that installed, the process is relatively straight-forward: use the Rdio API to search for a track you want to download and grab a playback token for it; use the (un-documented) Flash API to retrieve parameters for constructing arguments to rtmpdump; run rtmpdump and pipe the output to ffmpeg to trans-code the FLV to MP3.

    Here's a proof-of-concept script that implements that process:

    To use it, invoke it with your API key, API secret and the query you wish to use for searching for songs. It will perform the search and spit out shell commands for fetching and trans-coding the song.

    I've wrapped this behavior up in a simple Python library and put it up on GitHub as pyrdiostream.

  6. NodeJS and V8

    [This is a response to this blog post by @olympum, with the rest of the thread being here and here.]

    Bruno makes three assertions about the relationship between NodeJS and V8: that V8 was not designed as a server-side engine, that V8's lack of threading inhibits adequate fault isolation, and that lack of explicit alignment between the V8 and NodeJS projects may lead to problems in the future.

    V8 was not designed for server-side execution

    I'm not really sure what this means, as Bruno doesn't provide any details on what he's concerned about.

    When compared with the JVM, which offers distinct client and server modes affecting primarily JIT compilation and garbage collection strategies, V8 is indeed less full featured. In fact, in virtually all respects, the JVM is more mature and featureful than V8. However, that does not imply that it's a more appropriate choice for a server-side JavaScript runtime. After all, the JVM was designed from the ground up to run something vastly different than JavaScript. Though the list of alternative languages targeting the JVM is long and growing, it's unclear to me that supporting these is a priority for the JVM team (the invokedynamic instruction notwithstanding).

    I would be very interested to see some benchmarks comparing Rhino, V8, SpiderMonkey, et al. on workloads characteristic of server applications. The results on arewefastyet.com are interesting, but don't include the JVM. Unfortunately, I'm not enough of a JavaScript expert to opine on the relevance of the benchmarks used there to a workload more classically "server". Perhaps someone else from the community could weigh in on this?

    It would be wonderful if someone at Joyent or Yahoo! could contribute a representative benchmark (it's open source!) and/or include a JVM-based engine to AWFY.

    V8's lack of threads inhibits fault isolation

    This is a specious argument, IMO.

    In a system which runs each request in its own thread, fault isolation is no better than a system which multiplexes requests over a single thread.

    Let's examine what happens when a fault occurs in a thread-per-request model. In languages without direct memory access (JavaScript, Java, etc), faults in threads are bubbled up to the top of the thread stack, say by virtue of an exception. Other threads continue to soldier on in their work, unaffected by this. I think this is what Bruno is referring to. Of course, it is possible for a problem in one thread to cause others to fail, particularly by exhausting available resources (e.g. file descriptors, memory, database connections, etc). In addition, bugs in application synchronization will cause other threads to deadlock, see corrupted state, etc.

    In a multiplexed model (e.g. NodeJS), things are largely the same: a fault will bubble up to the top of the event loop as an exception where it will be dealt with. Other requests are unaffected. Significantly, the lack of parallelism within the same address space suggests that this model may in fact be less likely to fail than a threaded model (no locking bugs!).

    This argument boils down to which VM is more likely to crash of its own volition, taking all requests (be they in separate threads or not) with it. I don't have any information one way or the other on whether the JVM is more reliable than V8. Bruno, do you?

    The V8 team's commitment to NodeJS is uncertain

    While there doesn't seem to be a formal commitment here, anecdotal evidence suggests that the V8 team is interested in seeing server-side JavaScript (and NodeJS in particular) succeed.

    @jasonh makes several excellent points on the nature of this relationship (and Joyent's) in his blog post.

  7. Benchmarking Web Socket servers with wsbench

    Web Sockets are gaining traction as a realtime full-duplex communication channel, with several leading browsers (Chrome 5, Safari 5, Firefox 4, etc) having implemented support for some flavor of the protocol. Server support exists, but is not widespread, and is (entirely?) limited to specialized servers, with Socket.IO (based on NodeJS) being perhaps the most well-known. Finally, there appear to be no tools to do load or other testing on Web Socket services.

    Enter wsbench, a benchmarking tool for draft76 Web Socket servers featuring

    • Ability to generate a high degree of load using a single client process.
    • Easily change the number and rate of connections and number and size of messages to send/receive with each connection.
    • Ability to target a specific port, path, and Web Socket protocol in the target server.
    • The core request/session engine is easily scriptable using JavaScript.

    You'll find that wsbench is quite easy to use.

    Here, we open and close 1000 connections to a Web Socket server running on localhost, port 8000. We generate 50 connection requests per second.

        % time ./wsbench -c 1000 -r 50 ws://localhost:8000
        Success rate: 100% from 1000 connections
        real    0m20.379s
        user    0m1.340s
        sys     0m0.517s

    We can see a few interesting things in the output above.

    • The error rate is tracked by wsbench and reported at the end. Errors include failure to open a connection, failure to send a message, failure to close the connection cleanly, etc.
    • The wsbench process running on a late-model MacBook is able to generate this load using less than 10% of the CPU.

    The above benchmarking run only tested establishing connections. We didn't send (or receive) any messages. By passing the -m 5 and -s 128 options to wsbench, we can send five 128-byte messages per connection. Invoke wsbench with the -h option to see full usage:

    usage: wsbench [options] <url>
    Kick off a benchmarking run against the given ws:// URL.
    We can execute our workload in one of two ways: serially, wherein each
    connection is closed before the next is initiated; or in parallel, wherein
    a desired rate is specified and connections initiated to meet this rate,
    independent of the state of other connections. Serial execution is the
    default, and parallel execution can be specified using the -r <rate>
    option. Parallel execution is bounded by the total number of connections
    to be made, specified by the -c option.
    Available options:
      -c, --num-conns NUMBER   number of connections to open (default: 100)
      -h, --help               display this help
      -m, --num-msgs NUMBER    number of messages per connection (default: 0)
      -p, --protocol PROTO     set the Web Socket protocol to use (default: empty)
      -r, --rate NUMBER        number of connections per second (default: 0)
      -s, --msg-size NUMBER    size of messages to send, in bytes (default: 32)
      -S, --session FILE       file to use for session logic (default: None)

    Beyond performance testing, it can be useful to run load against a server continually to discover any resource leaks. This can be done by passing the -c 0 option -- 0 connections is interpreted as a special "infinite" value. For example -c 0 -r 100 will open/close 100 connections per second indefinitely (or until wsbench is terminated with a ^C).

    For information on how to script the core of wsbench, take a look at the Session Scripting section in the project page on GitHub. Because this tool is written entirely in JavaScript (using NodeJS), you'll find that it's easily extensible using a familiar language.

  8. Using sendfile(2) with NodeJS

    NodeJS provides an interface to the sendfile(2) system call. Briefly, this system call allows the kernel to efficiently transport data from a file on-disk to a socket without round-tripping the data through user-space. This is one of the more important techniques that HTTP servers use to get good performance when serving static files.

    Using this is slightly tricky in NodeJS, as the sendfile(2) call is not guaranteed to write all of a file's data to the given socket. Just like the write(2) system call, it can declare success after only writing a portion of the file contents to the given socket. This is commonly the case with non-blocking sockets, as files larger than the TCP send window cannot be buffered entirely in the TCP stack. At this point, one must wait until some of this outstanding data has been flushed to the other end of the TCP connection.

    Without further ado, the following code implements a TCP server that uses sendfile(2) to transfer the contents of a file to every client that connects.

    This is doing a couple of interesting things

    • We use the synchronous version of the sendfile API. We do this because we don't want to round-trip through the libeio thread pool.
    • We need to handle the EAGAIN error status from the sendfile(2) system call. NodeJS exposes this via a thrown exception. Rather than issuing another sendfile call right away, we wait until the socket is drained to try again (otherwise we're just busy-waiting). It's possible that the performance cost of generating and handling this exception is high enough that we'd be better off using the asynchronous version of the sendfile API.
    • We have to kick the write IOWatcher on the net.Stream instance ourselves to get the drain event to fire. This class only knows how to start the watcher itself when it notices a write(2) system call fail. Since we're using sendfile(2) behind its back, we have to tell it to do this explicitly.
    • We notice that we've hit the end of our source file when sendfile(2) returns 0 bytes written.
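    The retry pattern in those bullets can be sketched independently of the historical NodeJS sendfile binding. Here `sendOnce` is a stand-in for the syscall wrapper and `onBlocked` models waiting for the socket's drain event (a real server would return to the event loop rather than retry synchronously):

```javascript
// Drive a partial-write primitive until `length` bytes are sent.
// `sendOnce(offset, remaining)` stands in for sendfile(2): it returns the
// number of bytes written, or throws an error with code 'EAGAIN' when the
// socket's buffers are full. On EAGAIN we invoke `onBlocked` (modeling the
// wait for 'drain') and then try again.
function sendAll(sendOnce, length, onBlocked) {
  var offset = 0;
  while (offset < length) {
    var written;
    try {
      written = sendOnce(offset, length - offset);
    } catch (e) {
      if (e.code === 'EAGAIN') { onBlocked(); continue; }
      throw e;
    }
    if (written === 0) break; // end of the source file
    offset += written;
  }
  return offset;
}
```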
  9. More intelligent HTTP routing with NodeJS

    Earlier this week, I wrote an article for YDN covering some of the reasons why one might want to run a multi-core HTTP server in NodeJS and some strategies for intelligently allocating connections to different workers. While routing based on characteristics of the TCP connection is useful, the approach outlined in that post has a serious shortcoming - we cannot actually read any data off of the socket when making these decisions. Doing so before passing off the file descriptor would cause the worker process to miss critical request data, choking the HTTP parser.

    The above limitation precludes interrogating properties of the HTTP request itself (e.g. headers, query parameters, etc) to make routing decisions. In practice, there are a wide variety of use-cases where this is important: routing by cookie, vhost, path, query parameters, etc. In addition to cache affinity, this can provide some rudimentary forms of access control (e.g. by running each vhost in a process with a different UID or chroot(2) jail) or even QoS (e.g. by running each vhost in a process with its nice(2) value controlled).

    Naively we could use NodeJS as a reverse HTTP proxy (and a pretty good one, at that), but the overhead of proxying every byte of every request is kind of a drag. As it turns out, we can use file descriptor passing to efficiently hand off each TCP connection to the appropriate worker once we've read enough of the request to make a routing decision. Thus, once the routing process delegates a connection to a worker, that worker owns it completely and the routing process has nothing more to do with it. No juggling connections, no proxying traffic, nothing. The trick is to do this in such a way that allows the routing process to parse as much of the request as it needs to while ensuring that all socket data remains available to the worker.

    Step by step, we can do the following. Note that this does not work with HTTP/1.1 keep-alive, which multiplexes multiple requests over a single connection.

    1. Accept the TCP connection in the routing process
    2. Set up a data handler for the TCP connection that both retains a record of every byte received and uses a specially-constructed instance of the interruptible HTTP parser (part of NodeJS core) to parse as much of the request as we need
    3. Once we've seen enough of the request, make a routing decision; here we just use the vhost specified in the request
    4. Hand off the file descriptor and all data seen thus far to the worker
    5. In the worker, construct a net.Stream connection around the received FD and use it to emit a synthetic 'data' event to replay data already read off of the socket by the routing process

    It's important to note that this does not rely on any modifications to the HTTP stack in the worker - just plain vanilla NodeJS. In order to do this, we have to recover from the fact that parsing the HTTP request in the routing process is destructive - it's pulling bytes off of the socket that are not available to the worker once it takes over the TCP connection. To make sure that the worker doesn't miss a single byte seen on the socket since its inception, we send over all data seen thus far and replay it in the worker using the synthetic 'data' event.

    Keep in mind that this code is a prototype only (please don't ship it - I've left out a lot of error handling for the sake of readability ;), but I thought it was interesting enough to share with a broader audience. This implementation takes advantage of the task management and message passing facilities of node-webworker. It should run out of the box on node-v0.1.100.

    Anyway, the key to this is being able to replay the socket's data in the worker. You'll notice in the gist above that we're calling net.Stream.pause() once we've received all necessary data in the routing process. This ensures that this process doesn't pull any more data off of the socket. If the kernel's TCP stack receives more data for this socket after we've paused the stream, it will sit in the TCP receive buffer waiting for someone to read it. Once the worker process ingests the passed file descriptor and inserts it into its event loop, this newly-arrived data will be read. In a nutshell, we use the TCP stack itself to buffer data for us. If we really wanted to be clever, we might be able to use recv(2) with MSG_PEEK to look at data arriving on the socket while leaving it for the worker, but I'm not sure how this would play with the event loop.

    Finally, while I think this is an interesting technique, it's worth noting that a typical production NodeJS deployment would be behind an HTTP load balancer anyway, to front multiple physical hosts for availability if nothing else. Many load balancers can route requests based on a wide variety of characteristics like vhost, client IP, backend load, etc. However, if one doesn't want/need a dedicated load balancer, or needs very application-specific logic to make routing decisions, I think the above could be a useful tool.

  10. Design of Web Workers for NodeJS

    The node-webworker module aims to implement as much of the HTML5 Web Workers API as is practical and useful in the context of NodeJS. Extensions to the HTML5 API are provided where it makes sense (e.g. to allow file descriptor passing).


    Why bother to implement Web Workers for NodeJS? After all, child process support is already provided by the child_process module.

    • A set of standard (well, emerging standard anyway) platform-independent concurrency APIs is a useful abstraction, particularly as HTML5 gains wider adoption and JavaScript developers are likely to be familiar with Web Workers from doing browser development. While child_process, the set of NodeJS primitives for managing processes, provides a lot of utility, it is easily misunderstood by developers who have not developed for a UNIX platform before (e.g. why does kill() not kill my process?). In addition, the error reporting APIs in the Web Workers spec are more full-featured and JavaScript-specific than those provided natively by child_process (e.g. one can get a stack trace, etc).
    • Existing communication mechanisms with child processes involve communicating over stdin / stdout. Use of these built-in streams prevents sys.puts() and friends from working as expected. Further, these are opaque byte streams, requiring the application to implement its own framing logic to discern message boundaries.
    • HTML5 Shared Workers (also part of the same spec) provide a useful naming service for communicating with other workers by name. Without this, the application must maintain its own metadata for routing messages between workers. Note that shared workers are not yet implemented.
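    To make the framing point in the second bullet concrete, here is the sort of length-prefix scheme an application would otherwise have to hand-roll on top of a raw byte stream (illustrative only, not part of child_process or node-webworker):

```javascript
// Each frame: a 4-byte big-endian payload length, then a JSON payload.
function frame(obj) {
  const payload = Buffer.from(JSON.stringify(obj));
  const header = Buffer.alloc(4);
  header.writeUInt32BE(payload.length, 0);
  return Buffer.concat([header, payload]);
}

// De-framing must cope with arbitrary chunk boundaries: a single read may
// contain half a message, or several messages back to back.
function deframe(buffer) {
  const messages = [];
  while (buffer.length >= 4) {
    const len = buffer.readUInt32BE(0);
    if (buffer.length < 4 + len) break; // incomplete frame; wait for more data
    messages.push(JSON.parse(buffer.slice(4, 4 + len).toString()));
    buffer = buffer.slice(4 + len);
  }
  return { messages, rest: buffer };
}
```

    Getting the partial-read edge cases right is exactly the kind of boilerplate a message-passing API should absorb.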


    The design that follows for Web Workers is motivated by a handful of underlying assumptions / philosophies:

    • Worker instances should be relatively long-lived. That is, it is not considered an important workload to be able to create and destroy thousands of workers as quickly as possible. Passing messages to existing workers to dispatch work items is favored over creating a new worker for each work item.
    • In the future, it will be desirable to run workers off-box, and to implement workers in other application frameworks / languages. This is particularly relevant in the choice of communication medium.
    • When practical, relevant standards and existing building blocks should be taken advantage of, particularly those that are geared towards JavaScript and/or HTTP. For example, this was one of the motivators for selecting Web Sockets as a messaging layer rather than rolling my own.

    Worker processes

    Each worker executes in its own self-contained node process rather than as a separate thread and V8 context within the master process.

    The benefits of this approach include fault isolation (a worker running out of memory or triggering some buggy C++ code will not take down other workers); avoiding the complexity of managing multiple event loops in a single process; and better utilization of multiple CPUs, since typical OSes are more likely to schedule different processes on different CPUs (which may not always happen for multiple threads within the same process).

    Of course, there are drawbacks: context switching between workers is more expensive in a process-per-worker model than it would be in a thread-per-worker model; passing messages between processes typically requires a data copy and always requires serializing data; and spawning a new process carries its own overhead.

    The worker context

    Each worker is launched by lib/webworker-child.js, which is handed paths to the UNIX socket to use for communication with the parent process (see below) and the worker application itself.

    This script is passed to node as the entry point for the process and is responsible for constructing a V8 script context populated with bits relevant to the Web Worker API (e.g. the postMessage(), close() and location primitives, etc). This also establishes communication with the parent process and wires up the message send/receive listeners. It's important to note that all of this happens in a context entirely separate from the one in which the worker application will be executing; the worker gets a seemingly plain-Jane Node runtime with the Web Worker API bolted on. The worker application doesn't need to require() additional libraries or anything.

    Inter-Worker communication

    The Web Workers spec describes a simple message passing API.

    Under the covers, this is implemented by connecting each dedicated worker to its parent process with a UNIX domain socket. This is lower overhead than TCP, and allows for UNIX goodies like file descriptor passing. Each master process creates a dedicated UNIX socket for each worker at the path /tmp/node-webworker-<pid>/<worker-id>, where <pid> is the PID of the process doing the creating, and <worker-id> is an ID of the worker being created. Although muddying up the filesystem namespace doesn't thrill me, this makes the implementation easier than listening on a single socket for all workers.
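    The path scheme itself is trivial to construct (a sketch; the real module's internals may differ):

```javascript
// Per-worker UNIX socket path: /tmp/node-webworker-<pid>/<worker-id>
function socketPath(pid, workerId) {
  return '/tmp/node-webworker-' + pid + '/' + workerId;
}

console.log(socketPath(1234, 0)); // → /tmp/node-webworker-1234/0
```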

    Message passing is done over this UNIX socket by negotiating an HTML5 Web Socket connection over this transport. This is done to provide a reasonably-performant standards-based message framing implementation and to lay the groundwork for communicating with off-box workers via HTTP over TCP, which may be implemented in another application stack entirely (e.g. Java, etc). The overhead of negotiating and maintaining the Web Socket connection is 1 round trip for handshaking and the overhead of maintaining HTTP state objects (http_parser and such). The handshaking overhead is not considered an undue burden given that workers are expected to be relatively long-lived, and the HTTP state overhead is considered small.

    Message format

    The format of the messages themselves is JSON, serialized using JSON.stringify() and de-serialized using JSON.parse(). Significantly, the use of a framing protocol allows the Web Workers implementation to wait for a complete JSON blob to arrive before invoking JSON.parse(). Although not implemented, it should be possible to negotiate supported content encoding (e.g. to support MsgPack, BERT, etc) when setting up the Web Socket connection. The built-in JSON object is relatively performant, though node-msgpack is quite a bit faster, particularly when de-serializing.

    Each object passed to postMessage() is wrapped in an array like so: [<msg-type>, <object>]. This allows the receiving end of the message to distinguish control messages (CLOSE, ERROR, etc) from user-initiated messages.
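    The envelope is simple enough to sketch end to end. The numeric type constants below are illustrative; the real module defines its own:

```javascript
// Control and user messages share one envelope: [<msg-type>, <object>].
const MSGTYPE_USER = 0;   // illustrative constants,
const MSGTYPE_CLOSE = 1;  // not node-webworker's actual values
const MSGTYPE_ERROR = 2;

function wrap(type, obj) {
  return JSON.stringify([type, obj]);
}

function unwrap(str) {
  const [type, obj] = JSON.parse(str);
  return { type, obj };
}

const wire = wrap(MSGTYPE_USER, { job: 'resize', id: 7 });
const msg = unwrap(wire);
if (msg.type === MSGTYPE_USER) {
  console.log(msg.obj.job); // → resize
}
```

    A CLOSE or ERROR message travels over the same channel and is peeled off by the receiving end before user code ever sees it.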

    Sending file descriptors

    As mentioned above, this Web Workers implementation can take advantage of node's ability to send file descriptors using UNIX sockets. As a nonstandard extension to the postMessage(obj [,<fd>]) API, an optional file descriptor can be specified. A symmetric API extension was made on the receiving end, where the onmessage(obj [,<fd>]) handler is passed an fd parameter if a file descriptor was received along with the specified message.

    Unfortunately, UNIX sockets seem to allow file descriptors to arrive out-of-band with respect to the data payload with which they were sent. To tie a received file descriptor to the message with which it was sent, all messages are wrapped in an array of the form [<fd-seqno>, <obj>], where <obj> is the object passed to postMessage(). The <fd-seqno> parameter starts off at 0 and is incremented for every file descriptor sent (the first file descriptor sent has a <fd-seqno> of 1). This provides the receiving end with enough metadata to tie out-of-band descriptors together with their originating message.
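    The sequencing logic can be sketched with in-process stand-ins for the socket. This is a simplified model (it assumes a descriptor arrives before the message that references it; the real implementation must also cope with the opposite ordering), and the helper structure is hypothetical:

```javascript
// Sender side: every message carries the fd sequence number observed so far;
// sending a descriptor bumps the counter first, tying fd N to one message.
function makeSender(send, sendFd) {
  let fdSeqno = 0;
  return function postMessage(obj, fd) {
    if (fd !== undefined) {
      fdSeqno++;
      sendFd(fd);
    }
    send([fdSeqno, obj]); // wire format: [<fd-seqno>, <obj>]
  };
}

// Receiver side: descriptors arrive out-of-band, so they are queued and
// matched to messages by sequence number.
function makeReceiver(onmessage) {
  const fds = []; // fds[n - 1] holds the descriptor with seqno n
  let delivered = 0;
  return {
    gotFd(fd) { fds.push(fd); },
    gotMessage([seqno, obj]) {
      if (seqno > delivered) {
        delivered = seqno;
        onmessage(obj, fds[seqno - 1]); // this message brought a new fd
      } else {
        onmessage(obj); // seqno unchanged: no fd with this message
      }
    },
  };
}
```

    A message whose seqno matches the previously delivered value carried no descriptor; a bumped seqno tells the receiver exactly which queued descriptor belongs to it.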