
Example Implementation


Examining the Premises

I implemented everything inside the Linux kernel because:

The firewalling code, which is similar in that it works on the same objects, IP packets and their data, has already been integrated into the kernel so I thought it appropriate to do the same with NAT.
Using a kernel interface to manipulate each single packet in user space takes a lot of time compared to direct kernel level access. The entire packet (not just the header, see Section ) needs to be copied twice, we cannot simply pass a pointer.

Some protocol specific things like rewriting DNS data could be handled by user level daemons, using Linux' local redirect function, for example.

In order to make the implementation as flexible as possible and to avoid interfering with the rest of the kernel wherever possible, I have not integrated NAT into an existing framework inside the kernel, like the one used for registering new firewall modules. This would have been a cleaner approach, but this part of the (2.0) Linux kernel has some inconsistencies. Masquerading may work well or not, but the way it is implemented in the kernel does not encourage others to do the same with their code. The firewall code contains a somewhat generic interface to register new firewall modules, but much is missing. For example, masquerading needs to keep state information about all connections, the firewall should do the same (which it does not), and NAT in general needs state information. So the thought came to me to have this information collected, stored and managed by the same generic routines, and made available for still other future kernel enhancements. After thinking about it for a while I decided that it would be too great a task to solve in the time available. It would have involved not just designing these generic routines and data structures, but also rewriting large parts of the firewall and masquerading code. Implementing a better firewall with state information alone is a task currently tackled by a group of people on the Internet, and changing the now working masquerading code, which is quite complex, is far more than just a month's work.

Then there is the famous saying "Never touch a running system", and that was what I wanted: a running (NAT) system, not just a framework. This reveals my attitude towards the different software development models: I prefer RAD because I want to have a running system to experiment with rather than playing with concepts that may work or not. That does not mean I started coding immediately. I discussed many aspects of NAT for more than two months, thought it over for another month or two, and only then wrote the first line of code. There is a point where mere thoughts do not get us much further and where experiment must start. This is the same as in physics, with experimentalists against theorists, and as history shows we became more successful than ever before once both directions were taken.

I thought I had to say this because some people have already asked me why I did not implement it this or that way, for example integrated with firewalling, and in general why I added a new layer. The simple answer is that I wanted to have a result and did not want to think about anything other than NAT more than necessary. I thought it necessary to see if and how NAT works in the first place, and to see the implications and the restrictions, before taking the next step of integrating it with other features. It is just like looking for the great theory of everything in physics: for the great picture we have to know the components first.

An interesting issue when doing kernel programming is debugging. No code will work exactly as expected, as every programmer knows. Although I had read about a tool that should allow some basic debugging of kernel code, I did not trust it and used breakpoints and the reset button on the computer instead. It worked quite well, especially after getting used to kernel programming. The first time I encountered a NULL pointer dereference I spent almost a week hunting the bug; by the end of the project, finding even more subtle NULL dereferences did not take me longer than ten minutes. This is interesting because I did not really know more about the kernel or about C than when I started, at least nothing that could be useful for finding this kind of bug, so it was my intuition rather than my knowledge that had developed and that guided me. This side note just shows what a powerful ally the subconscious mind is, which stores and processes lots of unbelievably fuzzy data, something the artificial intelligence people have been working on for years.

I started using the new 2.1.X-kernel but soon felt it would be better to use a stable kernel. I could never be sure if a kernel panic was caused by my code or by experimental code contributed by others who worked on different parts of the system. Going back to 2.0 I realized that 2.1 kernel code was significantly cleaner in some places, but despite that I chose stability of the development system over new features and cleaner implementation. In 2.1 even when the kernel worked well I could be sure that there was always some tool that did not work very well with this new kernel.

The Core NAT Implementation

As discussed above, NAT should be a layer encapsulating the kernel. Therefore all other network kernel functions, and thereby all user level programs, are not able to see the real (IP) world but only the addresses made visible by NAT. Whether these are real IPs or translated IPs does not matter; they all come from the NAT layer. This makes it possible to have the kernel (and all programs) live in a virtual address space that does not exist in the real world. Only NAT knows the reality. One setup that this implementation can support, and that a "regular" NAT doing translation inside the kernel cannot, is the case of networks that use the same IP address space and want to communicate with one another. I do not know of any such setup and have never heard of one, but one can imagine two independently administered networks that were both built using the same RFC 1918 address space. We can connect each of them to a different interface of the NAT router, set up rules to translate the IPs depending on which interface the packets arrived on, and the two networks can exchange IP packets.

To achieve independence and flexibility, the changes made to the kernel itself are minimal. Most of them do some initialization on kernel startup or are hooks where the real NAT module can register itself with the kernel. The only real NAT code in the kernel consists of calls to a function in the module that examines each IP packet. These calls take place right when an IP packet comes in and right before an IP packet is transmitted, and they are made only if the NAT module has been inserted into the kernel; without the module the system behaves just as if it had been compiled without NAT support.
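The hook mechanism described above can be pictured roughly as follows. This is only a user-space sketch under my own assumptions, not the code of the module: the names `nat_register_hook`, `nat_unregister_hook`, `nat_call_hook` and `struct packet` are hypothetical, and a real kernel would pass its own packet buffer structure instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's packet buffer. */
struct packet {
    uint32_t saddr;
    uint32_t daddr;
};

/* Signature of the function the NAT module provides (an assumption). */
typedef int (*nat_hook_fn)(struct packet *pkt, int outbound);

/* NULL means: no NAT module inserted. */
static nat_hook_fn nat_hook = NULL;

/* Called by the module when it is inserted. */
int nat_register_hook(nat_hook_fn fn)
{
    if (fn == NULL || nat_hook != NULL)
        return -1;              /* nothing to register, or already taken */
    nat_hook = fn;
    return 0;
}

/* Called by the module when it is removed. */
void nat_unregister_hook(void)
{
    nat_hook = NULL;
}

/* Called from the receive path (outbound = 0) and from the transmit
 * path (outbound = 1). Without a module the packet passes unchanged. */
int nat_call_hook(struct packet *pkt, int outbound)
{
    if (nat_hook == NULL)
        return 0;
    return nat_hook(pkt, outbound);
}

/* A toy module function that rewrites one address per direction. */
static int demo_translate(struct packet *pkt, int outbound)
{
    if (outbound)
        pkt->saddr = 0xC0A80001u;   /* 192.168.0.1 */
    else
        pkt->daddr = 0x0A000001u;   /* 10.0.0.1 */
    return 1;
}
```

As long as nothing is registered, `nat_call_hook()` is a cheap no-op, which corresponds to the behaviour "as if compiled without NAT support".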

NAT rules can be bound to the inbound or the outbound direction. If that were not possible, NAT could not be laid around the entire kernel and we could only translate either incoming or outgoing packets, losing flexibility.

I reused large portions of the packet matching code of Linux' firewalling. Here the above discussion about why it has been implemented the way it is and not integrated with other parts of the kernel could be continued, because one might rightly argue that packet matching is the same for firewalling and for NAT. Again, to cite myself, I wanted to implement and test NAT in the time I had for the project, and finding a good and general solution for integrating parts of the kernel was too big a task, especially since

the packet matching code is not identical; there are many special cases that need to be considered for NAT
implementing NAT as a further firewall in the firewall chain would have made NAT less flexible. Even if it could still have fulfilled all real-world tasks that way, it would not have been a good basis for my experiments.

Above I mentioned we need to keep information about fragments if we want to do port dependent NAT. Linux has the ability to defragment all packets it routes. This is even more than we need since it's only necessary when we need the port information for NAT. This way we do not need to keep fragment information. However, it does not completely work: since NAT has been laid around the kernel encircling everything it also encircles the defragmenting code, i.e. NAT for incoming packets is called before defragmentation can be done. That is why port dependent translation can only be done reliably for outgoing packets. Changing the defragmenting code so that it gets incoming packets before NAT does should be easy, but I did not bother to do it.
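The reason fragments matter for port dependent NAT is that only the first fragment of an IP packet carries the TCP or UDP header. A small sketch of the test involved, assuming the 16-bit flags/offset field of the IPv4 header has already been converted to host byte order; the macro and function names are my own and chosen so that they do not clash with system headers:

```c
#include <stdint.h>

/* Layout of the IPv4 flags/fragment-offset field (host byte order). */
#define DEMO_IP_MF      0x2000u  /* "more fragments" flag */
#define DEMO_IP_OFFMASK 0x1FFFu  /* fragment offset in 8-byte units */

/* Nonzero if the packet is part of a fragmented datagram. */
int ip_is_fragment(uint16_t frag_field)
{
    return (frag_field & (DEMO_IP_MF | DEMO_IP_OFFMASK)) != 0;
}

/* Nonzero if this packet contains the start of the payload, and
 * therefore the transport header with the port numbers. */
int ip_has_transport_header(uint16_t frag_field)
{
    return (frag_field & DEMO_IP_OFFMASK) == 0;
}
```

A later fragment (offset not zero) cannot be matched against a port rule on its own, which is why either fragment bookkeeping or prior defragmentation is needed.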

The data structure used to store the various address translation rules is a list. Each rule specifies some criteria a packet has to match in order to be translated using the rule's NAT-IPs. These criteria are source and destination IP and mask, source and destination port, protocol (UDP, ICMP, TCP) and the interface on which the packet arrived or through which it is going to be sent. It is also checked whether the packet matches the reverse of the current rule, in order to enable bidirectional rules. If any of the data used for packet matching is left empty, every packet matches that criterion. Because order is important, there was no choice but to use a linear list. There are also skip rules which, when matched by the packet's IP header, cause the NAT function to skip this packet. Each rule has a unique number so that new rules can be inserted anywhere into the chain. This is another example where NAT is more flexible than the current firewalling code that could have been used; in order to keep this flexibility when integrating NAT into existing code I would have had to rewrite the other code as well, which would have been too much work.
Since I wanted the implementation to be as flexible as possible, I had to find a way to store the information needed for such different kinds of NAT as static NAT, dynamic NAT and virtual servers all in the same structure, i.e. the NAT rules should look the same whatever kind of NAT they represent. Therefore the structure that stores exactly one rule contains all the information needed for packet matching plus some additional information like flags, packet and byte counters and the rule's identification number. The information that differs in data type and amount for each kind of NAT is stored in dynamically allocated memory, and the rule only has a pointer to the start of the area where this information lives.
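To make the rule list more concrete, here is a minimal sketch of such a structure with ordered insertion and first-match lookup. All names and the field layout are assumptions made for illustration and are not taken from the module; the reverse match for bidirectional rules is left out to keep the sketch short, and zero is used as the "empty" value that matches everything.

```c
#include <stddef.h>
#include <stdint.h>

enum nat_dir { NAT_IN, NAT_OUT };

#define NAT_F_SKIP 0x1u   /* skip rule: matching packets stay untouched */

/* One rule (hypothetical layout). Zero means "match anything". */
struct nat_rule {
    unsigned int number;          /* unique; list is kept sorted by it */
    enum nat_dir dir;             /* inbound or outbound */
    uint32_t src, src_mask;       /* host byte order for simplicity */
    uint32_t dst, dst_mask;
    uint16_t sport, dport;
    uint8_t  proto;               /* 1 = ICMP, 6 = TCP, 17 = UDP */
    int      ifindex;             /* interface the packet uses */
    unsigned int flags;
    uint32_t nat_src, nat_dst;    /* meaningful for static rules only */
    unsigned long packets, bytes; /* counters */
    void *dynamic;                /* kind-specific data, NULL if static */
    struct nat_rule *next;
};

/* The parts of a packet that matching looks at. */
struct pkt_info {
    enum nat_dir dir;
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t  proto;
    int      ifindex;
    unsigned long len;
};

static int rule_matches(const struct nat_rule *r, const struct pkt_info *p)
{
    return r->dir == p->dir
        && ((p->src ^ r->src) & r->src_mask) == 0
        && ((p->dst ^ r->dst) & r->dst_mask) == 0
        && (r->sport == 0 || r->sport == p->sport)
        && (r->dport == 0 || r->dport == p->dport)
        && (r->proto == 0 || r->proto == p->proto)
        && (r->ifindex == 0 || r->ifindex == p->ifindex);
}

/* Insert r so that the list stays sorted by rule number.
 * Returns -1 if the number is already in use. */
int nat_rule_insert(struct nat_rule **head, struct nat_rule *r)
{
    struct nat_rule **pp = head;

    while (*pp != NULL && (*pp)->number < r->number)
        pp = &(*pp)->next;
    if (*pp != NULL && (*pp)->number == r->number)
        return -1;
    r->next = *pp;
    *pp = r;
    return 0;
}

/* The first matching rule wins. Returns NULL if nothing matches or
 * if the first match is a skip rule. */
struct nat_rule *nat_lookup(struct nat_rule *head, const struct pkt_info *p)
{
    struct nat_rule *r;

    for (r = head; r != NULL; r = r->next) {
        if (!rule_matches(r, p))
            continue;
        r->packets++;
        r->bytes += p->len;
        return (r->flags & NAT_F_SKIP) ? NULL : r;
    }
    return NULL;
}
```

Because the first match wins, a skip rule with a low number in front of a broad network rule exempts a single host from translation, which is one reason why the order of the list matters.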

The following example shows some static NAT rules where all the necessary information is included in the rules, and a virtual server rule (the second rule). The virtual server rule needs additional dynamic data: a list of all the IPs to which requests to the virtual server should be redirected, and a (more dynamic) list of clients and the server each of them has been connected to. The list of clients may become quite large, so an appropriate data structure is necessary to keep the overhead of searching this list to a minimum. A hash table or a (balanced) binary tree would work. I have used a linear list, in contradiction to my own proposal. The reason is that this implementation served for experiments, and I thought it easier to track a linear list in case something went wrong and I had to debug the code. It can easily be replaced by a more sophisticated data structure allowing much faster search algorithms.
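The dynamic data of a virtual server rule could look roughly like this. It is a sketch under my own assumptions: the names are hypothetical, round robin is used as the selection algorithm only because it is the simplest one to show, and user-space `malloc()` stands in for the kernel's memory allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One known client and the real server it has been assigned to. */
struct vs_client {
    uint32_t client_ip;
    size_t   server;            /* index into vs_data.servers */
    struct vs_client *next;
};

/* Dynamic data hanging off a virtual server rule (hypothetical). */
struct vs_data {
    const uint32_t *servers;    /* real servers behind the virtual IP */
    size_t nservers;
    size_t next_server;         /* round-robin position (an assumption) */
    struct vs_client *clients;  /* linear list, as described in the text */
};

/* Return the real server IP for a client, assigning one if the client
 * is new. Returns 0 if there are no servers or no memory. */
uint32_t vs_pick_server(struct vs_data *vs, uint32_t client_ip)
{
    struct vs_client *c;

    if (vs->nservers == 0)
        return 0;
    for (c = vs->clients; c != NULL; c = c->next)   /* linear search */
        if (c->client_ip == client_ip)
            return vs->servers[c->server];

    c = malloc(sizeof *c);
    if (c == NULL)
        return 0;
    c->client_ip = client_ip;
    c->server = vs->next_server;
    vs->next_server = (vs->next_server + 1) % vs->nservers;
    c->next = vs->clients;
    vs->clients = c;
    return vs->servers[c->server];
}

/* Release the whole client list. */
void vs_free_clients(struct vs_data *vs)
{
    while (vs->clients != NULL) {
        struct vs_client *c = vs->clients;
        vs->clients = c->next;
        free(c);
    }
}
```

Replacing the linear search by a hash table would only change the lookup loop and the insertion at the end of `vs_pick_server()`; the rest of the structure could stay as it is.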

Another puzzle to solve was how to get the NAT rules specified by the user into the NAT module, which is part of the kernel. Linux has several interfaces for user-to-kernel communication. One that is specific to networking is the function setsockopt(). The Linux firewalling administration tool ipfwadm, written by Jos Vos (jos@xos.nl), uses this interface for sending data to the firewall code. Since I have used ipfwadm as the basis for my NAT administration tool ipnatadm, I have reused the idea and modified it only slightly, in order to allow for changes in the structure and size of the data exchanged without having to recompile the kernel each time. All checks for validity of the data are done by the module, and the kernel just passes on all data received via setsockopt() without looking at it.
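One way to picture "the kernel passes the data through, the module validates it" is a small self-describing message. The format below is purely hypothetical (the header fields, the version number and the command codes are my assumptions, not the format used by ipnatadm); it only illustrates how the module can check structure and size on its own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NATMSG_VERSION 1u

enum natmsg_cmd { NATMSG_ADD_RULE = 1, NATMSG_DEL_RULE = 2 };

/* Hypothetical header in front of every message. */
struct natmsg_hdr {
    uint16_t version;
    uint16_t cmd;
    uint32_t payload_len;       /* bytes following the header */
};

/* User side: assemble a message in buf. Returns its total length,
 * or 0 if buf is too small. */
size_t natmsg_build(void *buf, size_t bufsize, uint16_t cmd,
                    const void *payload, uint32_t payload_len)
{
    struct natmsg_hdr h;

    if (bufsize < sizeof h || payload_len > bufsize - sizeof h)
        return 0;
    h.version = NATMSG_VERSION;
    h.cmd = cmd;
    h.payload_len = payload_len;
    memcpy(buf, &h, sizeof h);
    if (payload_len > 0)
        memcpy((unsigned char *)buf + sizeof h, payload, payload_len);
    return sizeof h + payload_len;
}

/* Module side: the kernel handed over an opaque buffer of len bytes.
 * Returns 0 and fills *out if the message is well formed, else -1. */
int natmsg_validate(const void *buf, size_t len, struct natmsg_hdr *out)
{
    struct natmsg_hdr h;

    if (buf == NULL || len < sizeof h)
        return -1;
    memcpy(&h, buf, sizeof h);
    if (h.version != NATMSG_VERSION)
        return -1;
    if (h.cmd != NATMSG_ADD_RULE && h.cmd != NATMSG_DEL_RULE)
        return -1;
    if (h.payload_len != len - sizeof h)
        return -1;
    *out = h;
    return 0;
}
```

On the user side the finished buffer would be handed to `setsockopt(fd, level, optname, buf, len)`; the option number reserved for NAT is specific to the implementation and is not shown here.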
The function setsockopt() provides one-way communication only. We want to get some data back, however, since we are a curious species who always want to know what is going on behind the scenes. The Linux kernel implements a great feature for this, the proc filesystem. This is not really a filesystem, although it looks just like one to the user. You will notice that all files under /proc have zero size, but if you try, for example, cat /proc/cpuinfo you will get some output. What happens is that when you access a file of the /proc hierarchy, a function inside the kernel is called (a different one for each file under /proc) which produces some output that is given to the user space program as the contents of the file. It is also possible to write to some of these files, thereby sending data to a kernel level function. I have not used this feature but chose setsockopt() instead, for the sole reason that the example program I used did so and I did not want to spend time writing completely different code. Currently there are two files starting with ip_nat_* in /proc/net/, showing information for core NAT and for virtual servers, when the module has been inserted into the running kernel. They list the contents of the dynamic data structures for the NAT service they represent, such as the NAT rules or which real servers belong to a virtual server rule.
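From the point of view of a user space program, a /proc file is read like any other file, even though its reported size is zero. The following sketch counts the lines of such a file by reading until end of file instead of relying on the size; the function name `count_lines` is my own.

```c
#include <stdio.h>

/* Count the lines of a text file by reading it to the end.
 * Works for /proc files because it never looks at the file size.
 * Returns -1 if the file cannot be opened. */
long count_lines(const char *path)
{
    FILE *f = fopen(path, "r");
    long lines = 0;
    int c;

    if (f == NULL)
        return -1;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}
```

Calling `count_lines("/proc/cpuinfo")` on a Linux system returns the number of lines the kernel function behind that file produced at the moment of reading.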

Below is a graph that shows how the module interacts with the kernel and the user:

When an IP packet arrives (1) the kernel calls the NAT module (2) giving it a pointer to where the packet has been stored in kernel memory. The NAT module examines the packet and does address translation if it matches any NAT rule. It then returns (3) the packet to the kernel which in turn continues as usual (4), doing routing or delivering it locally to a process. The same happens to all outgoing packets (5/6), which are packets we route and locally generated packets, just before they are transmitted and just before any ARP is done. The packet is then given back to the kernel for further processing, which means the device driver sends it out on the wire (7).
The user can influence the process by using ipnatadm to send instructions and data to the module via setsockopt() calls, such as new NAT rules or instructions for deleting a rule. The user can view the contents of the dynamic data structures, where the module stores the NAT rules and the dynamic information collected while running, either directly by reading the NAT files in the /proc filesystem or by using ipnatadm, which also uses these files but rewrites the lines into a more human readable format.

Static Address Translation

The standard translation function used by all other NAT functions (dynamic, virtual server,...) does static translation. It gets a pointer to the buffer holding the IP packet and the new source and destination addresses that shall be inserted, including a network mask. This mask is 255.255.255.255 when the function is called by the dynamic NAT functions, since only with static NAT can entire networks be translated using the same parameters for this function. All others have no 1:1 mapping and have to keep track of the real IP to NAT-IP mapping.
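The role of the network mask can be shown with a few lines of C. This is a sketch with hypothetical names, working on addresses in host byte order: the bits selected by the mask are taken from the NAT address, the remaining host bits from the original address. A real implementation also has to update the IP header checksum and, because the addresses are part of the pseudo header, the TCP and UDP checksums; that is omitted here.

```c
#include <stdint.h>

/* The two address fields of an IP header (host byte order here). */
struct ip_addrs {
    uint32_t saddr;
    uint32_t daddr;
};

/* Network part from nat_addr, host part from the original address.
 * A mask of 0 leaves the address unchanged. */
static uint32_t nat_map(uint32_t addr, uint32_t nat_addr, uint32_t mask)
{
    return (nat_addr & mask) | (addr & ~mask);
}

/* Static translation of source and destination address.
 * Checksum updates are deliberately left out of this sketch. */
void nat_static_translate(struct ip_addrs *ip,
                          uint32_t nat_src, uint32_t src_mask,
                          uint32_t nat_dst, uint32_t dst_mask)
{
    ip->saddr = nat_map(ip->saddr, nat_src, src_mask);
    ip->daddr = nat_map(ip->daddr, nat_dst, dst_mask);
}
```

With the mask 255.255.0.0 and the NAT address 172.16.0.0, the source 10.1.2.3 becomes 172.16.2.3; with the mask 255.255.255.255 the whole address is replaced, which is the case used by the non-static NAT functions.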
Included is the ability to rewrite source and destination UDP and TCP ports, which enhances this NAT implementation further. However, this function must be used with care. Since we do not keep state information about every connection, we cannot determine the correct port for an answer packet whose port has been replaced. If we kept state information we would simply look up the connection the packet belongs to and would then know the correct ports. For this reason no bidirectional rules can be used for port rewriting. We always need two rules, one for the inbound and one for the outbound direction, each containing exactly one port the packet has to match in order to be translated. Whether the port specified is a source or a destination port depends on which port we want to rewrite. Most of the time this will probably be a destination port.
The port issue shows how important keeping state information is for NAT to really be flexible.
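The need for two unidirectional rules can be illustrated with a small sketch. The structure and names are hypothetical; each rule carries a direction, exactly one port to match, and the replacement port.

```c
#include <stdint.h>

enum dir { DIR_IN, DIR_OUT };
enum which_port { PORT_SRC, PORT_DST };

/* A stateless port rewriting rule (hypothetical layout). */
struct port_rule {
    enum dir        dir;         /* direction the rule is bound to */
    enum which_port which;       /* which port is matched and rewritten */
    uint16_t        match_port;
    uint16_t        new_port;
};

struct ports {
    uint16_t sport;
    uint16_t dport;
};

/* Apply one rule to a packet travelling in direction dir.
 * Returns 1 if the port was rewritten, 0 otherwise. */
int port_rule_apply(const struct port_rule *r, enum dir dir, struct ports *p)
{
    uint16_t *field;

    if (r->dir != dir)
        return 0;
    field = (r->which == PORT_SRC) ? &p->sport : &p->dport;
    if (*field != r->match_port)
        return 0;
    *field = r->new_port;
    return 1;
}
```

To redirect port 80 to port 8080 one would use an inbound rule matching destination port 80 and an outbound rule matching source port 8080 that restores 80. The second rule is what a connection table would otherwise provide, and it also rewrites outgoing traffic from port 8080 that has nothing to do with the redirection, which is the care the text asks for.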

Dynamic Address Translation

Dynamic address translation has not been implemented. The reasons are

that I think masquerading, which already works well in Linux, is a better choice for most purposes,
and that the other, non-traditional uses for NAT like virtual servers and virtual routes are far more interesting for a sample implementation of NAT.

Despite that, I have integrated hooks into the code so that dynamic NAT can easily be added. As for all non-static NAT variants, we have to keep dynamic information about which real IP has been mapped to which NAT-IP. The implementation does not get much harder when exceptions are to be allowed, where certain real IPs are always mapped to the same NAT-IP so that incoming connections to these IPs are possible. All we have to do is create an entry without a timeout in the table where we store the current mappings, which means it is valid forever. The other dynamic mappings eventually time out and get deleted from the table, so that their NAT-IPs can be reused for another real IP.
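Since dynamic NAT has not been implemented, the following is only a sketch of how the mapping table described above might work, with hypothetical names, a fixed pool of two NAT-IPs and time passed in as a plain number. Permanent entries simply never expire.

```c
#include <stddef.h>
#include <stdint.h>

#define DYN_POOL 2   /* number of NAT-IPs in the pool (tiny, for the demo) */

struct dyn_slot {
    uint32_t real_ip;
    long     expires;    /* absolute time; ignored for permanent entries */
    int      in_use;
    int      permanent;  /* entry without timeout */
};

struct dyn_table {
    uint32_t        nat_ip[DYN_POOL];  /* slot i hands out nat_ip[i] */
    struct dyn_slot slot[DYN_POOL];
    long            lifetime;          /* timeout of dynamic entries */
};

/* Reserve slot i forever for real_ip. Returns -1 if not possible. */
int dyn_add_permanent(struct dyn_table *t, size_t i, uint32_t real_ip)
{
    if (i >= DYN_POOL || t->slot[i].in_use)
        return -1;
    t->slot[i].real_ip = real_ip;
    t->slot[i].in_use = 1;
    t->slot[i].permanent = 1;
    return 0;
}

/* Return the NAT-IP for real_ip at time now, or 0 if the pool is full. */
uint32_t dyn_lookup(struct dyn_table *t, uint32_t real_ip, long now)
{
    size_t i;

    /* 1. Drop dynamic mappings whose timeout has passed. */
    for (i = 0; i < DYN_POOL; i++) {
        struct dyn_slot *s = &t->slot[i];
        if (s->in_use && !s->permanent && s->expires <= now)
            s->in_use = 0;
    }
    /* 2. Existing mapping: refresh its timeout and reuse it. */
    for (i = 0; i < DYN_POOL; i++) {
        struct dyn_slot *s = &t->slot[i];
        if (s->in_use && s->real_ip == real_ip) {
            if (!s->permanent)
                s->expires = now + t->lifetime;
            return t->nat_ip[i];
        }
    }
    /* 3. New mapping in the first free slot. */
    for (i = 0; i < DYN_POOL; i++) {
        struct dyn_slot *s = &t->slot[i];
        if (!s->in_use) {
            s->real_ip = real_ip;
            s->expires = now + t->lifetime;
            s->in_use = 1;
            s->permanent = 0;
            return t->nat_ip[i];
        }
    }
    return 0;
}
```

A packet for which `dyn_lookup()` returns 0 cannot be translated at that moment; what to do with it is a policy decision this sketch does not make.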

Virtual Servers

Static NAT does not need to keep any dynamic data about current IP mappings, but for the virtual server function this is necessary. The implication is that the standard NAT structure is not enough and must be extended in order to store all the dynamic information and the data about the real servers that answer packets for this virtual server. A virtual server is represented by exactly one NAT rule in the chain of rules, but since it is a dynamic rule (using dynamic data), the pointer reserved for such rules points to a structure that holds virtual server specific data. Also, the fields containing NAT-IPs and NAT-ports are meaningless for all dynamic rules, since the IP that will be used for the translation is not static but has to be derived by some algorithm from the dynamic data gathered so far. A virtual server is one virtual IP, so we store this IP in the field against which we match the destination IP of incoming packets. In the virtual server case this will always be a full IP and not a network, although a network would work almost the same way (not exactly, though, because in answer packets back to the client we need to substitute the source IP, replacing the real server's IP with the virtual server's IP). See the figure with the example rules above for what a chain of NAT rules containing a virtual server rule looks like.

I do not store complete connection state information, but only the IPs of clients using the virtual server. I have already covered this topic above.
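The two address substitutions a virtual server needs can be sketched as follows, with hypothetical names and under the assumption that the real server for this client has already been chosen. Requests get a new destination, answers get a new source, so that the client only ever sees the virtual IP.

```c
#include <stdint.h>

/* The two address fields of an IP header (host byte order here). */
struct ip_addrs {
    uint32_t saddr;
    uint32_t daddr;
};

/* Request, client -> virtual IP: send it to the chosen real server.
 * Returns 1 if the packet was translated, 0 otherwise. */
int vs_translate_request(struct ip_addrs *ip,
                         uint32_t virtual_ip, uint32_t real_ip)
{
    if (ip->daddr != virtual_ip)
        return 0;
    ip->daddr = real_ip;
    return 1;
}

/* Answer, real server -> client: make it look as if it came from
 * the virtual IP. Returns 1 if the packet was translated. */
int vs_translate_reply(struct ip_addrs *ip,
                       uint32_t virtual_ip, uint32_t real_ip)
{
    if (ip->saddr != real_ip)
        return 0;
    ip->saddr = virtual_ip;
    return 1;
}
```

For the reply direction the stored client IP tells us that a packet from a real server to this client belongs to the virtual server at all. Without full connection state, any packet from that server to that client is rewritten this way.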

Virtual Routes

I have not implemented this function. Virtual routes are almost the same as virtual servers; the functions and the implementations are similar. Similar does not mean identical: where I have to change the destination address for the virtual server function, I have to change the route and the source address here. Technologically it is the same, however, so I did not expect any surprises and left this out. Studying virtual servers revealed enough information to be prepared for an eventual implementation of virtual routes. Furthermore, as I already mentioned, I tried to avoid changes to Linux kernel code whenever possible. The routing code in particular has been radically redesigned in the 2.1.X kernel series, so I thought it would not be of much use to change 2.0 kernel routing code.


Michael Hasenstein