To keep the implementation as flexible as possible and to avoid interfering with the rest of the kernel wherever possible, I have not integrated NAT into an existing framework inside the kernel, such as the one used for registering new firewall modules. That would have been a cleaner approach, but this part of the (2.0) Linux kernel has some inconsistencies. Masquerading may or may not work well, but the way it is implemented in the kernel does not encourage others to do the same with their code. The firewall code contains a somewhat generic interface for registering new firewall modules, but much is missing. For example, masquerading needs to keep state information about all connections, the firewall should do the same (which it does not), and NAT in general needs state information. So it occurred to me that this information could be collected, stored and managed by the same generic routines, which would then be available to other future kernel enhancements as well. After thinking about it for a while I decided that this would be too great a task to solve in the time available. It would have involved not just designing these generic routines and data structures, but also rewriting large parts of the firewall and masquerading code. Implementing a better firewall with state information alone is a task currently being tackled by a group of people on the Internet, and changing the working masquerading code, which is quite complex, is far more than a month's work. Then there is the famous saying ``Never touch a running system'', and a running (NAT) system, not just a framework, was exactly what I wanted. This reveals my attitude towards the different software development models: I prefer RAD because I want a running system to experiment with rather than playing with concepts that may or may not work. That does not mean I started coding immediately.
I discussed many aspects of NAT for more than two months, thought it over for another month or two, and only then wrote the first line of code. There is a point where mere thought does not get us much further and where experiment must start. It is the same in physics, with experimentalists and theorists, and history shows that we became more successful than ever before once both directions were pursued. I say this because some people have already asked me why I did not implement it this or that way, for instance by integrating it with firewalling; in general, why I added a new layer. The simple answer is that I wanted a result and did not want to think about anything other than NAT more than necessary. I thought it necessary to see if and how NAT works in the first place, and to see the implications and the restrictions, before taking the next step of integrating it with other features. It is just like in physics, like looking for the great theory of everything: for the big picture we have to know the components first.
An interesting issue in kernel programming is debugging. No code will work exactly as expected, as every programmer knows. Although I had read about a tool that should allow some basic debugging of kernel code, I did not trust it and used breakpoints and the reset button on the computer instead. This worked quite well, especially once I got used to kernel programming. The first time I encountered a NULL pointer reference I spent almost a week hunting the bug; by the end of the project, finding even more subtle NULL references did not take me longer than ten minutes. This is interesting because I did not really know more about the kernel or about C than when I started, at least nothing that would be useful for finding this kind of bug, so it was my intuition rather than my knowledge that had developed and that guided me. This side note just shows what a powerful ally the subconscious mind is: it stores and processes lots of unbelievably fuzzy data, something the artificial intelligence people have been working on for years.
I started with the new 2.1.X kernel but soon felt it would be better to use a stable kernel. I could never be sure whether a kernel panic was caused by my code or by experimental code contributed by others working on different parts of the system. Going back to 2.0 I realized that the 2.1 kernel code was significantly cleaner in some places, but I still chose stability of the development system over new features and a cleaner implementation. With 2.1, even when the kernel itself worked well, there was always some tool that did not work very well with the new kernel.
To achieve independence and flexibility, the changes made to the kernel itself are minimal. Most of them do some initialization at kernel startup or are hooks where the real NAT module can register itself with the kernel. The only real NAT code in the kernel consists of calls to a function in the module that examines each IP packet. These calls take place right when an IP packet comes in and right before an IP packet is transmitted, and they are made only if the NAT module has been inserted into the kernel; without the module the system behaves just as if it had been compiled without NAT support.
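The hook mechanism described above can be illustrated with a small user-space sketch. It is not the code of the actual kernel patch: the names nat_register_hook, nat_unregister_hook, nat_call_hook and the struct packet type are hypothetical and chosen only for this illustration. The idea shown is a function pointer that stays NULL until a module registers itself, so that the packet path is unchanged when no module is loaded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packet type; a real kernel passes its own buffer structure. */
struct packet {
    unsigned int saddr;
    unsigned int daddr;
};

/* Assumed signature of the per-packet function supplied by the module. */
typedef int (*nat_hook_fn)(struct packet *pkt, int outbound);

/* NULL means "no NAT module inserted": the packet path stays untouched. */
static nat_hook_fn nat_hook = NULL;

/* Called by the module on insertion. Returns 0, or -1 if a hook is already set. */
int nat_register_hook(nat_hook_fn fn)
{
    assert(fn != NULL);
    if (nat_hook != NULL)
        return -1;
    nat_hook = fn;
    return 0;
}

/* Called by the module on removal. */
void nat_unregister_hook(void)
{
    nat_hook = NULL;
}

/* Placed on the receive path (outbound = 0) and the transmit path (outbound = 1). */
int nat_call_hook(struct packet *pkt, int outbound)
{
    if (nat_hook == NULL)
        return 0;               /* behave as if NAT were not compiled in */
    return nat_hook(pkt, outbound);
}

/* Example module function: rewrites the destination of inbound packets only. */
int demo_translate(struct packet *pkt, int outbound)
{
    if (!outbound)
        pkt->daddr = 0x0a000001u;   /* 10.0.0.1, an arbitrary demo value */
    return 1;
}
```

Calling nat_call_hook() before any registration returns 0 and leaves the packet alone; after nat_register_hook(demo_translate) the same call rewrites the destination address of an inbound packet.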
NAT rules can be bound to the inbound or the outbound direction. If that were not possible, NAT could not be wrapped around the entire kernel and we could translate only incoming or only outgoing packets, losing flexibility.
I reused large portions of the packet matching code of Linux's firewalling. Here the above discussion about why NAT has been implemented the way it is, and not integrated with other parts of the kernel, could be continued, because one might rightly argue that packet matching is the same for firewalling and for NAT. Again, to repeat myself, I wanted to implement and test NAT in the time I had for the project, and finding a good and general solution for integrating parts of the kernel was too big a task, especially since
The data structure used to store the various address translation rules is a
list. Each rule specifies some criteria a packet has to match in order to be
translated using the rule's NAT-IPs. These criteria are source and destination
IP and mask, source and destination port, protocol (UDP, ICMP, TCP) and the interface
on which the packet arrived or through which it is going to be sent. The packet
is also checked against the reverse of the current rule in order to enable bidirectional
rules. If any of the criteria used for packet matching is left empty, every packet
matches that criterion. Because order is important there was no choice but to
use a linear list. There are also skip-rules which, when matched by the packet's
IP header, cause the NAT function to skip this packet. Each rule has a unique
number so that new rules can be inserted anywhere into the chain. This is another
example where NAT is more flexible than the current firewalling code that could
have been used. To keep this flexibility when integrating NAT into existing
code I would have had to rewrite the other code as well, which would have been
too much work.
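The rule list and its matching logic can be sketched as follows. This is only an illustration under assumptions, not the module's real code: all structure and function names are hypothetical, the interface criterion is left out for brevity, a zero field stands for an empty criterion, and keeping the chain sorted by rule number is one assumed way of allowing insertion at any position.

```c
#include <assert.h>
#include <stddef.h>

#define NAT_F_SKIP  0x1   /* matching packets are left untranslated */
#define NAT_F_BIDIR 0x2   /* also match with source and destination swapped */

/* The header fields of a packet that take part in matching. */
struct pkt_key {
    unsigned int saddr, daddr;
    unsigned short sport, dport;
    unsigned char proto;
};

struct nat_rule {
    int num;                      /* unique rule number */
    unsigned int src, src_mask;   /* mask 0 matches any address */
    unsigned int dst, dst_mask;
    unsigned short sport, dport;  /* 0 means "any port" */
    unsigned char proto;          /* 0 means "any protocol" */
    unsigned int flags;
    struct nat_rule *next;
};

/* Compare one direction of a packet against the rule's criteria. */
static int match_one(const struct nat_rule *r, unsigned int sa, unsigned int da,
                     unsigned short sp, unsigned short dp, unsigned char proto)
{
    if ((sa & r->src_mask) != (r->src & r->src_mask)) return 0;
    if ((da & r->dst_mask) != (r->dst & r->dst_mask)) return 0;
    if (r->sport && r->sport != sp) return 0;
    if (r->dport && r->dport != dp) return 0;
    if (r->proto && r->proto != proto) return 0;
    return 1;
}

/* A bidirectional rule also matches the reverse direction. */
int nat_rule_matches(const struct nat_rule *r, const struct pkt_key *k)
{
    if (match_one(r, k->saddr, k->daddr, k->sport, k->dport, k->proto))
        return 1;
    if (r->flags & NAT_F_BIDIR)
        return match_one(r, k->daddr, k->saddr, k->dport, k->sport, k->proto);
    return 0;
}

/* Insert keeping ascending rule numbers; -1 if the number is already in use. */
int nat_rule_insert(struct nat_rule **head, struct nat_rule *r)
{
    struct nat_rule **pp = head;

    assert(head != NULL && r != NULL);
    while (*pp != NULL && (*pp)->num < r->num)
        pp = &(*pp)->next;
    if (*pp != NULL && (*pp)->num == r->num)
        return -1;
    r->next = *pp;
    *pp = r;
    return 0;
}

/* The first matching rule wins; a matching skip-rule yields NULL (no translation). */
const struct nat_rule *nat_lookup(const struct nat_rule *head, const struct pkt_key *k)
{
    for (; head != NULL; head = head->next)
        if (nat_rule_matches(head, k))
            return (head->flags & NAT_F_SKIP) ? NULL : head;
    return NULL;
}
```

Because the first match wins, a skip-rule with a low number placed in front of a broader rule exempts selected hosts from translation, which is why the order of the linear list matters.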
Since I wanted the implementation to be as flexible as possible I had to find
a way to allow storing information needed for such different kinds of NAT as
static NAT, dynamic NAT and virtual servers all in the same structure, i.e.
the NAT rules should look the same whatever kind of NAT they represent. Therefore
the structure that stores exactly one rule contains all the information
needed for packet matching and some additional information such as flags, packet and
byte counters and the rule's identification number. The information that differs
in data type and amount for each kind of NAT is stored in dynamically allocated
memory, and the rule only holds a pointer to the start of the area where this
information lives.
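A uniform rule structure with a pointer to kind-specific data might look like the sketch below. The field and function names (nat_rule_hdr, vserver_priv, nat_rule_set_priv) are assumptions made for illustration, and malloc() stands in for whatever allocator the kernel module really uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum nat_kind { NAT_STATIC, NAT_DYNAMIC, NAT_VSERVER };

/* Common part: identical for every kind of NAT rule. */
struct nat_rule_hdr {
    int num;                       /* rule identification number */
    enum nat_kind kind;
    unsigned long packets, bytes;  /* counters */
    size_t priv_len;               /* size of the kind-specific area */
    void *priv;                    /* start of the kind-specific area */
};

/* Example of kind-specific data: the real servers of a virtual server. */
struct vserver_priv {
    unsigned int nservers;
    unsigned int servers[8];
};

/* Attach a private copy of kind-specific data to a rule, replacing any old data. */
int nat_rule_set_priv(struct nat_rule_hdr *r, enum nat_kind kind,
                      const void *data, size_t len)
{
    void *copy = NULL;

    assert(r != NULL);
    if (len > 0) {
        if (data == NULL)
            return -1;
        copy = malloc(len);
        if (copy == NULL)
            return -1;
        memcpy(copy, data, len);
    }
    free(r->priv);                 /* free(NULL) is a no-op */
    r->kind = kind;
    r->priv = copy;
    r->priv_len = len;
    return 0;
}
```

Code that walks the rule list only touches the common part; only the handler for a given kind casts priv to its own type.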
The following example shows some static NAT rules, where all the necessary information is included in the rules themselves, and a virtual server rule (the second rule). The virtual server rule needs additional dynamic data: a list of all the IPs to which requests to the virtual server should be redirected, and a (more dynamic) list of clients and the server each has been connected to. The list of clients may become quite large, so an appropriate data structure is necessary to keep the overhead of searching this list to a minimum. A hash or a (balanced) binary tree would work. I have used a linear list, in contradiction to my own proposal. The reason is that this implementation served for experiments, and I thought a linear list would be easier to track in case something went wrong and I had to debug the code. It can easily be replaced by a more sophisticated data structure that allows much faster search algorithms.
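As an illustration of the hash alternative mentioned above, here is a minimal sketch of a chained hash table that maps a client IP to the real server it was assigned. It is not part of the implementation described here, which uses a linear list; the names, the bucket count and the hash function are arbitrary assumptions, and IP addresses are assumed to fit into an unsigned int.

```c
#include <assert.h>
#include <stdlib.h>

#define CLIENT_BUCKETS 256

struct client_entry {
    unsigned int client_ip;
    unsigned int server_ip;
    struct client_entry *next;     /* chain of entries in the same bucket */
};

struct client_table {
    struct client_entry *bucket[CLIENT_BUCKETS];
};

/* Fold the address bytes together; good enough for a demonstration. */
static unsigned int client_hash(unsigned int ip)
{
    ip ^= ip >> 16;
    ip ^= ip >> 8;
    return ip % CLIENT_BUCKETS;
}

/* Return the server bound to this client, or 0 if the client is unknown. */
unsigned int client_lookup(const struct client_table *t, unsigned int client_ip)
{
    const struct client_entry *e;

    assert(t != NULL);
    for (e = t->bucket[client_hash(client_ip)]; e != NULL; e = e->next)
        if (e->client_ip == client_ip)
            return e->server_ip;
    return 0;
}

/* Bind a client to a server, updating an existing entry. Returns 0 or -1. */
int client_bind(struct client_table *t, unsigned int client_ip, unsigned int server_ip)
{
    unsigned int h = client_hash(client_ip);
    struct client_entry *e;

    assert(t != NULL);
    for (e = t->bucket[h]; e != NULL; e = e->next)
        if (e->client_ip == client_ip) {
            e->server_ip = server_ip;
            return 0;
        }
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->client_ip = client_ip;
    e->server_ip = server_ip;
    e->next = t->bucket[h];
    t->bucket[h] = e;
    return 0;
}

/* Release all entries and leave the table empty. */
void client_table_free(struct client_table *t)
{
    unsigned int i;

    assert(t != NULL);
    for (i = 0; i < CLIENT_BUCKETS; i++) {
        while (t->bucket[i] != NULL) {
            struct client_entry *e = t->bucket[i];
            t->bucket[i] = e->next;
            free(e);
        }
    }
}
```

With a table like this, a lookup inspects only the entries of one bucket instead of the whole client list, at the price of code that is harder to follow in a debugger than a single linear list.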
Another puzzle to solve was how to get the NAT rules specified by the user into
the NAT module, which is part of the kernel. Linux has several interfaces for
user-to-kernel communication. One that is specific to networking is the call
to the function setsockopt(). The Linux firewalling administration tool
ipfwadm, written by Jos Vos (jos@xos.nl), uses this interface for sending
data to the firewall code. Since I have used ipfwadm as a basis for my
NAT administration tool ipnatadm, I have reused the idea and modified
it only slightly in order to allow for changes in the structure and size of
the data exchanged without having to recompile the kernel each time: all checks
for validity of the data are done by the module, and the kernel just
passes on all data received via setsockopt() without looking at it.
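The division of labour described above, where the kernel forwards an opaque buffer and the module validates it, can be sketched in user space as follows. The message layout (nat_msg_hdr), the command numbers and all function names are hypothetical; the real option names and structures used by ipnatadm are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAT_MSG_VERSION 1u

enum nat_cmd { NAT_CMD_ADD = 1, NAT_CMD_DELETE = 2 };

/* Assumed header that precedes every message from the user space tool. */
struct nat_msg_hdr {
    unsigned int version;
    unsigned int cmd;
    unsigned int payload_len;      /* number of bytes following the header */
};

typedef int (*nat_ctl_fn)(const void *data, size_t len);

/* Set by the module on insertion; NULL while no module is present. */
static nat_ctl_fn nat_ctl = NULL;

void nat_ctl_register(nat_ctl_fn fn)
{
    assert(fn != NULL);
    nat_ctl = fn;
}

/* "Kernel" side: hands the buffer on without interpreting its contents. */
int kernel_nat_ctl(const void *data, size_t len)
{
    if (nat_ctl == NULL)
        return -1;
    return nat_ctl(data, len);
}

/* "Module" side: every validity check lives here. Returns the command or -1. */
int module_nat_ctl(const void *data, size_t len)
{
    struct nat_msg_hdr h;

    if (data == NULL || len < sizeof h)
        return -1;
    memcpy(&h, data, sizeof h);    /* copy out; the buffer may be unaligned */
    if (h.version != NAT_MSG_VERSION)
        return -1;
    if (h.cmd != NAT_CMD_ADD && h.cmd != NAT_CMD_DELETE)
        return -1;
    if (h.payload_len != len - sizeof h)
        return -1;
    return (int)h.cmd;
}
```

Since only the module knows the layout, a change to the message structure means rebuilding the module and the administration tool, but not the kernel.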
The function setsockopt() provides one-way communication only. We want
to get some data back, however, since we are a curious species who always want
to know what is going on behind the scenes. The Linux kernel implements a great
feature for this, the proc-filesystem. This is not really a filesystem although
it looks just like one to the user. You will notice that all files under /proc
have zero size, but if you try, for example, cat /proc/cpuinfo you
will get some output. What happens is that when you access a file in the /proc hierarchy,
a function inside the kernel is called (a different one for each
file under /proc) which produces some output that is given to the user space
program as the contents of the file. It is also possible to write to some of
these files, thereby sending data to a kernel level function. I have not
used this feature and chose setsockopt() instead, for the sole reason
that the example program I used did so and I did not want to spend time writing
completely different code. Currently there are two files starting with ip_nat_*
in /proc/net/, showing information for core NAT and for virtual servers, when
the module has been inserted into the running kernel. They list the contents
of the dynamic data structures for the NAT service they represent, such as the
NAT rules or which real servers belong to a virtual server rule.
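The principle behind such a /proc file, a function that generates text on demand, can be sketched in user space like this. The function name nat_proc_format, the nat_stat structure and the output format are made up for the illustration; the real kernel callback has a different signature and the real files have their own layout.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-rule data to be shown to the user. */
struct nat_stat {
    int num;
    unsigned long packets;
    unsigned long bytes;
};

/*
 * Write one text line per rule into buf (at most size bytes including the
 * terminating NUL). Only complete lines are emitted. Returns the number of
 * characters written, not counting the NUL.
 */
size_t nat_proc_format(char *buf, size_t size,
                       const struct nat_stat *rules, size_t count)
{
    size_t used = 0;
    size_t i;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';
    for (i = 0; i < count; i++) {
        int n = snprintf(buf + used, size - used, "%d %lu %lu\n",
                         rules[i].num, rules[i].packets, rules[i].bytes);
        if (n < 0 || (size_t)n >= size - used) {
            buf[used] = '\0';      /* drop the partial line and stop */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

The text exists only while the file is being read, which is why such a file shows a size of zero in a directory listing.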
Below is a graph that shows how the module interacts with the kernel and the user:
When an IP packet arrives (1) the kernel calls the NAT module (2) giving it
a pointer to where the packet has been stored in kernel memory. The NAT module
examines the packet and does address translation if it matches any NAT rule.
It then returns (3) the packet to the kernel which in turn continues as usual
(4), doing routing or delivering it locally to a process. The same happens to
all outgoing packets (5/6), which are packets we route and locally generated
packets, just before they are transmitted and just before any ARP is done. The
packet is then given back to the kernel for further processing, which means
the device driver sends it out on the wire (7).
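To make the translation step (2/3) more concrete, the sketch below rewrites the destination address of an IPv4 header held in a byte buffer and recomputes the header checksum, which covers the address fields. It is a generic illustration and not the module's code; the function names are hypothetical. TCP and UDP checksums also cover the addresses through the pseudo-header and would need adjusting too, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard Internet checksum: one's complement of the one's complement sum. */
uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    if (len & 1)
        sum += (uint32_t)hdr[len - 1] << 8;
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Replace the destination address (given in host order) and fix the checksum. */
void nat_rewrite_daddr(uint8_t *hdr, uint32_t new_daddr)
{
    size_t ihl;
    uint16_t csum;

    assert(hdr != NULL);
    ihl = (size_t)(hdr[0] & 0x0f) * 4;   /* header length in bytes */
    assert(ihl >= 20);

    hdr[16] = (uint8_t)(new_daddr >> 24);
    hdr[17] = (uint8_t)(new_daddr >> 16);
    hdr[18] = (uint8_t)(new_daddr >> 8);
    hdr[19] = (uint8_t)new_daddr;

    hdr[10] = 0;                         /* checksum field counts as zero */
    hdr[11] = 0;
    csum = ip_checksum(hdr, ihl);
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xff);
}
```

A header with a correct checksum yields 0 when ip_checksum() is run over it including the checksum field, which gives a simple way to verify the result after translation.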
The user can influence the process by using ipnatadm to send instructions
and data to the module via setsockopt() calls, such as new NAT rules
or instructions for deleting a rule. The user can view the contents
of the dynamic data structures, where the module stores the NAT rules and the dynamic
information collected while running, either directly by reading the
NAT files in the /proc filesystem or by using ipnatadm, which also reads
these files but rewrites the lines into a more human-readable format.
I do not store complete connection state information, but only the IPs of clients using the virtual server. I have already covered this topic in an earlier section.