#, fuzzy msgid "" msgstr "" "Project-Id-Version: PACKAGE VERSION\n" "POT-Creation-Date: 2018-12-08 14:45-0200\n" "MIME-Version: 1.0\n" "Content-Type: text/plain; charset=UTF-8\n" "Content-Transfer-Encoding: 8bit\n" #. Put one translator per line, in the form NAME , YEAR1, YEAR2 msgctxt "_" msgid "translator-credits" msgstr "" #. (itstool) path: info/title #: article.translate.xml:6 msgid "Design elements of the FreeBSD VM system" msgstr "" #. (itstool) path: affiliation/address #: article.translate.xml:11 #, no-wrap msgid "\n" " dillon@apollo.backplane.com\n" " " msgstr "" #. (itstool) path: authorgroup/author #: article.translate.xml:10 msgid "" "MatthewDillon <_:address-1/> " msgstr "" #. (itstool) path: legalnotice/para #: article.translate.xml:18 msgid "FreeBSD is a registered trademark of the FreeBSD Foundation." msgstr "" #. (itstool) path: legalnotice/para #: article.translate.xml:20 msgid "Linux is a registered trademark of Linus Torvalds." msgstr "" #. (itstool) path: legalnotice/para #: article.translate.xml:22 msgid "" "Microsoft, IntelliMouse, MS-DOS, Outlook, Windows, Windows Media and Windows " "NT are either registered trademarks or trademarks of Microsoft Corporation " "in the United States and/or other countries." msgstr "" #. (itstool) path: legalnotice/para #: article.translate.xml:26 msgid "" "Motif, OSF/1, and UNIX are registered trademarks and IT DialTone and The " "Open Group are trademarks of The Open Group in the United States and other " "countries." msgstr "" #. (itstool) path: legalnotice/para #: article.translate.xml:30 msgid "" "Many of the designations used by manufacturers and sellers to distinguish " "their products are claimed as trademarks. Where those designations appear in " "this document, and the FreeBSD Project was aware of the trademark claim, the " "designations have been followed by the or the ® symbol." msgstr "" #. (itstool) path: info/pubdate #. (itstool) path: info/releaseinfo #: article.translate.xml:38 article.translate.xml:40 msgid "" "$FreeBSD: head/en_US.ISO8859-1/articles/vm-design/article.xml 43184 2013-11-" "13 07:52:45Z hrs $" msgstr "" #. (itstool) path: abstract/para #: article.translate.xml:43 msgid "" "The title is really just a fancy way of saying that I am going to attempt to " "describe the whole VM enchilada, hopefully in a way that everyone can follow." " For the last year I have concentrated on a number of major kernel " "subsystems within FreeBSD, with the VM and Swap subsystems being the most " "interesting and NFS being a necessary chore. I rewrote only " "small portions of the code. In the VM arena the only major rewrite I have " "done is to the swap subsystem. Most of my work was cleanup and maintenance, " "with only moderate code rewriting and no major algorithmic adjustments " "within the VM subsystem. The bulk of the VM subsystem's theoretical base " "remains unchanged and a lot of the credit for the modernization effort in " "the last few years belongs to John Dyson and David Greenman. Not being a " "historian like Kirk I will not attempt to tag all the various features with " "peoples names, since I will invariably get it wrong." msgstr "" #. (itstool) path: legalnotice/para #: article.translate.xml:60 msgid "" "This article was originally published in the January 2000 issue of DaemonNews. This version of " "the article may include updates from Matt and other authors to reflect " "changes in FreeBSD's VM implementation." msgstr "" #. 
(itstool) path: sect1/title #: article.translate.xml:68 msgid "Introduction" msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:70 msgid "" "Before moving along to the actual design let's spend a little time on the " "necessity of maintaining and modernizing any long-living codebase. In the " "programming world, algorithms tend to be more important than code and it is " "precisely due to BSD's academic roots that a great deal of attention was " "paid to algorithm design from the beginning. More attention paid to the " "design generally leads to a clean and flexible codebase that can be fairly " "easily modified, extended, or replaced over time. While BSD is considered an " "old operating system by some people, those of us who work on " "it tend to view it more as a mature codebase which has " "various components modified, extended, or replaced with modern code. It has " "evolved, and FreeBSD is at the bleeding edge no matter how old some of the " "code might be. This is an important distinction to make and one that is " "unfortunately lost to many people. The biggest error a programmer can make " "is to not learn from history, and this is precisely the error that many " "other modern operating systems have made. Windows NT is the best example of this, and the " "consequences have been dire. Linux also makes this mistake to some " "degree—enough that we BSD folk can make small jokes about it every once in a " "while, anyway. Linux's problem is simply one of a lack of experience and " "history to compare ideas against, a problem that is easily and rapidly being " "addressed by the Linux community in the same way it has been addressed in " "the BSD community—by continuous code development. The Windows NT folk, on the other hand, repeatedly " "make the same mistakes solved by UNIX decades ago and then spend years fixing them. Over and over again." " They have a severe case of not designed here and we " "are always right because our marketing department says so. I have " "little tolerance for anyone who cannot learn from history." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:99 msgid "" "Much of the apparent complexity of the FreeBSD design, especially in the VM/" "Swap subsystem, is a direct result of having to solve serious performance " "issues that occur under various conditions. These issues are not due to bad " "algorithmic design but instead rise from environmental factors. In any " "direct comparison between platforms, these issues become most apparent when " "system resources begin to get stressed. As I describe FreeBSD's VM/Swap " "subsystem the reader should always keep two points in mind:" msgstr "" #. (itstool) path: listitem/para #: article.translate.xml:110 msgid "" "The most important aspect of performance design is what is known as " "Optimizing the Critical Path. It is often the case that " "performance optimizations add a little bloat to the code in order to make " "the critical path perform better." msgstr "" #. (itstool) path: listitem/para #: article.translate.xml:117 msgid "" "A solid, generalized design outperforms a heavily-optimized design over the " "long run. While a generalized design may end up being slower than an heavily-" "optimized design when they are first implemented, the generalized design " "tends to be easier to adapt to changing conditions and the heavily-optimized " "design winds up having to be thrown away." msgstr "" #. 
(itstool) path: sect1/para #: article.translate.xml:126 msgid "" "Any codebase that will survive and be maintainable for years must therefore " "be designed properly from the beginning even if it costs some performance. " "Twenty years ago people were still arguing that programming in assembly was " "better than programming in a high-level language because it produced code " "that was ten times as fast. Today, the fallibility of that argument is " "obvious — as are the parallels to algorithmic design and code " "generalization." msgstr "" #. (itstool) path: sect1/title #: article.translate.xml:136 msgid "VM Objects" msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:138 msgid "" "The best way to begin describing the FreeBSD VM system is to look at it from " "the perspective of a user-level process. Each user process sees a single, " "private, contiguous VM address space containing several types of memory " "objects. These objects have various characteristics. Program code and " "program data are effectively a single memory-mapped file (the binary file " "being run), but program code is read-only while program data is copy-on-" "write. Program BSS is just memory allocated and filled with zeros on demand, " "called demand zero page fill. Arbitrary files can be memory-mapped into the " "address space as well, which is how the shared library mechanism works. Such " "mappings can require modifications to remain private to the process making " "them. The fork system call adds an entirely new dimension to the VM " "management problem on top of the complexity already given." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:152 msgid "" "A program binary data page (which is a basic copy-on-write page) illustrates " "the complexity. A program binary contains a preinitialized data section " "which is initially mapped directly from the program file. When a program is " "loaded into a process's VM space, this area is initially memory-mapped and " "backed by the program binary itself, allowing the VM system to free/reuse " "the page and later load it back in from the binary. The moment a process " "modifies this data, however, the VM system must make a private copy of the " "page for that process. Since the private copy has been modified, the VM " "system may no longer free it, because there is no longer any way to restore " "it later on." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:163 msgid "" "You will notice immediately that what was originally a simple file mapping " "has become much more complex. Data may be modified on a page-by-page basis " "whereas the file mapping encompasses many pages at once. The complexity " "further increases when a process forks. When a process forks, the result is " "two processes—each with their own private address spaces, including any " "modifications made by the original process prior to the call to " "fork(). It would be silly for the VM system to make a " "complete copy of the data at the time of the fork() " "because it is quite possible that at least one of the two processes will " "only need to read from that page from then on, allowing the original page to " "continue to be used. What was a private page is made copy-on-write again, " "since each process (parent and child) expects its own personal post-fork " "modifications to remain private to itself and not affect the other." msgstr "" #.
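The copy-on-write behavior described above is visible from user space. The following stand-alone C program is an illustration added here, not part of the original article: it creates a private demand-zero mapping, forks, and shows that a write made by the child stays private to the child.

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	/* Demand-zero, private (copy-on-write) mapping, like program BSS. */
	int *p = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return (1);
	}
	*p = 1;				/* zero-fill fault, then a write */

	pid_t pid = fork();
	if (pid == 0) {
		*p = 2;			/* COW fault: the child gets its own copy */
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	printf("parent still sees %d\n", *p);	/* prints 1, not 2 */
	return (0);
}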
(itstool) path: sect1/para #: article.translate.xml:178 msgid "" "FreeBSD manages all of this with a layered VM Object model. The original " "binary program file winds up being the lowest VM Object layer. A copy-on-" "write layer is pushed on top of that to hold those pages which had to be " "copied from the original file. If the program modifies a data page belonging " "to the original file the VM system takes a fault and makes a copy of the " "page in the higher layer. When a process forks, additional VM Object layers " "are pushed on. This might make a little more sense with a fairly basic " "example. A fork() is a common operation for any *BSD " "system, so this example will consider a program that starts up, and forks. " "When the process starts, the VM system creates an object layer, let's call " "this A:" msgstr "" #. (itstool) path: imageobject/imagedata #. This is a reference to an external file such as an image or video. When #. the file changes, the md5 hash will change to let you know you need to #. update your localized copy. The msgstr is not used at all. Set it to #. whatever you like once you have updated your copy of the file. #: article.translate.xml:192 msgctxt "_" msgid "external ref='fig1' md5='__failed__'" msgstr "" #. (itstool) path: textobject/literallayout #: article.translate.xml:196 #, no-wrap msgid "+---------------+\n" "| A |\n" "+---------------+" msgstr "" #. (itstool) path: sect1/mediaobject #: article.translate.xml:190 msgid "" " <_:" "literallayout-1/> A picture " msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:206 msgid "" "A represents the file—pages may be paged in and out of the file's physical " "media as necessary. Paging in from the disk is reasonable for a program, but " "we really do not want to page back out and overwrite the executable. The VM " "system therefore creates a second layer, B, that will be physically backed " "by swap space:" msgstr "" #. (itstool) path: imageobject/imagedata #. This is a reference to an external file such as an image or video. When #. the file changes, the md5 hash will change to let you know you need to #. update your localized copy. The msgstr is not used at all. Set it to #. whatever you like once you have updated your copy of the file. #: article.translate.xml:214 msgctxt "_" msgid "external ref='fig2' md5='__failed__'" msgstr "" #. (itstool) path: textobject/literallayout #: article.translate.xml:218 #, no-wrap msgid "" "+---------------+\n" "| B |\n" "+---------------+\n" "| A |\n" "+---------------+" msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:226 msgid "" "On the first write to a page after this, a new page is created in B, and its " "contents are initialized from A. All pages in B can be paged in or out to a " "swap device. When the program forks, the VM system creates two new object " "layers—C1 for the parent, and C2 for the child—that rest on top of B:" msgstr "" #. (itstool) path: imageobject/imagedata #. This is a reference to an external file such as an image or video. When #. the file changes, the md5 hash will change to let you know you need to #. update your localized copy. The msgstr is not used at all. Set it to #. whatever you like once you have updated your copy of the file. #: article.translate.xml:234 msgctxt "_" msgid "external ref='fig3' md5='__failed__'" msgstr "" #. 
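To make the layering concrete, here is a minimal sketch in C. The names (vobject, vpage, shadow_lookup) are invented for illustration and are simplified relative to whatever the kernel actually uses; the point is only that a fault walks from the top layer down toward the backing file, and a write fault copies the page it finds into the top layer.

#include <stddef.h>

/* Illustrative only: a VM object layer with a backing ("shadow") parent. */
struct vpage {
	size_t		 index;		/* page index within the object */
	struct vpage	*next;
};

struct vobject {
	struct vpage	*pages;		/* pages this layer actually holds */
	struct vobject	*backing;	/* next layer down (e.g. C1 -> B -> A) */
};

/*
 * Walk from the top layer toward the backing file and return the first
 * layer holding the page.  A write fault on a page found in a lower layer
 * would copy it into the top layer (copy-on-write).
 */
static struct vobject *
shadow_lookup(struct vobject *top, size_t index, struct vpage **found)
{
	for (struct vobject *obj = top; obj != NULL; obj = obj->backing)
		for (struct vpage *pg = obj->pages; pg != NULL; pg = pg->next)
			if (pg->index == index) {
				*found = pg;
				return (obj);
			}
	*found = NULL;
	return (NULL);	/* not resident anywhere: page it in from A's file */
}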
(itstool) path: textobject/literallayout #: article.translate.xml:238 #, no-wrap msgid "" "+-------+-------+\n" "| C1 | C2 |\n" "+-------+-------+\n" "| B |\n" "+---------------+\n" "| A |\n" "+---------------+" msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:248 msgid "" "In this case, let's say a page in B is modified by the original parent " "process. The process will take a copy-on-write fault and duplicate the page " "in C1, leaving the original page in B untouched. Now, let's say the same " "page in B is modified by the child process. The process will take a copy-on-" "write fault and duplicate the page in C2. The original page in B is now " "completely hidden since both C1 and C2 have a copy and B could theoretically " "be destroyed if it does not represent a real file; however, " "this sort of optimization is not trivial to make because it is so fine-" "grained. FreeBSD does not make this optimization. Now, suppose (as is often " "the case) that the child process does an exec(). Its " "current address space is usually replaced by a new address space " "representing a new file. In this case, the C2 layer is destroyed:" msgstr "" #. (itstool) path: imageobject/imagedata #. This is a reference to an external file such as an image or video. When #. the file changes, the md5 hash will change to let you know you need to #. update your localized copy. The msgstr is not used at all. Set it to #. whatever you like once you have updated your copy of the file. #: article.translate.xml:264 msgctxt "_" msgid "external ref='fig4' md5='__failed__'" msgstr "" #. (itstool) path: textobject/literallayout #: article.translate.xml:268 #, no-wrap msgid "" "+-------+\n" "| C1 |\n" "+-------+-------+\n" "| B |\n" "+---------------+\n" "| A |\n" "+---------------+" msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:278 msgid "" "In this case, the number of children of B drops to one, and all accesses to " "B now go through C1. This means that B and C1 can be collapsed together. Any " "pages in B that also exist in C1 are deleted from B during the collapse. " "Thus, even though the optimization in the previous step could not be made, " "we can recover the dead pages when either of the processes exits or calls " "exec()." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:285 msgid "" "This model creates a number of potential problems. The first is that you can " "wind up with a relatively deep stack of layered VM Objects which can cost " "scanning time and memory when you take a fault. Deep layering can occur when " "processes fork and then fork again (either parent or child). The second " "problem is that you can wind up with dead, inaccessible pages deep in the " "stack of VM Objects. In our last example, if both the parent and child " "processes modify the same page, they both get their own private copies of " "the page and the original page in B is no longer accessible by anyone. That " "page in B can be freed." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:295 msgid "" "FreeBSD solves the deep layering problem with a special optimization called " "the All Shadowed Case. This case occurs if either C1 or C2 " "takes sufficient COW faults to completely shadow all pages in B. Let's say " "that C1 achieves this. C1 can now bypass B entirely, so rather than have C1-" ">B->A and C2->B->A we now have C1->A and C2->B->A. But " "look what also happened—now B has only one reference (C2), so we can " "collapse B and C2 together.
The end result is that B is deleted entirely and " "we have C1->A and C2->A. It is often the case that B will contain a " "large number of pages and neither C1 nor C2 will be able to completely " "overshadow it. If we fork again and create a set of D layers, however, it is " "much more likely that one of the D layers will eventually be able to " "completely overshadow the much smaller dataset represented by C1 or C2. The " "same optimization will work at any point in the graph and the grand result " "of this is that even on a heavily forked machine VM Object stacks tend to " "not get much deeper than 4. This is true of both the parent and the children " "and true whether the parent is doing the forking or whether the children " "cascade forks." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:313 msgid "" "The dead page problem still exists in the case where C1 or C2 do not " "completely overshadow B. Due to our other optimizations this case does not " "represent much of a problem and we simply allow the pages to be dead. If the " "system runs low on memory it will swap them out, eating a little swap, but " "that is it." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:319 msgid "" "The advantage to the VM Object model is that fork() is " "extremely fast, since no real data copying need take place. The disadvantage " "is that you can build a relatively complex VM Object layering that slows " "page fault handling down a little, and you spend memory managing the VM " "Object structures. The optimizations FreeBSD makes prove to reduce the " "problems enough that they can be ignored, leaving no real disadvantage." msgstr "" #. (itstool) path: sect1/title #: article.translate.xml:329 msgid "SWAP Layers" msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:331 msgid "" "Private data pages are initially either copy-on-write or zero-fill pages. " "When a change, and therefore a copy, is made, the original backing object " "(usually a file) can no longer be used to save a copy of the page when the " "VM system needs to reuse it for other purposes. This is where SWAP comes in. " "SWAP is allocated to create backing store for memory that does not otherwise " "have it. FreeBSD allocates the swap management structure for a VM Object " "only when it is actually needed. However, the swap management structure has " "had problems historically:" msgstr "" #. (itstool) path: listitem/para #: article.translate.xml:343 msgid "" "Under FreeBSD 3.X the swap management structure preallocates an array that " "encompasses the entire object requiring swap backing store—even if only a " "few pages of that object are swap-backed. This creates a kernel memory " "fragmentation problem when large objects are mapped, or processes with large " "runsizes (RSS) fork." msgstr "" #. (itstool) path: listitem/para #: article.translate.xml:352 msgid "" "Also, in order to keep track of swap space, a list of holes " "is kept in kernel memory, and this tends to get severely fragmented as well. " "Since the list of holes is a linear list, the swap allocation " "and freeing performance is a non-optimal O(n)-per-page." msgstr "" #. (itstool) path: listitem/para #: article.translate.xml:360 msgid "" "It requires kernel memory allocations to take place during the swap freeing " "process, and that creates low memory deadlock problems." msgstr "" #.
(itstool) path: listitem/para #: article.translate.xml:366 msgid "" "The problem is further exacerbated by holes created due to the interleaving " "algorithm." msgstr "" #. (itstool) path: listitem/para #: article.translate.xml:371 msgid "" "Also, the swap block map can become fragmented fairly easily resulting in " "non-contiguous allocations." msgstr "" #. (itstool) path: listitem/para #: article.translate.xml:376 msgid "" "Kernel memory must also be allocated on the fly for additional swap " "management structures when a swapout occurs." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:381 msgid "" "It is evident from that list that there was plenty of room for improvement. " "For FreeBSD 4.X, I completely rewrote the swap subsystem:" msgstr "" #. (itstool) path: listitem/para #: article.translate.xml:387 msgid "" "Swap management structures are allocated through a hash table rather than a " "linear array giving them a fixed allocation size and much finer granularity." msgstr "" #. (itstool) path: listitem/para #: article.translate.xml:393 msgid "" "Rather then using a linearly linked list to keep track of swap space " "reservations, it now uses a bitmap of swap blocks arranged in a radix tree " "structure with free-space hinting in the radix node structures. This " "effectively makes swap allocation and freeing an O(1) operation." msgstr "" #. (itstool) path: listitem/para #: article.translate.xml:401 msgid "" "The entire radix tree bitmap is also preallocated in order to avoid having " "to allocate kernel memory during critical low memory swapping operations. " "After all, the system tends to swap when it is low on memory so we should " "avoid allocating kernel memory at such times in order to avoid potential " "deadlocks." msgstr "" #. (itstool) path: listitem/para #: article.translate.xml:410 msgid "" "To reduce fragmentation the radix tree is capable of allocating large " "contiguous chunks at once, skipping over smaller fragmented chunks." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:416 msgid "" "I did not take the final step of having an allocating hint pointer that would trundle through a portion of swap as allocations were made " "in order to further guarantee contiguous allocations or at least locality of " "reference, but I ensured that such an addition could be made." msgstr "" #. (itstool) path: sect1/title #: article.translate.xml:424 msgid "When to free a page" msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:426 msgid "" "Since the VM system uses all available memory for disk caching, there are " "usually very few truly-free pages. The VM system depends on being able to " "properly choose pages which are not in use to reuse for new allocations. " "Selecting the optimal pages to free is possibly the single-most important " "function any VM system can perform because if it makes a poor selection, the " "VM system may be forced to unnecessarily retrieve pages from disk, seriously " "degrading system performance." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:434 msgid "" "How much overhead are we willing to suffer in the critical path to avoid " "freeing the wrong page? Each wrong choice we make will cost us hundreds of " "thousands of CPU cycles and a noticeable stall of the affected processes, so " "we are willing to endure a significant amount of overhead in order to be " "sure that the right page is chosen. 
This is why FreeBSD tends to outperform " "other systems when memory resources become stressed." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:442 msgid "" "The free page determination algorithm is built upon a history of the use of " "memory pages. To acquire this history, the system takes advantage of a page-" "used bit feature that most hardware page tables have." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:446 msgid "" "In any case, the page-used bit is cleared and at some later point the VM " "system comes across the page again and sees that the page-used bit has been " "set. This indicates that the page is still being actively used. If the bit " "is still clear it is an indication that the page is not being actively used. " "By testing this bit periodically, a use history (in the form of a counter) " "for the physical page is developed. When the VM system later needs to free " "up some pages, checking this history becomes the cornerstone of determining " "the best candidate page to reuse." msgstr "" #. (itstool) path: sidebar/title #: article.translate.xml:456 msgid "What if the hardware has no page-used bit?" msgstr "" #. (itstool) path: sidebar/para #: article.translate.xml:458 msgid "" "For those platforms that do not have this feature, the system actually " "emulates a page-used bit. It unmaps or protects a page, forcing a page fault " "if the page is accessed again. When the page fault is taken, the system " "simply marks the page as having been used and unprotects the page so that it " "may be used. While taking such page faults just to determine if a page is " "being used appears to be an expensive proposition, it is much less expensive " "than reusing the page for some other purpose only to find that a process " "needs it back and then have to go to disk." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:469 msgid "" "FreeBSD makes use of several page queues to further refine the selection of " "pages to reuse as well as to determine when dirty pages must be flushed to " "their backing store. Since page tables are dynamic entities under FreeBSD, " "it costs virtually nothing to unmap a page from the address space of any " "processes using it. When a page candidate has been chosen based on the page-" "use counter, this is precisely what is done. The system must make a " "distinction between clean pages which can theoretically be freed up at any " "time, and dirty pages which must first be written to their backing store " "before being reusable. When a page candidate has been found it is moved to " "the inactive queue if it is dirty, or the cache queue if it is clean. A " "separate algorithm based on the dirty-to-clean page ratio determines when " "dirty pages in the inactive queue must be flushed to disk. Once this is " "accomplished, the flushed pages are moved from the inactive queue to the " "cache queue. At this point, pages in the cache queue can still be " "reactivated by a VM fault at relatively low cost. However, pages in the " "cache queue are considered to be immediately freeable and " "will be reused in an LRU (least-recently used) fashion when the system needs " "to allocate new memory." msgstr "" #. 
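A rough sketch of the aging idea, with invented names rather than the kernel's real page queue code: sample and clear the page-used bit, fold the result into a per-page counter, and deactivate pages whose counter runs down, sending dirty ones to the inactive queue and clean ones to the cache queue.

#include <stdbool.h>
#include <stddef.h>

/* Illustrative bookkeeping only; the real kernel structures differ. */
struct page {
	int		 act_count;	/* usage history */
	bool		 referenced;	/* sampled hardware page-used bit */
	bool		 dirty;
};

enum queue { ACTIVE, INACTIVE, CACHE };

static bool
test_and_clear_referenced(struct page *pg)
{
	bool was = pg->referenced;

	pg->referenced = false;		/* like clearing the PTE accessed bit */
	return (was);
}

static void
move_to_queue(struct page *pg, enum queue q)
{
	(void)pg;
	(void)q;			/* queue manipulation omitted in this sketch */
}

/*
 * Periodic scan of the active queue: pages that keep getting referenced
 * accumulate history; pages that do not eventually run out of history and
 * are deactivated.  Dirty pages still need a flush, so they go to the
 * inactive queue; clean pages can go straight to the cache queue.
 */
void
age_page(struct page *pg)
{
	if (test_and_clear_referenced(pg)) {
		if (pg->act_count < 64)
			pg->act_count++;
		return;			/* still in use, stays active */
	}
	if (--pg->act_count > 0)
		return;			/* give it a little more time */
	move_to_queue(pg, pg->dirty ? INACTIVE : CACHE);
}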
(itstool) path: sect1/para #: article.translate.xml:489 msgid "" "It is important to note that the FreeBSD VM system attempts to separate " "clean and dirty pages for the express reason of avoiding unnecessary flushes " "of dirty pages (which eats I/O bandwidth), nor does it move pages between " "the various page queues gratuitously when the memory subsystem is not being " "stressed. This is why you will see some systems with very low cache queue " "counts and high active queue counts when doing a systat -vm command. As the VM system becomes more stressed, it makes a greater " "effort to maintain the various page queues at the levels determined to be " "the most effective." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:499 msgid "" "An urban myth has circulated for years that Linux did a better job avoiding " "swapouts than FreeBSD, but this in fact is not true. What was actually " "occurring was that FreeBSD was proactively paging out unused pages in order " "to make room for more disk cache while Linux was keeping unused pages in " "core and leaving less memory available for cache and process pages. I do not " "know whether this is still true today." msgstr "" #. (itstool) path: sect1/title #: article.translate.xml:509 msgid "Pre-Faulting and Zeroing Optimizations" msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:511 msgid "" "Taking a VM fault is not expensive if the underlying page is already in core " "and can simply be mapped into the process, but it can become expensive if " "you take a whole lot of them on a regular basis. A good example of this is " "running a program such as ls1 or " "ps1 over and over again. If the program binary is mapped into " "memory but not mapped into the page table, then all the pages that will be " "accessed by the program will have to be faulted in every time the program is " "run. This is unnecessary when the pages in question are already in the VM " "Cache, so FreeBSD will attempt to pre-populate a process's page tables with " "those pages that are already in the VM Cache. One thing that FreeBSD does " "not yet do is pre-copy-on-write certain pages on exec. For example, if you " "run the ls1 program while running vmstat 1 " "you will notice that it always takes a certain number of page faults, even " "when you run it over and over again. These are zero-fill faults, not program " "code faults (which were pre-faulted in already). Pre-copying pages on exec " "or fork is an area that could use more study." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:529 msgid "" "A large percentage of page faults that occur are zero-fill faults. You can " "usually see this by observing the vmstat -s output. These " "occur when a process accesses pages in its BSS area. The BSS area is " "expected to be initially zero but the VM system does not bother to allocate " "any memory at all until the process actually accesses it. When a fault " "occurs the VM system must not only allocate a new page, it must zero it as " "well. To optimize the zeroing operation the VM system has the ability to pre-" "zero pages and mark them as such, and to request pre-zeroed pages when zero-" "fill faults occur. The pre-zeroing occurs whenever the CPU is idle but the " "number of pages the system pre-zeros is limited in order to avoid blowing " "away the memory caches. This is an excellent example of adding complexity to " "the VM system in order to optimize the critical path." msgstr "" #. 
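The pre-zeroing idea can be sketched as follows; the names (freepage, idle_prezero, alloc_zero_fill) are invented for this illustration and the real kernel code is different, but the shape is the same: zero a bounded number of free pages while the CPU is idle, then prefer those pages when a zero-fill fault needs one.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE	4096
#define PREZERO_TARGET	32	/* bounded, so idle zeroing does not flush the CPU caches */

struct freepage {
	unsigned char	 data[PAGE_SIZE];
	bool		 zeroed;
	struct freepage	*next;
};

static struct freepage	*freelist;
static int		 nzeroed;

/* Called from the idle loop: zero a few free pages ahead of time. */
void
idle_prezero(void)
{
	for (struct freepage *pg = freelist;
	    pg != NULL && nzeroed < PREZERO_TARGET; pg = pg->next) {
		if (!pg->zeroed) {
			memset(pg->data, 0, PAGE_SIZE);
			pg->zeroed = true;
			nzeroed++;
		}
	}
}

/* Zero-fill fault: prefer a page that was already zeroed while idle. */
struct freepage *
alloc_zero_fill(void)
{
	for (struct freepage **link = &freelist; *link != NULL;
	    link = &(*link)->next) {
		if ((*link)->zeroed) {
			struct freepage *pg = *link;

			*link = pg->next;
			nzeroed--;
			return (pg);
		}
	}
	if (freelist == NULL)
		return (NULL);
	/* No pre-zeroed page available: take one and zero it now. */
	struct freepage *pg = freelist;
	freelist = pg->next;
	memset(pg->data, 0, PAGE_SIZE);
	return (pg);
}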
(itstool) path: sect1/title #: article.translate.xml:545 msgid "Page Table Optimizations" msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:547 msgid "" "The page table optimizations make up the most contentious part of the " "FreeBSD VM design and they have shown some strain with the advent of serious " "use of mmap(). I think this is actually a feature of " "most BSDs though I am not sure when it was first introduced. There are two " "major optimizations. The first is that hardware page tables do not contain " "persistent state but instead can be thrown away at any time with only a " "minor amount of management overhead. The second is that every active page " "table entry in the system has a governing pv_entry " "structure which is tied into the vm_page structure. " "FreeBSD can simply iterate through those mappings that are known to exist " "while Linux must check all page tables that might " "contain a specific mapping to see if it does, which can achieve O(n^2) " "overhead in certain situations. It is because of this that FreeBSD tends to " "make better choices on which pages to reuse or swap when memory is stressed, " "giving it better performance under load. However, FreeBSD requires kernel " "tuning to accommodate large-shared-address-space situations such as those " "that can occur in a news system because it may run out of pv_entry structures." msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:567 msgid "" "Both Linux and FreeBSD need work in this area. FreeBSD is trying to maximize " "the advantage of a potentially sparse active-mapping model (not all " "processes need to map all pages of a shared library, for example), whereas " "Linux is trying to simplify its algorithms. FreeBSD generally has the " "performance advantage here at the cost of wasting a little extra memory, but " "FreeBSD breaks down in the case where a large file is massively shared " "across hundreds of processes. Linux, on the other hand, breaks down in the " "case where many processes are sparsely-mapping the same shared library and " "also runs non-optimally when trying to determine whether a page can be " "reused or not." msgstr "" #. (itstool) path: sect1/title #: article.translate.xml:580 msgid "Page Coloring" msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:582 msgid "" "We will end with the page coloring optimizations. Page coloring is a " "performance optimization designed to ensure that accesses to contiguous " "pages in virtual memory make the best use of the processor cache. In ancient " "times (i.e. 10+ years ago) processor caches tended to map virtual memory " "rather than physical memory. This led to a huge number of problems including " "having to clear the cache on every context switch in some cases, and " "problems with data aliasing in the cache. Modern processor caches map " "physical memory precisely to solve those problems. This means that two side-" "by-side pages in a processes address space may not correspond to two side-by-" "side pages in the cache. In fact, if you are not careful side-by-side pages " "in virtual memory could wind up using the same page in the processor " "cache—leading to cacheable data being thrown away prematurely and reducing " "CPU performance. This is true even with multi-way set-associative caches " "(though the effect is mitigated somewhat)." msgstr "" #. 
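Going back to the pv_entry bookkeeping from the previous section, here is a simplified sketch of why removing every mapping of a physical page is cheap when each mapping is recorded on a per-page list. These are not the kernel's actual definitions, and pmap_remove_pte is an invented placeholder for "invalidate one PTE in one page table".

#include <stddef.h>

struct pv_entry {
	struct pmap	*pmap;		/* which process page table */
	void		*va;		/* where the page is mapped */
	struct pv_entry	*next;
};

struct vm_page {
	struct pv_entry	*pv_list;	/* all current hardware mappings */
};

/* Invented placeholder: invalidate one PTE in one page table. */
void	pmap_remove_pte(struct pmap *, void *);

/*
 * Because every live PTE has a pv_entry, tearing down all mappings of a
 * page is a walk of a short list instead of a scan of every page table
 * in the system.
 */
void
remove_all_mappings(struct vm_page *m)
{
	struct pv_entry *pv;

	while ((pv = m->pv_list) != NULL) {
		m->pv_list = pv->next;
		pmap_remove_pte(pv->pmap, pv->va);
		/* the pv_entry would be freed here in a real implementation */
	}
}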
(itstool) path: sect1/para #: article.translate.xml:598 msgid "" "FreeBSD's memory allocation code implements page coloring optimizations, " "which means that the memory allocation code will attempt to locate free " "pages that are contiguous from the point of view of the cache. For example, " "if page 16 of physical memory is assigned to page 0 of a process's virtual " "memory and the cache can hold 4 pages, the page coloring code will not " "assign page 20 of physical memory to page 1 of a process's virtual memory. " "It would, instead, assign page 21 of physical memory. The page coloring code " "attempts to avoid assigning page 20 because this maps over the same cache " "memory as page 16 and would result in non-optimal caching. This code adds a " "significant amount of complexity to the VM memory allocation subsystem as " "you can well imagine, but the result is well worth the effort. Page Coloring " "makes VM memory as deterministic as physical memory in regards to cache " "performance." msgstr "" #. (itstool) path: sect1/title #: article.translate.xml:615 msgid "Conclusion" msgstr "" #. (itstool) path: sect1/para #: article.translate.xml:617 msgid "" "Virtual memory in modern operating systems must address a number of " "different issues efficiently and for many different usage patterns. The " "modular and algorithmic approach that BSD has historically taken allows us " "to study and understand the current implementation as well as relatively " "cleanly replace large sections of the code. There have been a number of " "improvements to the FreeBSD VM system in the last several years, and work is " "ongoing." msgstr "" #. (itstool) path: sect1/title #: article.translate.xml:627 msgid "Bonus QA session by Allen Briggs briggs@ninthwonder.com" msgstr "" #. (itstool) path: question/para #: article.translate.xml:633 msgid "" "What is the interleaving algorithm that you refer to in your " "listing of the ills of the FreeBSD 3.X swap arrangements?" msgstr "" #. (itstool) path: answer/para #: article.translate.xml:639 msgid "" "FreeBSD uses a fixed swap interleave which defaults to 4. This means that " "FreeBSD reserves space for four swap areas even if you only have one, two, " "or three. Since swap is interleaved the linear address space representing " "the four swap areas will be fragmented if you do not actually " "have four swap areas. For example, if you have two swap areas A and B " "FreeBSD's address space representation for that swap area will be " "interleaved in blocks of 16 pages:" msgstr "" #. (itstool) path: answer/literallayout #: article.translate.xml:648 #, no-wrap msgid "A B C D A B C D A B C D A B C D" msgstr "" #. (itstool) path: answer/para #: article.translate.xml:650 msgid "" "FreeBSD 3.X uses a sequential list of free regions approach " "to accounting for the free swap areas. The idea is that large blocks of free " "linear space can be represented with a single list node (kern/" "subr_rlist.c). But due to the fragmentation the sequential list " "winds up being insanely fragmented. In the above example, completely unused " "swap will have A and B shown as free and C and D shown as " "all allocated. Each A-B sequence requires a list node to " "account for because C and D are holes, so the list node cannot be combined " "with the next A-B sequence." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:662 msgid "" "Why do we interleave our swap space instead of just tack swap areas onto the " "end and do something fancier? 
Because it is a whole lot easier to allocate " "linear swaths of an address space and have the result automatically be " "interleaved across multiple disks than it is to try to put that " "sophistication elsewhere." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:668 msgid "" "The fragmentation causes other problems. Being a linear list under 3.X, and " "having such a huge amount of inherent fragmentation, allocating and freeing " "swap winds up being an O(N) algorithm instead of an O(1) algorithm. Combined " "with other factors (heavy swapping) and you start getting into O(N^2) and " "O(N^3) levels of overhead, which is bad. The 3.X system may also need to " "allocate KVM during a swap operation to create a new list node which can " "lead to a deadlock if the system is trying to pageout pages in a low-memory " "situation." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:678 msgid "" "Under 4.X we do not use a sequential list. Instead we use a radix tree and " "bitmaps of swap blocks rather than ranged list nodes. We take the hit of " "preallocating all the bitmaps required for the entire swap area up front but " "it winds up wasting less memory due to the use of a bitmap (one bit per " "block) instead of a linked list of nodes. The use of a radix tree instead of " "a sequential list gives us nearly O(1) performance no matter how fragmented " "the tree becomes." msgstr "" #. (itstool) path: question/para #: article.translate.xml:691 msgid "" "How is the separation of clean and dirty (inactive) pages related to the " "situation where you see low cache queue counts and high active queue counts " "in systat -vm? Do the systat stats roll the active and " "dirty pages together for the active queue count?" msgstr "" #. (itstool) path: question/para #: article.translate.xml:697 msgid "I do not get the following:" msgstr "" #. (itstool) path: blockquote/para #: article.translate.xml:700 msgid "" "It is important to note that the FreeBSD VM system attempts to separate " "clean and dirty pages for the express reason of avoiding unnecessary flushes " "of dirty pages (which eats I/O bandwidth), nor does it move pages between " "the various page queues gratuitously when the memory subsystem is not being " "stressed. This is why you will see some systems with very low cache queue " "counts and high active queue counts when doing a systat -vm command." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:712 msgid "" "Yes, that is confusing. The relationship is goal verses " "reality. Our goal is to separate the pages but the reality is " "that if we are not in a memory crunch, we do not really have to." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:717 msgid "" "What this means is that FreeBSD will not try very hard to separate out dirty " "pages (inactive queue) from clean pages (cache queue) when the system is not " "being stressed, nor will it try to deactivate pages (active queue -> " "inactive queue) when the system is not being stressed, even if they are not " "being used." msgstr "" #. (itstool) path: question/para #: article.translate.xml:727 msgid "" "In the ls1 / vmstat 1 example, would not " "some of the page faults be data page faults (COW from executable file to " "private page)? I.e., I would expect the page faults to be some zero-fill and " "some program data. Or are you implying that FreeBSD does do pre-COW for the " "program data?" msgstr "" #. 
(itstool) path: answer/para #: article.translate.xml:735 msgid "" "A COW fault can be either zero-fill or program-data. The mechanism is the " "same either way because the backing program-data is almost certainly already " "in the cache. I am indeed lumping the two together. FreeBSD does not pre-COW " "program data or zero-fill, but it does pre-map pages " "that exist in its cache." msgstr "" #. (itstool) path: question/para #: article.translate.xml:746 msgid "" "In your section on page table optimizations, can you give a little more " "detail about pv_entry and vm_page (or " "should vm_page be vm_pmap—as in 4.4, cf. pp. 180-181 of " "McKusick, Bostic, Karel, Quarterman)? Specifically, what kind of operation/" "reaction would require scanning the mappings?" msgstr "" #. (itstool) path: question/para #: article.translate.xml:753 msgid "" "How does Linux do in the case where FreeBSD breaks down (sharing a large " "file mapping over many processes)?" msgstr "" #. (itstool) path: answer/para #: article.translate.xml:758 msgid "" "A vm_page represents an (object,index#) tuple. A " "pv_entry represents a hardware page table entry (pte). If " "you have five processes sharing the same physical page, and three of those " "processes's page tables actually map the page, that page will be represented " "by a single vm_page structure and three " "pv_entry structures." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:766 msgid "" "pv_entry structures only represent pages mapped by the " "MMU (one pv_entry represents one pte). This means that " "when we need to remove all hardware references to a vm_page (in order to reuse the page for something else, page it out, clear " "it, dirty it, and so forth) we can simply scan the linked list of " "pv_entry's associated with that vm_page to remove or modify the pte's from their page tables." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:776 msgid "" "Under Linux there is no such linked list. In order to remove all the " "hardware page table mappings for a vm_page linux must " "index into every VM object that might have mapped the " "page. For example, if you have 50 processes all mapping the same shared " "library and want to get rid of page X in that library, you need to index " "into the page table for each of those 50 processes even if only 10 of them " "have actually mapped the page. So Linux is trading off the simplicity of its " "design against performance. Many VM algorithms which are O(1) or (small N) " "under FreeBSD wind up being O(N), O(N^2), or worse under Linux. Since the " "pte's representing a particular page in an object tend to be at the same " "offset in all the page tables they are mapped in, reducing the number of " "accesses into the page tables at the same pte offset will often avoid " "blowing away the L1 cache line for that offset, which can lead to better " "performance." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:793 msgid "" "FreeBSD has added complexity (the pv_entry scheme) in " "order to increase performance (to limit page table accesses to " "only those pte's that need to be modified)." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:798 msgid "" "But FreeBSD has a scaling problem that Linux does not in that there are a " "limited number of pv_entry structures and this causes " "problems when you have massive sharing of data. In this case you may run out " "of pv_entry structures even though there is plenty of " "free memory available. 
This can be fixed easily enough by bumping up the " "number of pv_entry structures in the kernel config, but " "we really need to find a better way to do it." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:808 msgid "" "In regard to the memory overhead of a page table versus the " "pv_entry scheme: Linux uses permanent page " "tables that are not thrown away, but does not need a pv_entry for each potentially mapped pte. FreeBSD uses throw-away page tables but adds in a pv_entry structure for " "each actually-mapped pte. I think memory utilization winds up being about " "the same, giving FreeBSD an algorithmic advantage with its ability to throw " "away page tables at will with very low overhead." msgstr "" #. (itstool) path: question/para #: article.translate.xml:823 msgid "" "Finally, in the page coloring section, it might help to have a little more " "description of what you mean here. I did not quite follow it." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:829 msgid "" "Do you know how an L1 hardware memory cache works? I will explain: Consider " "a machine with 16MB of main memory but only 128K of L1 cache. Generally the " "way this cache works is that each 128K block of main memory uses the " "same 128K of cache. If you access offset 0 in main " "memory and then offset 128K in main memory you can wind up throwing away the " "cached data you read from offset 0!" msgstr "" #. (itstool) path: answer/para #: article.translate.xml:837 msgid "" "Now, I am simplifying things greatly. What I just described is what is " "called a direct mapped hardware memory cache. Most modern " "caches are what are called 2-way-set-associative or 4-way-set-associative " "caches. The set-associativity allows you to access up to N different memory " "regions that overlap the same cache memory without destroying the previously " "cached data. But only N." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:845 msgid "" "So if I have a 4-way set associative cache I can access offset 0, offset " "128K, offset 256K and offset 384K and still be able to access offset 0 again and " "have it come from the L1 cache. If I then access offset 512K, however, one " "of the four previously cached data objects will be thrown away by the cache." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:851 msgid "" "It is extremely important… extremely important for most " "of a processor's memory accesses to be able to come from the L1 cache, " "because the L1 cache operates at the processor frequency. The moment you " "have an L1 cache miss and have to go to the L2 cache or to main memory, the " "processor will stall and potentially sit twiddling its fingers for " "hundreds of instructions' worth of time waiting for a " "read from main memory to complete. Main memory (the dynamic RAM you stuff " "into a computer) is slow when compared to the speed of " "a modern processor core." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:863 msgid "" "Ok, so now on to page coloring: All modern memory caches are what are known " "as physical caches. They cache physical memory " "addresses, not virtual memory addresses. This allows the cache to be left " "alone across a process context switch, which is very important." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:869 msgid "" "But in the UNIX world you are " "dealing with virtual address spaces, not physical address spaces. Any " "program you write will see the virtual address space given to it.
The actual " "physical pages underlying that virtual address space " "are not necessarily physically contiguous! In fact, you might have two pages " "that are side by side in a processes address space which wind up being at " "offset 0 and offset 128K in physical memory." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:878 msgid "" "A program normally assumes that two side-by-side pages will be optimally " "cached. That is, that you can access data objects in both pages without " "having them blow away each other's cache entry. But this is only true if the " "physical pages underlying the virtual address space are contiguous (insofar " "as the cache is concerned)." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:885 msgid "" "This is what Page coloring does. Instead of assigning random physical pages to virtual addresses, which may result in non-" "optimal cache performance, Page coloring assigns reasonably-" "contiguous physical pages to virtual addresses. Thus programs can " "be written under the assumption that the characteristics of the underlying " "hardware cache are the same for their virtual address space as they would be " "if the program had been run directly in a physical address space." msgstr "" #. (itstool) path: answer/para #: article.translate.xml:895 msgid "" "Note that I say reasonably contiguous rather than simply " "contiguous. From the point of view of a 128K direct mapped " "cache, the physical address 0 is the same as the physical address 128K. So " "two side-by-side pages in your virtual address space may wind up being " "offset 128K and offset 132K in physical memory, but could also easily be " "offset 128K and offset 4K in physical memory and still retain the same cache " "performance characteristics. So page-coloring does not " "have to assign truly contiguous pages of physical memory to contiguous pages " "of virtual memory, it just needs to make sure it assigns contiguous pages " "from the point of view of cache performance and operation." msgstr ""
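To close with a concrete version of the arithmetic above: for the 128K direct-mapped cache in the example and 4K pages there are 32 page colors, and a colored allocator simply prefers a free physical page whose color matches that of the virtual page being mapped. The names and the per-color free lists below are invented for illustration, not taken from the kernel.

#include <stddef.h>

#define PAGE_SIZE	4096u
#define CACHE_BYTES	(128u * 1024u)	/* example: 128K direct-mapped cache */
#define NCOLORS		(CACHE_BYTES / PAGE_SIZE)	/* 32; divide by the number of ways for a set-associative cache */

struct physpage {
	unsigned	 pfn;		/* physical page frame number */
	struct physpage	*next;
};

/* Free pages, bucketed by color (pfn % NCOLORS). */
static struct physpage *free_by_color[NCOLORS];

/*
 * Allocate a physical page for the given virtual page index, preferring a
 * page whose cache color matches so that pages which are adjacent in
 * virtual memory do not land on top of each other in the cache.
 */
struct physpage *
alloc_colored(unsigned vpage_index)
{
	unsigned want = vpage_index % NCOLORS;

	for (unsigned i = 0; i < NCOLORS; i++) {
		unsigned c = (want + i) % NCOLORS;	/* fall back to nearby colors */

		if (free_by_color[c] != NULL) {
			struct physpage *pg = free_by_color[c];

			free_by_color[c] = pg->next;
			return (pg);
		}
	}
	return (NULL);			/* no free pages at all */
}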