USERS GUIDE AND TUTORIAL FOR PC-GenoGraphics
Version I

Ray Hagstrom, Ross Overbeek, and Morgan Price
Argonne National Laboratory
Argonne, IL 60439

CHAPTER 1 GETTING READY
CHAPTER 2 INSTALLATION
CHAPTER 3 USING PC-GenoGraphics AS A VIEWING TOOL
CHAPTER 4 USING PC-GenoGraphics AS A DATA DISPLAY TOOL
CHAPTER 5 USING PC-GenoGraphics AS A DATA SEARCH TOOL
CHAPTER 6 SEARCHING SEQUENCE
CHAPTER 7 USING PC-GenoGraphics AS AN INTERACTIVE LOGBOOK
CHAPTER 8 USING PC-GenoGraphics AS A VISUAL REASONING TOOL FOR SEQUENCE
CHAPTER 9 WORKING REPETITIVE PROCEDURES FROM SCRIPT FILES
CHAPTER 10 PRINTING THINGS AND DOING OTHER THINGS
CHAPTER 11 THE EASY WAY TO START YOUR OWN *.ALL FILE
CHAPTER 12 GORY DETAILS ABOUT *.ZWD FILES AND *.ALL FILES AND *.UPD FILES

FIGURE 1 Screen Requesting Identification of Best Graphics Mode. This screen comes up from GGSETUP after initial questions identifying the source and target disk names and directories. The correct response is to type D to allow you to investigate another possible description of the video adapter in your PC. You should keep trying new video adapters until you find the one which works best with your machine.

FIGURE 2 Screen with Candidate Video Adapter Descriptions. Presented by GGSETUP after you have identified the video mode to be investigated. Use the up and down arrow keys to highlight your next guess at the identity of your video adapter. Hit to investigate the applicability of the highlighted mode.

FIGURE 3 Screen Showing Performance of Video Mode under GGSETUP Investigation. Your screen should look very much like this if you are investigating some video mode which operates successfully on your machine. You should move the cursor into and out of the blinking square and the legend at the top should change. Black/white video modes will not produce the square of 16 brightly colored blocks. If the blocks are produced, there should be 16 distinct colors visible.
YOU CAN ALWAYS EXIT FROM THIS SCREEN (EVEN IF THE DISPLAY IS INACCURATE) BY HITTING

FIGURE 4 Second Screen Showing Performance of Video Mode under GGSETUP Investigation. These tiles are drawn shortly after the previous screen is exited. This screen will spontaneously disappear after a few seconds, and the user will be asked to confirm whether the video mode under investigation is performing adequately. If you answer 'N', the screen depicted in Figure 1 will reappear and you can make another trial. IF YOU ANSWER 'Y', INSTALLATION WILL PROCEED WITH THE PRESUMPTION THAT THE VIDEO MODE WHICH YOU JUST SAW IS THE BEST ONE.

FIGURE 5 Full-screen Display of E.coli dataset distributed with some copies of PC-GenoGraphics.

FIGURE 6 Full-screen Display of GDB human dataset distributed with some copies of PC-GenoGraphics.

FIGURE 7 Full-screen Display of HIV virus dataset distributed with some copies of PC-GenoGraphics.

FIGURE 8 Full-screen Display of AATEST dataset distributed with some copies of PC-GenoGraphics. This dataset is of no biological interest, but should be studied thoroughly to understand PC-GenoGraphics.

FIGURE 9 Information Databox Attached to Object m3o1 in AATEST. To fetch this databox you should position the cursor over object m3o1 and click any mouse button. If you do not have a mouse attached to your PC, you can navigate the cursor by holding and pressing the arrow keys; the equivalent of clicking a mouse button is .

CHAPTER 1 GETTING READY

Introduction

This short chapter tells you what sorts of resources you need to exploit PC-GenoGraphics and, perhaps more important for the beginner, what sorts of resources you do NOT need. You will learn enough about what is desirable in a PC to make an intelligent choice if you have more than one possible PC on which to mount PC-GenoGraphics. After completing this chapter you will be able to install PC-GenoGraphics onto your PC.
PC-GenoGraphics is an integrated software package which allows what we call Visual Reasoning to be performed on genomic data which have been properly prepared. Included in PC-GenoGraphics is a set of properly prepared descriptions of some sample organisms, together with tools which allow the user to custom-prepare whatever genomic data are at hand. There are two principal subdivisions of PC-GenoGraphics: the programs themselves (which manipulate the genomic data, but which contain no specific information about any genome) and the data (which contain all of the information specific to the genome, but which cannot be visualized without the aid of the programs).

PC-GenoGraphics is designed to function as well as possible on the most primitive PCs and is, furthermore, able to exploit the special capabilities of the most advanced units. The idea is that you might run PC-GenoGraphics as an Interactive Logbook, visualizing and querying your personal data, on some very primitive PC in your wetlab where graphics performance is not critical, while you might run it on a high-powered machine to do complex queries on unified datasets concerning the organization of genomes as a whole. The amount of information which you can learn from PC-GenoGraphics rises with the quality of the PC you place it on. In general, you want to pick the PC with the best available color CRT screen and several MB of free disk storage space.

MOST MODERN PCS ARE MORE THAN ADEQUATE TO RUN PC-GenoGraphics. IT IS RECOMMENDED AT THIS POINT THAT YOU SKIP IMMEDIATELY TO CHAPTER 2 AND TRY A BLIND INSTALLATION WITHOUT REFERRING TO THE FORMAL TECHNICAL DISCUSSION WHICH FOLLOWS BELOW, RETURNING TO SLUG OUT THE DETAILS ONLY IF THERE ARE PROBLEMS WITH THE BLIND INSTALLATION.

System Requirements

1.) PC-GenoGraphics runs exclusively on DOS-based PC machines (IBM PCs or compatibles); these machines comprise about 90% of all computers in the world.
Most notably, it does NOT run on Apple, Macintosh, or UNIX machines of any kind. If you have any doubt whether your intended machine carries DOS:

The answer is certainly YES if it responds to your typing at the command line:

    VER

with a message like:

    MS-DOS Version 4.01

The answer is certainly YES if your machine is running Microsoft Windows.

The answer is most likely YES if you have programs like Lotus-123, Excel, Paradox, WordPerfect, Word, DB, Harvard Graphics, Applause, Hollywood, Quattro, WordStar, Corel Draw, etc.

If you are still in doubt, look at the package from any software which is installed on the PC; if it is intended for IBMs, Compatibles, or Clones, then the answer is certainly YES.

2.) PC-GenoGraphics is intended to run on the vast majority of DOS-based PC machines, but it has only been tested thoroughly on machines with PC-DOS 3.30, MS-DOS 4.01, and DR-DOS 6.0. We believe that PC-GenoGraphics should run on machines which have the following operating system installations:

    MS-DOS Version 3.**    PC-DOS Version 3.**    DR-DOS Version 5.**
    MS-DOS Version 4.**    PC-DOS Version 4.**    DR-DOS Version 6.**
    MS-DOS Version 5.**    PC-DOS Version 5.**
    MS-DOS Version 6.**    PC-DOS Version 6.**

The * symbols above mean "wildcard" and can be matched by any digit; thus "MS-DOS Version 4.01" is OK while "MS-DOS Version 2.10" is not likely to work with PC-GenoGraphics. To find out exactly what DOS version you have installed on your PC you must get to the DOS prompt and type:

    VER

3.) While a hard disk is not strictly required for PC-GenoGraphics to operate in principle, operating strictly from floppies makes performance so slow as to compromise scientific utility for all but the very smallest genomes. We think it is practical to do only the most modest Interactive Logbook functions without a hard disk. If you are interested in getting a floppies-only distribution, get in touch with us.

4.)
A certain amount of RAM is required to work PC-GenoGraphics as well; the present minimum is about 520K. You can always tell how much free RAM there is by getting to the DOS prompt and typing:

    CHKDSK

The response will look something like this:

    Volume SYSTEM DISK created 06-23-1989 12:28p
    Volume Serial Number is 180A-1A2E

      33431552 bytes total disk space
         73728 bytes in 3 hidden files
        174080 bytes in 77 directories
      30570496 bytes in 1539 user files
        266240 bytes in bad sectors
       2347008 bytes available on disk

          2048 bytes in each allocation unit
         16324 total allocation units on disk
          1146 available allocation units on disk

        653312 total bytes memory
        549248 bytes free

Your "bytes free" line is the amount of available RAM. If you do not actually have enough "bytes free", you can ALWAYS get nearly the amount shown in your "total bytes memory" line by the following procedure or a simple variant.

Get to the DOS prompt and move to the boot directory, usually by typing:

    CD C:\
    C:

Next YOU MUST SAVE the system files:

    COPY CONFIG.SYS CONFIG.OLD
    COPY AUTOEXEC.BAT AUTOEXEC.OLD

Next create new system files (COPY CON takes its input from the keyboard; finish the input with Ctrl-Z followed by Enter):

    COPY CON CONFIG.GG
    FILES = 15
    BUFFERS = 15
    COPY CON AUTOEXEC.GG
    COPY CONFIG.GG CONFIG.SYS
    COPY AUTOEXEC.GG AUTOEXEC.BAT

Next re-boot your system and, when everything has settled down again, the CHKDSK command should reveal the largest practical amount of RAM which can be accessed on your machine.

If this new amount is adequate, you should work out a compromise with your collaborators to "kick out as much of the resident codes and devices as is required to free up the memory". Your negotiating position starts from here: the only devices which you can even use while running PC-GenoGraphics are a so-called "disk-caching routine" such as SUPERPCK, Vcache, Cache86, DCACHE, Lightning, or FAST TRAX (all of which are recommended for speed) or a mouse (which is optional at that); absolutely no TSR of any kind is required for PC-GenoGraphics.
If, on the other hand, the new value is still inadequate to meet the requirements of PC-GenoGraphics, you must try another machine.

IN ANY EVENT, YOU SHOULD REVERSE THE REVISION OF YOUR OPERATING SYSTEM FILES BY TYPING:

    COPY CONFIG.OLD CONFIG.SYS
    COPY AUTOEXEC.OLD AUTOEXEC.BAT

5.) Finally, there is the question of CRT display/hardcopy graphics performance. Virtually any graphics monitor and adapter combination should work at some level with PC-GenoGraphics. Most graphics monitor and adapter combinations (other than those sold directly on IBM brand PCs) can work at truly superior performance if you know how to describe your combination during the GGSETUP procedure for PC-GenoGraphics. Most graphics monitor and adapter combinations can be exploited fully by PC-GenoGraphics. Our GGSETUP procedure is sufficiently robust that you can figure out by trial and error what graphics monitor and adapter combination you actually have. Your installation will go more smoothly, however, if you take the time to look up the brand of whatever video adapter card is present in your machine and, best of all, determine how much video-RAM is present on your board.

IF YOU HAVE A TRUE IBM BRAND MACHINE, TO WHICH NO UPGRADE OF THE VIDEO ADAPTER CARD HAS BEEN MADE, YOU WILL BE STUCK WITH ONE OF THE STANDARD VIDEO PROTOCOLS AND THERE WILL BE NO USE IN TRYING TO GET IMPROVED VIDEO PERFORMANCE; THE GOOD NEWS PART OF THIS IS THAT YOUR INSTALLATION PROCEDURE WILL BE SIMPLE.

Summary of Requirements:

1.)IBM PC, XT, AT, PS/2, or 100% compatible
2.)DOS version 3 or later
3.)520K Minimum free RAM
4.)Some CRT graphics monitor and video adapter (it will help, but it is not necessary, to know which kind you have)
5.)Floppy disk drive to load distribution disks

Summary of Recommended Capabilities

1.)Hard disk with several MB free space
2.)Disk cache software such as SUPERPCK, Vcache, Cache86, DCACHE, Lightning, or FAST TRAX.
3.)Good quality color graphics such as SVGA 1024x768 color display with at least 256KB V-RAM; 1MB is best.
4.)At least 1MB of RAM with EMM driver installed in the system.
5.)Mouse with driver installed into system
6.)Laser Printer (HP-LJ2 or compatible)

As of this writing, complete systems with more than ample computing speed (16 MHz 386SX) and full graphics capabilities are selling for almost exactly $1000 at aggressive mail-order distributors while laser printers are going for about $600.

Summary of what is NOT REQUIRED and does NOT provide ANY advantage to the operation of PC-GenoGraphics:

1.)Microsoft Windows
2.)Network Connections of any kind
3.)TSR routines of any kind (except disk caching software)
4.)Optical Disk Drive
5.)IBM's 8514A video adapter

CHAPTER 2 INSTALLATION

This chapter tells how to move PC-GenoGraphics from the distribution floppy disks onto your PC. After completing this installation procedure, you will be ready to operate PC-GenoGraphics at its full range of query and display functions.

MOST EXPERIENCED USERS WILL KNOW HOW TO MOUNT PC-GenoGraphics WITHOUT GOING THROUGH THIS CHAPTER. INSERT DISK1 INTO YOUR FLOPPY DRIVE, SET DEFAULT TO THAT DRIVE, AND TYPE GGSETUP. IF THIS PROCEEDS REASONABLY, YOU MAY SKIP IMMEDIATELY TO CHAPTER 3 AND RESUME THE TUTORIAL.

The PC-GenoGraphics distribution contains programs and data. The programs allow visualization of whatever datasets are loaded, but the programs have no information specific to any organism. All organism-specific information is contained within the datasets. Datasets for tiny genomes such as viruses require less disk space than the programs. Datasets for larger organisms such as E.coli and humans are much larger than the programs. The user exercises discretion over which datasets to install. If available disk storage space is a limitation for your PC, you may choose to omit installing the larger datasets.
To install PC-GenoGraphics:

1.)Start your PC

2.)Get to the DOS prompt (totally exit from Windows or any other shell, if present) so that you get a prompt like this:

    C:>

3.)Place the distribution disk #1 into a proper floppy drive (typically drive A: or drive B:). If you are in doubt as to which letter applies, type

    DIR A:

and if the little light on the drive with the disk inserted lights up, you have disk A:. If not, you may get the abominable DOS message:

    Not ready reading drive A
    Abort, Retry, Fail?

in which event you should type A to abort and then try

    DIR B:

etc. until you have located the proper floppy drive name. WE WILL CONTINUE THIS DISCUSSION AS IF THE INSTALLATION HAPPENS FROM DRIVE B:, but you must use the correct letter from your system.

4.)Next we will establish what part of the hard-disk (typically C:, D:, or E:) will take the installed copy of PC-GenoGraphics. To find out how much free disk space exists on hard-drive C: type

    CD C:\
    C:
    CHKDSK

The response should look roughly like this:

    Volume SYSTEM DISK created 06-23-1989 12:28p
    Volume Serial Number is 180A-1A2E

      33431552 bytes total disk space
         73728 bytes in 3 hidden files
        174080 bytes in 77 directories
      30570496 bytes in 1539 user files
        266240 bytes in bad sectors
       2347008 bytes available on disk

          2048 bytes in each allocation unit
         16324 total allocation units on disk
          1146 available allocation units on disk

        653312 total bytes memory
        549248 bytes free

Here, the crucial line is the one labelled "bytes available on disk". The exact amount of available disk space which you require depends strongly upon how you plan to use PC-GenoGraphics, but the wisest choice is to repeat the above procedure for D:, E:, etc. and choose the one with the largest "bytes available on disk". REMEMBER THIS LETTER.
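If you need to compare several drives or machines, the two figures of interest in a CHKDSK report can be picked out mechanically. The following is only an illustrative sketch in Python, which is not part of PC-GenoGraphics; the function name `parse_chkdsk` is hypothetical, the report text is the sample shown above, and the 520K RAM minimum is interpreted here as 520 x 1024 bytes, which is an assumption.

```python
import re

# Sample report copied from the CHKDSK listing above.
SAMPLE_REPORT = """\
Volume SYSTEM DISK created 06-23-1989 12:28p
Volume Serial Number is 180A-1A2E

  33431552 bytes total disk space
     73728 bytes in 3 hidden files
    174080 bytes in 77 directories
  30570496 bytes in 1539 user files
    266240 bytes in bad sectors
   2347008 bytes available on disk

      2048 bytes in each allocation unit
     16324 total allocation units on disk
      1146 available allocation units on disk

    653312 total bytes memory
    549248 bytes free
"""

# Assumption: "520K" means 520 * 1024 bytes.
MIN_FREE_RAM_BYTES = 520 * 1024


def parse_chkdsk(text):
    """Map each label in a CHKDSK report to the number that precedes it."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\s+(.*\S)\s*$", line)
        if match:  # lines such as "Volume ..." do not start with a number
            fields[match.group(2)] = int(match.group(1))
    return fields


report = parse_chkdsk(SAMPLE_REPORT)
disk_free = report["bytes available on disk"]
ram_free = report["bytes free"]

print("Free disk space (bytes):", disk_free)
print("Free RAM (bytes):", ram_free)
print("Enough RAM for PC-GenoGraphics:", ram_free >= MIN_FREE_RAM_BYTES)
```

Running the same function over the report for each drive letter makes it easy to see which one has the largest "bytes available on disk".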
In any event you will want at least 1 megabyte for the minimal installation of PC-GenoGraphics, at least 2 MB more for the minimum with E.coli, 4 MB more for the minimum with the GDB Human maps, and 14 MB more for the complete human system including sequence.

If you get a message like:

    Invalid drive specification

in response to the above procedure, that is a sign that there is no opportunity to use that letter at all. The most common situation is for C: to be the only usable letter.

If "bytes available on disk" is less than your desired number of MB for all available letters, you can still always get PC-GenoGraphics to run by deleting un-needed files from the hard-drive. These do not need to be lost permanently provided you use the BACKUP command to save them to floppies first. BACKUP and ERASE files from the hard-disk until the CHKDSK command shows enough "bytes available on disk".

5.)Now type

    B:
    GGSETUP

At this point you will start the installation procedure. There will be the usual sorts of queries asking for the drive designator (typically A or B) for the source floppy, the destination hard drive designator for the installation (typically C or D), and the destination subdirectory (typically GG). The meaning of the destination subdirectory is that all programs and distribution data will be installed into the disk region which is reached by

    CD C:\GG
    C:

or

    CD D:\GG
    D:

After this comes a somewhat mysterious question about "high-contrast" displays. The answer to this is almost always no, "N". The circumstances which indicate a "yes" answer are when you are using a PC which has a "color" display but which exhibits these "colors" only as distinct shades of gray, or actual color displays such as transparency projection panels where dark colors are difficult to differentiate under bad lighting conditions. Usually laptop computers or other systems with high-quality LCD panel displays use this standard.
What does NOT indicate a "yes" answer is a true monochrome display (one in which only two colors are present: black&white, black&green, or black&amber, for instance).

Now come three informative screens containing abbreviated information about the installation and operation of PC-GenoGraphics. Each of these requires an answer of 'Y' before progress continues.

Next will come up the graphic devices query panel shown in Fig.1. In order to identify the best performing graphics mode, you will select menu item "D". This will open up a large, multi-screen display, like the one shown in Fig.2, which you navigate using the arrow keys until a promising video mode is highlighted. You then select it with the key. After a few seconds, a small box from MetaWindows will appear in the middle of your screen, to be followed a few seconds later by a display like that shown in Fig.3.

AT THIS POINT, YOU CAN ALMOST ALWAYS CONTINUE THE INSTALLATION BY SIMPLY HITTING: Y AND PICKING UP THE TUTORIAL AT CHAPTER 3.

The only opportunity you would be passing up is that you would temporarily be stuck with operating merely at the best IBM-STANDARD video mode accessible to your machine. It may be that your machine's best operation is no better than this, but almost all machines less than 5 years old (except those with the IBM brand) perform much better than this. You can, of course, always operate with the default mode, re-installing to get the best performance from your machine at some later time.

If the candidate video mode you have selected is sufficiently far from the hardware configuration actually present on your machine, you may have a totally different presentation, perhaps a uniform blank screen, perhaps a chaotic display of blinking patterns. In this case, you want to return to select another candidate video mode. This can almost always be done by hitting ; the visual image may start fluttering a bit and the menu will come back in a few more seconds.
In the most extreme cases of unfortunate guesses at the identity of your video adapter, your PC can actually "lock up", requiring a re-boot: either simply enter the keyboard command , or hit the PC control panel button "Reset", or turn the power to the PC off and then back on.

Your goal here is to get that video mode which produces a proper screen like Fig.3 and which has the highest resolution. A mediocre resolution is 320x240; a high resolution is 1024x768. The number of colors desired is at least 16 (unless you actually have a Black&White monitor). NO BENEFIT IS OBTAINED from selecting a 256 color mode over a 16 color mode, so you should take any other choice with 16 colors that has higher resolution. You are allowed arbitrarily many attempts to find the best performing video mode.

At this point, you should test the mouse (if any) to verify that it moves the cursor.

How to Wake up your Mouse if it is Dead:

A MOUSE OR TRACKBALL POINTING DEVICE IS NOT REQUIRED TO USE PC-GenoGraphics, BUT IT FACILITATES ACCESS SIGNIFICANTLY, ESPECIALLY FOR THE LESS COMPUTER-EXPERIENCED USER.

If the cursor does not respond to mouse movement, you will need to activate the mouse and/or install the mouse into your system. First, exit GGSETUP by hitting and when the query about endorsing the performance comes up, hit again. When the DOS prompt appears, type whatever command activates the mouse on your system, typically:

    MOUSE

Some "feel-good" message about "mouse successfully installed" should come up. If not, or if after repeating the GGSETUP procedure, testing the mouse still yields no response, you will want to install the mouse onto your system. This is done by editing the infamous file CONFIG.SYS to insert a line directing that the description of the mouse be included in your operating system.
The new line is typically something like:

    DEVICE = C:\DOS\MSMOUSE.SYS

or

    DEVICE = C:\DOS\PS2MOUSE.SYS

After editing CONFIG.SYS, you must re-boot your machine to place the changes into effect.

Once you have found the best performing video mode, you answer "Y" to the question about endorsing the performance and the rest of the installation is straightforward. The only remaining options are to decide which of the datasets included in the distribution are to be installed on your hard-disk. These are straightforward "Y" or "N" answers which are dictated by your needs and the space available on your disk. The datasets presently distributed with PC-GenoGraphics are:

    GDB       Mapping data from the Human Genome as of HGM-11
    HC21      Mapping data plus sequence data for Human Chromosome 21
    ECOLI_R1  Kenn Rudd's curated version of Kohara's famous mapping of E.coli together with sequence placed on the map.

At this point PC-GenoGraphics should be installed onto your system. Pitfalls which indicate the contrary include odious messages about your disk being full. Obviously, nothing PC-GenoGraphics can do will relieve disk space congestion, and we recommend using BACKUP and ERASE to move unnecessary files off your hard-disk. Once you have cleared enough disk space, simply start our PC-GenoGraphics installation procedure over again from the beginning.

There are things which you can do to relieve disk crowding which do not require deleting any files. There are so-called disk-compression programs such as DiskMax and SuperStore which are available at any reasonable PC software vendor, typically priced in the $50 range or cheaper. These typically allow twice as much information to be stored on any given hard-disk with at worst a modest loss of performance.

CHAPTER 3 USING PC-GenoGraphics AS A VIEWING TOOL

This chapter teaches you how to select a chromosome for viewing, how to enter elementary commands to set the field of view, and how to exit PC-GenoGraphics.
PC-GenoGraphics has comprehensive built-in context-sensitive help. The bottom of the PC screen always displays some context-sensitive hint. In addition, errors cause more extensive help messages to appear. Every menu has a help item attached.

EXPERIENCED USERS MAY BE ABLE TO DO THIS ON THEIR OWN AND MIGHT RESUME THE TUTORIAL AT CHAPTER 4.

The most elementary use for PC-GenoGraphics is to visualize genomic information. The basic technique in dealing with genomic data is to take advantage of the linear organization of chromosomes. We will universally represent the length of the genome as a horizontal stripe across most of the video display with 5' at the left and 3' at the right of the display. Various information may be represented by different stripes at distinct heights on the screen, but they all cover the same horizontal space. Fig.5 provides a specific example of data included in Kenn Rudd's compilation of data on the well-mapped E.coli organism, Fig.6 provides a specific example of data present in GDB and GenBank concerning the much less accurately mapped human genome, while Fig.7 provides a specific example of the completely sequenced genome of an HIV virus.

You will notice immediately the principal obstacle in graphic display of genomic data: even the tiny virus has nearly ten thousand basepairs in its genome, while even the most advanced graphics displays have fewer than one thousand pixels available to display this length of data. The time-honored method of dealing with this sort of discrepancy is to enable the user to "zoom" in to fine scales when fine detail is required while still being able to search around for data which may be "offscreen" because of zooming activity. PC-GenoGraphics has extensive zooming and panning capabilities which comprise the bulk of the pure visualization manipulation available to the user.
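The arithmetic behind this obstacle is easy to check. The sketch below is an illustration in Python, not code taken from PC-GenoGraphics: the function name is hypothetical, and the round figures of 10,000 basepairs and 1,000 horizontal pixels are assumed stand-ins for "nearly ten thousand" and "fewer than one thousand". The view is expressed as a fraction of the whole genome, with the full view being the range from 0 to 1.

```python
from fractions import Fraction


def basepairs_per_pixel(genome_bp, screen_pixels, left=Fraction(0), right=Fraction(1)):
    """Basepairs that one horizontal pixel must represent for the view [left, right].

    left and right are fractions of the whole genome length (0 = one end, 1 = the other).
    """
    if not (0 <= left < right <= 1):
        raise ValueError("view must satisfy 0 <= left < right <= 1")
    return genome_bp * (right - left) / screen_pixels


GENOME_BP = 10_000      # assumed round figure for a small viral genome
SCREEN_PIXELS = 1_000   # assumed round figure for the display width

# Full view: every pixel has to stand for several basepairs.
full_view = basepairs_per_pixel(GENOME_BP, SCREEN_PIXELS)

# Zoomed view covering the middle tenth of the genome.
zoomed_view = basepairs_per_pixel(GENOME_BP, SCREEN_PIXELS, Fraction(9, 20), Fraction(11, 20))

print("basepairs per pixel, full view:  ", float(full_view))
print("basepairs per pixel, zoomed view:", float(zoomed_view))
```

Under these assumed figures, individual basepairs only become distinguishable once the view has been narrowed to about a tenth of the genome; for larger genomes far deeper zooming is needed.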
We will explore how to use the zooming capability of PC-GenoGraphics by explicit tutorial example on the artificial system AATEST which is included in all distributions of PC-GenoGraphics. First, we need to get AATEST onto the screen.

1.)Activate PC-GenoGraphics:

    CD C:\GG      (or whatever you actually used)
    C:
    GG

At this point a screen with an open menu should appear.

BELOW, KEYSTROKES ARE GIVEN, BUT YOU CAN (AND SHOULD) DO EXACTLY THE SAME THINGS WITH THE MOUSE (IF YOU HAVE ONE) BY POINTING AT THE LABELLED BUTTONS OR MENU ITEMS WHICH HAVE THE APPROPRIATE CAPITALIZED AND UNDERLINED LETTERS IN THEIR LABELS:

2.)Choose file to view: The following command picks the default directory and then the top file named in the list and chooses not to load any "Update File". It also chooses to load all of the maps in the selected file.

    G N G

At this point a screen looking like Fig.8 should appear. You may now skip down to item 3.) and continue the tutorial without loss.

In general, this procedure can be modified by standard "look and feel" techniques to alter the selected conditions:

    D:\MYDIR\MYSUB

replaces the first command above if you want to look at files in that particular subdirectory.

To select other than the first file in the list, you must promote your desired filename to lie within the small box above the list. This is done by navigating the cursor to point at the desired filename within the large box (using the slider bar at the right to scroll the list if necessary) and clicking the cursor on the desired filename.
When the desired filename has been promoted, it is endorsed by hitting the "Go" button:

    G

After this, a list of potential update files is presented and the desired update filename, if any, is likewise promoted and is endorsed by hitting the "Go" button:

    G

If no update file is desired, simply hit the "None" button:

    N

If a new update file is desired, hit the "Anew" button and type in its name:

    A newname

At this point you will be presented with a list of maps which are contained in the file you have selected. These represent unified classes of data items attached to that file. Any combination of these maps (other than none at all) may be selected for viewing. Unselected maps play no further role in the session.

You can always reinitiate this entire file selection procedure by invoking the File option in the Files menu.

    F F

3.)Practice Keyboard Zoom:

    Z Z K 0.20 0.60

Notice that this zooms in on the range [0.20,0.60] where the whole possible length of the "chromosome" under study is always [0.0,1.0]. Keyboard entry is the most cumbersome and the most precise mechanism for specifying a zoomed field of view. Try one or two more keyboard zooms of your own choosing.

Notice two features of the zoomed views. First, irregular shapes such as triangles are always drawn to fit within the boundaries of the screen regardless of whether their full length hangs outside the screen width. Second, when a shape hangs outside the screen width, this fact is indicated by the addition of a "continuation arrow" drawn off-scale (in cerise on color displays, otherwise in black) in the direction of the overflow. This protocol obtains no matter what method is used to invoke the zoomed view.

4.)Practice Unzoom:

    Z U

Each time you do this you should recover from memory the zoomed view you had once further in the past (up to a maximum of 20). Notice on the menu where the Unzoom option is finally represented, that there is a "U" in parentheses.
This means that you actually did not need to invoke the menu explicitly; rather, "U" is a so-called "hot-key": you could have typed a "U" while viewing the screen when no menus of any kind obtrude into the viewing area and the Unzoom would have been implemented.

5.)Practice Rezoom

    Z R

This precisely undoes the action of the last Unzoom. "R" is the hot-key for this command.

6.)Practice fullView zoom

    Z V

This always shows the full view regardless of previous zoom history. "V" is the hot-key.

7.)Practice Mouse zoom

    Z Z M

Notice that when the item is selected, the cursor shape changes from an arrowhead to a cross. THIS IS HOW YOU CAN TELL IF A MOUSE EVENT IS AWAITED TO DEFINE A ZOOM WINDOW. When the cross cursor is obtained, you can move the cursor to one side of the range which you want to view, click the mouse button once, move the cursor to the other side of the range which you want to view, and click the mouse once more. "M" is the hot-key for this option. ("M" is the most important hot-key to remember, by far.)

IF YOU DO NOT HAVE A MOUSE: You can still activate mouse functions (although in a somewhat cumbersome manner) by holding the key while pressing the arrow keys for motion or holding the key while hitting the key instead of pressing the mouse button.

8.)Practice panning: First zoom to full screen:

    V

Notice the little box with horizontal stripes in the menu bar at the top of your screen. Watch this box while you zoom in to the range [0.20,0.60] using keyboard zoom as above:

    Z Z K 0.20 0.60

Notice that the bright stripes (which represent the fraction of the whole genome visible at present) have been narrowed, being partially blacked out on the left and on the right. Notice that 20% of the left of the stripes is obscured and 40% of the right. This protocol obtains no matter what method is used to invoke the zoomed view.

Now pan to the right (hot-key ):

    Z Z R

The field under view should jump over 25% of its width to the right.
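The zoom, Unzoom, and pan behaviour described in steps 3 through 8 can be modelled with a few lines of arithmetic on the range [0.0,1.0]. The following Python sketch is only a model for checking your expectations, not the PC-GenoGraphics implementation. The class name `View` and its methods are hypothetical; the 20-view history limit and the 25% pan step come from the text, while the decisions to clamp a pan at the ends of the genome and to leave pans out of the Unzoom history are assumptions.

```python
from fractions import Fraction

MAX_HISTORY = 20            # Unzoom remembers up to 20 earlier views
PAN_STEP = Fraction(1, 4)   # a pan moves the view by 25% of its own width


class View:
    """Field of view on a genome whose whole length is the range [0, 1]."""

    def __init__(self):
        self.left, self.right = Fraction(0), Fraction(1)
        self.history = []

    def zoom(self, left, right):
        """Zoom to [left, right]; pass decimal strings such as "0.20" for exact values."""
        left, right = Fraction(left), Fraction(right)
        if not (0 <= left < right <= 1):
            raise ValueError("zoom range must satisfy 0 <= left < right <= 1")
        self.history.append((self.left, self.right))
        del self.history[:-MAX_HISTORY]   # forget views older than the limit
        self.left, self.right = left, right

    def unzoom(self):
        """Return to the most recently remembered view, if there is one."""
        if self.history:
            self.left, self.right = self.history.pop()

    def pan(self, direction):
        """Shift the view right (direction=+1) or left (direction=-1)."""
        shift = direction * PAN_STEP * (self.right - self.left)
        # Assumption: the view stops at the ends of the genome.
        shift = max(-self.left, min(shift, 1 - self.right))
        self.left += shift
        self.right += shift

    def obscured(self):
        """Fractions of the genome hidden to the left and to the right of the view."""
        return self.left, 1 - self.right


view = View()
view.zoom("0.20", "0.60")
print("hidden left/right:", [float(x) for x in view.obscured()])
view.pan(+1)
print("after pan right:  ", float(view.left), float(view.right))
```

With the keyboard zoom to [0.20,0.60], the model reports 20% hidden on the left and 40% on the right, matching the stripe box, and a pan to the right moves the 0.40-wide view by 0.10.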
Repeated hits will walk the length of the chromosome.

Now pan to the left (hot-key ):

    Z Z L

To exit PC-GenoGraphics, point at the large button in the upper right of the display with your mouse and click once or simply type the key-stroke whenever no menu items are obtruding into your screen.

CHAPTER 4 USING PC-GenoGraphics AS A DATA DISPLAY TOOL

This chapter teaches you how to visualize the data-box which is attached to visual objects in the PC-GenoGraphics screen. This activity is fairly complicated and all users are recommended to go through this part of the tutorial. After this chapter, you should be able to call up any data-box, to view all of the data which it contains, and to search for specific information within the data-box.

The visual displays in PC-GenoGraphics are organized horizontally as copies of the genome under study. Different data are organized into vertically distinct stripes running the width of the screen. These horizontal stripes are segregated into "maps", each of which consists of some number of what we call "submaps". Placed along the length of each submap are various blobs (which we call "objects") that represent parts of the data. The logical scheme is that objects which are similar are all present on a single map and are distributed among enough submaps in that map to allow all objects to be drawn without actual overlap (simply sharing edges is not an actual overlap).

THUS, EACH DATUM IS ASSOCIATED WITH AN OBJECT WHICH DEFINES THE LOCATION OF THE DATUM ALONG THE LENGTH OF THE CHROMOSOME, AND EACH OBJECT IS ASSIGNED TO SOME PARTICULAR MAP AND SUBMAP TO FACILITATE IDENTIFYING ITS SIGNIFICANCE AND TO ALLOW ITS VISUAL DISTINGUISHABILITY, RESPECTIVELY.

Let us see how to recover the data associated with objects on our visual display. We will again use AATEST:

1.)Activate PC-GenoGraphics:

    CD C:\GG      (or whatever you actually used)
    C:
    GG

At this point a screen with an open menu should appear.
BELOW, KEYSTROKES ARE GIVEN, BUT YOU CAN (AND SHOULD) DO EXACTLY THE SAME THINGS WITH THE MOUSE (IF YOU HAVE ONE) BY POINTING AT THE LABELLED BUTTONS OR MENU ITEMS WHICH HAVE THE APPROPRIATE CAPITALIZED AND UNDERLINED LETTERS IN THEIR LABELS:

2.)Choose the file AATEST to view:
    G N G V
At this point a screen looking like Fig.8 should appear. Notice that there are three separate maps, labelled "DNA1", "RNA1", and "PEP". The first two maps have one submap each while the third map has three submaps.

3.)Now use the mouse (or the combination of and arrow keys) to navigate the cursor to point at the medium sized, triangular object labelled m3o1 in the bottom left quadrant of your screen. Once you have the cursor positioned on this object, press the mouse button (or the combination of and ). This will cause the data attached to that object to be displayed superimposed over the visual image of the maps, submaps, and objects. The screen should look something like Fig.9.

4.)Examine the head-line at the top of the new data-box: This contains the complete name and location of the selected object. NOTICE THAT THE COORDINATES WHICH DEFINE THE POSITION OF THE VIEWED OBJECT ARE IN UNITS IN WHICH THE ENTIRE RANGE OF THE GENOME UNDER CONSIDERATION IS EXACTLY [0.00 , 1.00].

5.)The large scrollable field of several text lines which dominates the data-box contains arbitrary information which is attached to the chosen object. Because of the wide range of possibilities, a number of controls are present to allow you to navigate these data. The most elementary form of motion through the text in the data-box is by "grabbing" the slider handle (the light part) in the slider control to the right in the data-box, holding the mouse button down, and pulling the slider bar upwards or downwards on the screen by moving the mouse. The text will scroll in response to this action.
THIS IS THE MOST CUMBERSOME ACTIVITY IMPOSED UPON USERS LACKING A MOUSE, because they must use the arrow keys to move the highlight to the top or bottom of the large text block; arrow presses which would take the highlight off screen scroll the text block by one line.

From left to right along the bottom of the data-box are buttons:

Move
This allows the user to look through the data-box and to reposition it if desired to get visual access to the underlying maps. To return the data-box to visibility, hit any mouse key or hit . YOU CANNOT CONTINUE ANY ACTION OTHER THAN MOVING THE DATA-BOX UNTIL IT IS RE-MATERIALIZED IN THIS WAY.

<< >>
Notice that the text in the large block is organized in a special hierarchy. Each line in the display is either left-justified or it has a leading blank character. Those lines which are left-justified have, in general, some number of leading ">" characters ranging from zero to three. All lines with leading blanks are to be thought of as continuation lines of the preceding left-justified line. The notion here is that data with one or zero ">" characters in their left-justified line are the most salient and should always be visible. Data with two such ">" characters are intermediate in saliency, while large datasets (such as sequence) which are to be viewed rarely have left-justified lines with three ">" characters. Notice that as the data-box comes up, the "<" button is initially "ghosted out". This means that the "<" button is not pressable. If you press the ">" button once, however, data at the intermediate level of saliency will be interleaved into the display. Hitting ">" again will bring up the sequence data attached to this object. In general, ">" will bring up more detailed versions of the data while "<" will bring up more condensed versions.
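The ">" hierarchy just described can be sketched in a few lines of Python. This is only our illustration of the display rule, not the code used inside PC-GenoGraphics; the function names and the sample text-block are hypothetical.

```python
def saliency_marks(header: str) -> int:
    """Count the leading '>' characters on a left-justified header line."""
    return len(header) - len(header.lstrip(">"))


def visible_lines(lines, max_marks):
    """Return the lines shown when headers with up to max_marks '>' are visible.

    A line starting with a blank is a continuation line and follows the
    visibility of the preceding left-justified header line.
    """
    shown = []
    keep = False  # visibility of the most recent header line
    for line in lines:
        if line.startswith(" "):
            if keep:
                shown.append(line)
        else:
            keep = saliency_marks(line) <= max_marks
            if keep:
                shown.append(line)
    return shown


# Hypothetical text-block, invented for this illustration only.
sample = [
    ">GOLD",
    " a most-salient comment",
    ">>GOLD_PRICES",
    " an intermediate comment",
    ">>>SEQ",
    " ACGUACGU",
]

top_level = visible_lines(sample, 1)     # view when the data-box first opens
after_one_press = visible_lines(sample, 2)
after_two_presses = visible_lines(sample, 3)
```

In this sketch each press of ">" corresponds to raising max_marks by one, and each press of "<" to lowering it by one.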
pgUp pgDn
At whatever ">" level you are viewing the data in a data-box, it is possible that there are too many data to be viewed as a unified text even when the scrollbar to the right of the text-block is used. In this case, the resulting text is paginated, and you can move forward one page with "D" and backwards one page with "U".

Add
Discussion of this feature will be deferred to Chapter 7.

Find
This button activates searching of the data contained in the data-box under view. To demonstrate usage of "Find", we first will get to the top line at the highest level of saliency:
    < < U U U
Now initiate the search:
    F ELVIS
This will find all occurrences of the word "ELVIS" present in the comment as viewed at the most salient level. Notice that there are none, and that you are notified of this lamentable fact by a briefly appearing informative text box. Next we will drop down to the bottom level of saliency and try again:
    > > U U U F ELVIS
Notice that this important peptide feature does appear in our model dataset. Notice also that the top line settles on the first occurrence of "ELVIS" in the dataset under investigation. Here is how to find the next occurrence (if any) of "ELVIS": Navigate the cursor to light up the second line of the large display area in the data-box, and click a mouse button or , then
    F ELVIS
To find yet another occurrence of ELVIS, repeat the above procedure, this time allowing the search to extend across multiple pages which would otherwise require you to issue the command "D" for access:
    F ELVIS O
Notice that the last command orders the search to span the pagination of the text, so that the next occurrence of "ELVIS" would be located, although there are no more in this particular data-box.

dumP
You may want to preserve a copy of certain data contained within a data-box. This command saves the entire contents of all pages of the present data text which would be visible at the present level of ">" and "<" setting. You are queried for the name of the destination file.
    P MORGAN.TXT

Quit
This closes the data-box and allows you to continue other activities on the graphics viewing screen. YOU CANNOT PERFORM ANY ACTIVITIES OUTSIDE THOSE IN THE OPEN DATA-BOX UNTIL YOU HAVE CLOSED IT BY HITTING THE Quit BUTTON.
    Q

CHAPTER 5 USING PC-GenoGraphics AS A DATA SEARCH TOOL

This chapter teaches the user how to identify which PC-GenoGraphics objects are interesting on the basis of their names or the text contents of their data-boxes. After completing this chapter, the user should know how to select sets of objects, how to tell which are selected, how to concentrate on selected objects, and how to search keyword-indexed text data attached to any objects. The user will also know how to restrict the range of searches. This is specialized material and is recommended for all users.

So far, precious little of what we have done is at all specific to the actual data attached to the various objects in our display. In fact, a number of data searching tools are included in PC-GenoGraphics which allow considerable opportunities to probe the data and to generate visual displays in response to those queries. This class of capabilities elevates PC-GenoGraphics above the rank of a mere visualization tool.

Selecting Objects By Name

Our most primitive class of data search is what we call "selecting" some class of objects. This selection process elevates objects to a more visible status and makes them more readily addressable. Let us start with the simplest class of object selection, selection by Name. Our first exercise will be to select two objects by name. Invoke PC-GenoGraphics, and select the file AATEST.ALL, then type:
    S N m1o3 m2o2
Notice that there are two lower-case "o" (not zero) characters in the above. Notice that the two selected objects blink on the screen. Notice also that, after the blinking has settled down, the colors of selected objects are inverted from what they would have been had they not been selected.
You can always get the selected objects to reveal themselves visually, without the trouble of re-drawing the whole screen, by typing the hot-key "B" whenever the viewing screen is unobstructed by menus or information boxes. Note that every other time you hit the hot-key "B", the final state of the selected objects is reversed from the previous final state. Of course, the file AATEST describes objects which are so large on the screen that their names are clearly legible, but more realistic biological systems do not offer this convenience, and selecting objects by name will point out the interesting regions of the chromosome for further investigation.

Zooming on to Selected Objects

Now, while two objects are selected, we can exploit their status and zoom in on them. Issue the commands:
    Z Z S
This zooms in to the smallest screen which will contain ALL of the selected objects. Notice, of course, that the left and right ends of the screen both have selected objects butting up to them. Now issue the commands
    Z Z N
This zooms in so that the full width of the screen is spanned by the Next selected object (next in left to right order), m2o2. This valuable zoom action has hot-key "N". Issuing this command sequence again:
    Z Z N
will zoom in on object m1o3. Similarly, the command sequence
    Z Z P
zooms in so that the full width of the screen is spanned by the Previous selected object, m2o2. Here, the hot-key is "P". This combination of selecting objects by name followed by ZZN, ZZP command sequences facilitates access to the data-boxes for objects whose names are known.

Unselecting Previously Selected Objects

If you continue to select more objects, say, by issuing the commands
    S N m3o3
the newly selected objects are simply appended to the previous list of selected objects.
If you wish to start over, de-selecting all previously selected objects, the commands are:
    S U
Notice that issuing this "Unselect" command does not fold up its underlying menu, instead leaving it open to continue further selection activity. If you simply wish to close this menu box and continue with other activities, the menu command is "Back"
    B
THIS PROTOCOL IS USED ON ALL MENUS IN PC-GenoGraphics. Issue the following commands now, before continuing to the next step in this tutorial.
    B V S U B

Selecting Objects by Contents

Of course, we do not always know the names of objects which contain sought information; we usually know something else about the information. This sort of search is what we call "selecting objects by contents". This is facilitated by a two-level hierarchical index structure for the bulk of data attached to objects. Recall from the description of the organization of the large text-block in a data-box that each line of the text-block is attached to a header line (the preceding left-justified line) which has some number of ">" characters at its head followed immediately by some arbitrary "keyword". Let us start by searching the file AATEST for all objects which contain information under keyword GOLD and which refer to LONDON somewhere else in the chosen header line or its subsequent continuation lines.
    S C Gold LONDON A
The last menu with three pushbuttons (All, Could, Must) defines the range of the genome to search for the desired object. "All" means that the entire genome should be searched regardless of what is visible on the screen. "Must" means to limit the search only to objects which are totally contained within the present screen image. "Could" means to limit the search only to objects of which any part is visible on the present screen. These directives are quite useful in providing PC-GenoGraphics intellectual guidance in speeding up its searching procedures. Notice that three objects are selected by this search, m1o2, m1o3, and m3o1.
They all contain news of gold prices in London. Point the cursor at m1o2, one of these selected objects, and click to reveal the underlying data-box. You will notice the line ">GOLD" at the top level of saliency, but no mention of London is visible at this top level...this is not the source of our successful selection. Hit ">" once, and a comment headed by the line ">>GOLD_PRICES" will appear; this is the line whose continuation lines actually refer to London, and this is the source of our successful selection.

Notice that the keyword found need not be a complete precise match to the keyword sought, but rather that the found keyword can have more letters following after a prefix which is a precise, case-insensitive, match to the sought keyword. I.e. ">>GOLDBUG" will provide a match to sought keywords "GoLd", "goldb", "G", or "", for that matter, but it will NOT match sought keywords "GOLDBAG", "GOLDBUGS", or ">>GOLDBUG". Likewise for the sought string: We could find all objects with any keyword starting with "GOLD" by commands:
    S U C Gold A
while we could find all objects which mention "GOLD" (and "Goldfarb", and "Rhinegold", etc. etc.) QUITE REGARDLESS OF WHAT THEIR HEADLINE SAYS by commands:
    S U C GOLD A
If the data attached to objects are intelligently organized with keywords and regularized spelling, one can already attain considerable navigability in databases which describe genomes where exact sequence placement is not yet important.
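The keyword rule above (a sought keyword matches when it is a case-insensitive prefix of the keyword that follows the ">" marks) can be sketched as follows. This is a minimal illustration written for this guide, assuming that the keyword is the first blank-delimited word after the ">" marks; the function names are hypothetical and do not come from PC-GenoGraphics itself.

```python
def header_keyword(header: str) -> str:
    """Return the keyword of a header line: the first word after the '>' marks."""
    words = header.lstrip(">").split()
    return words[0] if words else ""


def keyword_matches(sought: str, header: str) -> bool:
    """True when `sought` is a case-insensitive prefix of the header's keyword."""
    return header_keyword(header).upper().startswith(sought.upper())


header = ">>GOLDBUG"
matching = [s for s in ["GoLd", "goldb", "G", ""] if keyword_matches(s, header)]
rejected = [s for s in ["GOLDBAG", "GOLDBUGS", ">>GOLDBUG"]
            if not keyword_matches(s, header)]
```

Note that the empty sought keyword matches every header, which agrees with the remark above that "" matches ">>GOLDBUG".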
A beautiful query to do with our distribution of the Human Genome data from GDB (files GDB or HUMAN, if you have them) is to visualize all data relevant to zinc-fingers (more precisely, relevant to zinc in any way) by the commands
    S C zinc A
Another such query, this time against the E.coli database (in files ECOLI_R1 or ECOLI_H1, if you have them) is to locate all objects containing reference to phages:
    S C phage A
The appropriate sites light up along the genome, and are accessible to further investigation using the ZZN and ZZP commands to track down the interesting objects, clicking the cursor on those objects to bring up their data-boxes, and using ">" and "F" commands automatically to scan the text-blocks for the relevant information.

CHAPTER 6 SEARCHING SEQUENCE

This chapter teaches the user how to search arbitrary objects for a wide range of possibly imprecise sequence patterns. The user will learn an entire language to define such queries, built from units which we call "punits". The user will learn how to identify whether DNA, RNA, or peptides are to be searched, how to restrict the range of searches, and how to formulate queries. This material will be new to all users.

So far, there has been no mention of specializations in PC-GenoGraphics which reflect the genomic nature of the data. In fact, it is a general rule that "PC-GenoGraphics knows nothing about genomes". In particular, PC-GenoGraphics has no explicit understanding of the concept of basepairs, or of genes, etc. PC-GenoGraphics does understand keywords and their attached comment lines, and these more universal constructs must serve to convey the more specialized meanings. The exception to this rule is for sequence data. We have implemented versatile and efficient search mechanisms for sequence data, and learning how to manipulate these tools will open to many users their highest level of exploitation of PC-GenoGraphics.
Before getting down to techniques, let us examine sequence data attached to a data-box: To do this, load the file AATEST into PC-GenoGraphics and click the cursor on the object labelled "m2o1", which is a striped rounded rectangle in the left half of the screen inside the large box (map) labelled "RNA1". The text-block should be viewed at the lowest level of saliency:
    > >
and you should see a small block of RNA sequence displayed within the text-block...use the or scrollbar if necessary. Notice how this sequence is displayed: It has a header line that is special. The actual sequence is attached to a line which looks like this
    >>>X 985795
The special character represented above as an X probably looks somewhat different on your screen. The preceding line is also related to the sequence data, but it is more normal and has a higher level of saliency (we always like to have this be at the most salient level) and looks like this:
    >SEQ_RNA start: 985795 end: 985984
Of course this line is visible at all levels of saliency while the unusual line (and the actual sequence attached to it) is visible only at the lowest level of saliency. Notice that the sequence contains the subsequence UUUUGUUCAG in its third block of ten basepairs. Let us rediscover this subsequence with a simple sequence search.

The Simplest Sequence Search

Quit the data-box (command "Q") and invoke the commands:
    Q U Q UUUUGUUCAG C A
The last menu with three pushbuttons (All, Could, Must) defines the range of the genome to search for the desired object. "All" means that the entire genome should be searched regardless of what is visible on the screen. "Must" means to limit the search only to objects which are totally contained within the present screen image. "Could" means to limit the search only to objects of which any part is visible on the present screen. These directives are quite useful in providing PC-GenoGraphics intellectual guidance in speeding up its searching procedures.
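The three range directives can be pictured with the following Python sketch, which uses the genome-fraction coordinates [0.00 , 1.00] introduced in Chapter 4. It is an illustration only: the function name is hypothetical, and the treatment of an object that merely touches the edge of the screen (counted here as not visible for "Could") is our assumption rather than documented behavior.

```python
def in_search_range(obj_start, obj_end, view_lo, view_hi, mode):
    """Decide whether an object is searched under the All / Could / Must directive.

    All coordinates are fractions of the whole genome, from 0.0 to 1.0.
    """
    if mode == "All":
        return True  # the whole genome is searched, whatever is on screen
    if mode == "Must":
        # the object must lie totally within the present screen image
        return view_lo <= obj_start and obj_end <= view_hi
    if mode == "Could":
        # any part of the object is visible (edge contact alone is not counted)
        return obj_start < view_hi and obj_end > view_lo
    raise ValueError("mode must be 'All', 'Could' or 'Must'")


# Screen zoomed to [0.20, 0.60]; the three objects are invented for illustration.
view = (0.20, 0.60)
partly_visible = (0.10, 0.30)
fully_visible = (0.25, 0.50)
off_screen = (0.70, 0.90)

could_hits = [o for o in (partly_visible, fully_visible, off_screen)
              if in_search_range(*o, *view, "Could")]
must_hits = [o for o in (partly_visible, fully_visible, off_screen)
             if in_search_range(*o, *view, "Must")]
```

The narrower the directive, the fewer objects need to be examined, which is the source of the speed-up mentioned above.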
The next to last menu with two pushbuttons (Ok and Cancel) defines whether overlapping matches are regarded as distinct. For instance: searching for AGA in the sequence GCAGAGA will yield only the first match if "Cancel" is issued, but will yield both if "Ok" is issued. The above finds all occurrences of this subsequence on the forward strand of all DNA or RNA sequence entries in the file AATEST.

Notice the information box which appears monitoring progress of your search. We will come to the proper use of this box later. When the search is done, the information box will disappear, the screen will be re-drawn and a new gray bar running the height of the display will appear for each match. Needless to say, the left-to-right position of this bar corresponds to the location of the sought subsequence relative to the length of its containing object and that object's placement along the map to which it is assigned. Although this subregion is not associated with any particular object previously present on the screen (notice in particular that the object m2o1 is not blinking), the gray stripe itself is a new object which is "selected". Thus, for instance, the command to zoom to the next selected object:
    Z Z N
will fill the screen at just the position of the matching subsequence and, of course, the whole screen will be grayed as well.

More Complicated Searches for DNA/RNA Patterns

The previous search was for a precisely known subsequence; rarely do we know precisely what we seek! PC-GenoGraphics offers a wide variety of imprecise searching options. These comprise a small language which is directly based upon that created by Searle: Each DNA/RNA sequence query is structured as a set of text blobs (we call them "punits") separated by space characters. The notion is that a successful match for the whole query requires a consecutive set of subsequences each of which matches the consecutive punits of the query. The simplest example above has only one punit which was "UUUUGUUCAG".
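For a single-punit query of explicit sequence such as this one, the effect of the Ok/Cancel choice on overlapping matches described above can be illustrated with a short Python sketch. It is not the search code of PC-GenoGraphics; the function name is hypothetical, and positions are counted from zero here purely for convenience.

```python
def find_matches(target: str, pattern: str, allow_overlap: bool):
    """Return the start positions (counted from 0) of pattern within target.

    With allow_overlap=False, the scan resumes after the end of each match,
    so a match which overlaps an earlier one is not reported.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    positions = []
    start = 0
    while True:
        hit = target.find(pattern, start)
        if hit < 0:
            return positions
        positions.append(hit)
        start = hit + 1 if allow_overlap else hit + len(pattern)


with_overlap = find_matches("GCAGAGA", "AGA", allow_overlap=True)      # like "Ok"
without_overlap = find_matches("GCAGAGA", "AGA", allow_overlap=False)  # like "Cancel"
```

Here with_overlap holds two positions and without_overlap only the first, in agreement with the AGA example given earlier in this section.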
In general, we allow punits to be defined in several ways:

1.)Explicit Sequence With Ambiguity Codes. We would have matched the same subsequence as above (and possibly other sequences as well) if the query were for YYYYRYYYRR or UUNUGNUNAG. Our complete list of RNA/DNA sequence ambiguity codes follows:

    +------+-------+
    | Code | Match |
    +------+-------+
    |  A   | A     |
    |  B   |  CGT  |
    |  C   |  C    |
    |  D   | A GT  |
    |  G   |   G   |
    |  H   | AC T  |
    |  K   |   GT  |
    |  M   | AC    |
    |  N   | ACGT  |
    |  R   | A G   |
    |  S   |  CG   |
    |  T   |    T  |
    |  U   |    T  |
    |  V   | ACG   |
    |  W   | A  T  |
    |  Y   |  C T  |
    +------+-------+

Notice that "U" and "T" are totally equivalent in our query language.

2.)Alternative Matching Possibilities (OR). We can combine two alternative definitions of one punit by placing them into the construct
    ( punitA | punitB )
This is read "punitA OR punitB" and means that a match is acceptable if EITHER of the two criteria is met. With this construct it is possible (at reduced search speed) to recreate all of the ambiguity codes. For instance
    ((A | C) | G)
is equivalent to the ambiguity code V. Notice that the peptide codons are now uniquely representable using these techniques:

    +---------+---------+--------------+
    | Peptide | Abbrev. | RNA/DNA code |
    +---------+---------+--------------+
    |   Ala   |    A    | GCN          |
    |   Arg   |    R    | (CGN | AGR)  |
    |   Asn   |    N    | AAY          |
    |   Asp   |    D    | GAY          |
    |   Cys   |    C    | TGY          |
    |   Gln   |    Q    | CAR          |
    |   Glu   |    E    | GAR          |
    |   Gly   |    G    | GGN          |
    |   His   |    H    | CAY          |
    |   Ile   |    I    | ATH          |
    |   Leu   |    L    | (TTR | CTN)  |
    |   Lys   |    K    | AAR          |
    |   Met   |    M    | ATG          |
    |   Phe   |    F    | TTY          |
    |   Pro   |    P    | CCN          |
    |   Ser   |    S    | (TCN | AGY)  |
    |   Str   |         | RTG          |
    |   Ter   |         | (TAR | TGA)  |
    |   Thr   |    T    | ACN          |
    |   Trp   |    W    | TGG          |
    |   Tyr   |    Y    | TAY          |
    |   Unk   |    X    | NNN          |
    |   Val   |    V    | GTN          |
    +---------+---------+--------------+

3.)Ellipses. One frequently wants to allow some number of base positions to be "skipped over", i.e. to be matched regardless of what their identity is.
An instance is that one often does not care about the identity of the nucleotides in the loop when searching for the classical sequence indicating a "hairpin" in RNA secondary structure. Our query punit which does this is the ellipsis, such as:
    4...16
which is two non-negative integers separated by three periods; the first cannot be larger than the second. An explicit use of an ellipsis is to search for certain hairpins, such as:
    AAGCT 4...6 AGCTT
Notice that this query has three punits, the middle of which is an ellipsis, and the outer two of which are reverse complements. It will match any hairpin with the specified sequences on both sides of the ladder and any length of loop ranging from 4 to 6, inclusive. Notice that the above query produces exactly the same result as the more complicated three punit query:
    AAGCT ((NNNN | NNNNN) | NNNNNN) AGCTT

4.)Specified Limits on Mismatches, Inserts and Deletes. Any single punit can be matched imprecisely by specifying the maximum number of mismatches, insertions and deletions which may be required to transform the target into a precise match to the specified subsequence. Our language allows this to be specified by appending a bracketed addendum such as:
    AGCTT[1,2,3]
The order of the numerical arguments is
    [#Mismatches , #Insertions , #Deletions]
Virtually any target subsequence will match the preceding punit because the [1,2,3] condition is so sloppy compared to the length of the specified sequence AGCTT.
Examples of imprecise matches to AGCTT are given below:

    +-----------+---------------+---------+
    | Specified | [Mis,Ins,Del] | Matches |
    +-----------+---------------+---------+
    |   AGCTT   |    [1,0,0]    | AGCTT   |
    |   AGCTT   |    [1,0,0]    | AGTTT   |
    |   AGCTT   |    [1,0,0]    | ACCTT   |
    |   AGCTT   |    [1,0,0]    | AGCCT   |
    |   AGCTT   |    [1,0,0]    | AGCAT   |
    |   AGCTT   |    [0,1,0]    | AGCTT   |
    |   AGCTT   |    [0,1,0]    | AGACTT  |
    |   AGCTT   |    [0,1,0]    | AGCTAT  |
    |   AGCTT   |    [0,1,0]    | AAGCTT  |
    |   AGCTT   |    [0,0,1]    | AGCTT   |
    |   AGCTT   |    [0,0,1]    | ACTT    |
    |   AGCTT   |    [0,0,1]    | AGTT    |
    |   AGCTT   |    [0,0,1]    | AGCT    |
    +-----------+---------------+---------+

Notice that the first and last nucleotides always are precisely matched and that the exact match always matches.

5.)Weighted Matching. Especially when recognizing certain motifs, etc., it is convenient to allow quite general scoring algorithms to be implemented. At each position, we need not require any specific nucleotide to be present, but can instead accumulate a score based upon what is there, and accept as a match any target pattern which exceeds a specified score.
    (20,40,10,30)
accumulates a score of 20 for A, 40 for C, 10 for G, and 30 for T, respectively. A series of these scores is accumulated within curlies and is tested as follows:
    {(20,40,10,30),(10,10,80,0),(22,28,43,7)}>60
will be matched by NGN, CTA, CGC, etc. It will not be matched by GTT, TAT, CAT, etc.

6.)Labelled Punits. It is convenient to save re-typing sequence patterns by assigning labels to various punits. This is done in our language by the syntax
    p6=
where the 6 could be replaced by any non-negative integer. The "p6" could be used at any point in later definitions within the query to stand for another copy of the punit to which it was attached with the "=". For example:
    p1=AAGCT GAG p1
is precisely the same as
    AAGCT GAG AAGCT
or, of course
    AAGCTGAGAAGCT

7.)Reverse Complement Operator.
Not only can we use the labels on punits to call out for repeats of some previously specified punit, but also for the reverse complement of some previously specified punit. This is particularly useful in specifying searches for substrings with likely secondary structure such as our first hairpin search above:
    AAGCT 4...6 AGCTT
which could have been written more compactly as:
    p1=AAGCT 4...6 ~p1
A more general use of this construct is to find ANY hairpin with ladder of length ranging from 10 to 12 and loop ranging from 4 to 8:
    p1=10...12 4...8 ~p1

Searching Peptide Sequence

In order to activate peptide searching mode, you will issue commands:
    Q T P
Now all subsequent sequence queries will use peptide protocols. These are a subset of the above DNA/RNA protocols, modified as follows:

1.)Explicit Sequence With Ambiguity Codes. All letter codes are now considered to be the standard peptide abbreviations and "1...1" becomes the only "wildcard" character:

    +---------+---------+
    | Peptide | Abbrev. |
    +---------+---------+
    |   Ala   |    A    |
    |   Arg   |    R    |
    |   Asn   |    N    |
    |   Asp   |    D    |
    |   Cys   |    C    |
    |   Gln   |    Q    |
    |   Glu   |    E    |
    |   Gly   |    G    |
    |   His   |    H    |
    |   Ile   |    I    |
    |   Leu   |    L    |
    |   Lys   |    K    |
    |   Met   |    M    |
    |   Phe   |    F    |
    |   Pro   |    P    |
    |   Ser   |    S    |
    |   Thr   |    T    |
    |   Trp   |    W    |
    |   Tyr   |    Y    |
    |   Unk   |  1...1  |
    |   Val   |    V    |
    +---------+---------+

2.)Alternative Matching Possibilities (OR). We can combine two alternative definitions of one punit by placing them into the construct
    ( punitA | punitB )
This is read "punitA OR punitB" and means that a match is acceptable if EITHER of the two criteria is met.

3.)Ellipses. One frequently wants to allow some number of peptides to be "skipped over", i.e. to be matched regardless of what their identity is. Our query punit which does this is the ellipsis, such as:
    4...16
which is two non-negative integers separated by three periods; the first cannot be larger than the second.

4.)Specified Limits on Mismatches, Inserts and Deletes. Not implemented for peptides.
5.)Weighted Matching. NOT YET IMPLEMENTED FOR PEPTIDES.

6.)Labelled Punits. NOT YET IMPLEMENTED FOR PEPTIDES. It is convenient to save re-typing sequence patterns by assigning labels to various punits. This is done in our language by the syntax
    p6=
where the 6 could be replaced by any non-negative integer. The "p6" could be used at any point in later definitions within the query to stand for another copy of the punit to which it was attached with the "=". For example:
    p1=ELVIS GAG p1
is precisely the same as
    ELVIS GAG ELVIS
or, of course
    ELVISGAGELVIS

7.)Reverse Complement Operator. This is not applicable to peptides.

NOTICE THAT WHEN YOU WISH TO REVERT TO SEARCHING DNA/RNA SEQUENCES, YOU MUST TOGGLE:
    Q T D

CHAPTER 7 USING PC-GenoGraphics AS AN INTERACTIVE LOGBOOK

In this chapter, the user will learn how to attach annotations to objects in PC-GenoGraphics. Also, some pithy hints on how to organize such annotations are provided.

So far, the capabilities of PC-GenoGraphics which we have considered have been aimed at reasoning with "dead" datasets which apparently must have been provided in the original distribution of PC-GenoGraphics! In fact, the user is allowed considerable leeway to annotate and modify existing datasets and even to create totally new maps. The first class of this interactive updating of the datasets is simple annotation of objects, which implements what we consider to be an elementary interactive logbook.

Update Files: What they Mean and How to Use them

Restart PC-GenoGraphics from the DOS command line, but this time we will open a new "update file" (called MYADDS.UPD in this example) to hold our additions to the distribution file AATEST. This is done by the following sequence of commands:
    CD C:\GG (or whatever area you actually used)
    C:
    GG
    G A MYADDS G
The first thing to understand about an update file like MYADDS.UPD is that IT APPLIES ONLY TO THE DATAFILE *.ALL FOR WHICH IT WAS CREATED.
You will notice that, although we chose to create a totally new update file called MYADDS.UPD for this session, we might have considered continuing updates which were started by the authors of PC-GenoGraphics at Argonne National Laboratory. Three such files are included on the distribution: MORGAN.UPD, RAY.UPD, and ROSS.UPD. Had you tried to select ROSS.UPD or RAY.UPD, PC-GenoGraphics would have balked. The reason is that ROSS.UPD updates the file ECOLI_H1 while RAY.UPD updates the file COLORS. In the present tutorial, we have chosen to create a totally new update file, MYADDS.UPD, rather than continuing our addenda onto the file MORGAN.UPD. For reasons which will become apparent later, it is also not allowed for any new update file to have the same first name as the datafile, *.ALL, which it updates (AATEST.UPD in the present case).

With our update file open, let us now perform some simulated scholarship by appending annotations onto object m1o3. Navigate the cursor to point at object m1o3, and click the cursor on it. This brings up the data-box attached to the object, and at the highest level of saliency there is only one comment line, which is about gold. Next, go down in saliency:
    > >
to see the full text-block of a few lines. Now to start adding annotation, we hit the "Add" button
    A
and a new class of data-entry box will appear. At this point, you can start entering any style of text material which you wish to append to the text-block in the data-box attached to object m1o3. A crude text editor with simple arrowkey navigation, etc. is enabled; the main limitation is that the total amount of information added before hitting "Ok" cannot exceed about 3000 characters. Notice that a suggested first line containing a datestamp has been tentatively included in your update.
Rather than just entering text arbitrarily, WE STRONGLY RECOMMEND THAT YOU ADOPT THE FOLLOWING SORT OF PROTOCOL WHEN ANNOTATING FILES:

1.)All users who wish to annotate a given file share the SAME *.UPD file.

2.)The timestamp line should ALWAYS be included in every update (even empty updates).

3.)The next line (after the timestamp line) contains an identification AT THE HIGHEST LEVEL OF SALIENCY of who is doing the update, like this:
>RAY_UPDATE
or
>MORGAN_UPDATE
and subsequent lines obey the full ">" protocol to identify their saliency and contain keywords that start with the name of the scholar and include other key information afterwards:
>>RAY_GOLD_TACTICS
 This price seems kind of low relative to Singapore on the same day
>>RAY_GOLD_STRATEGY
 The announcement of new supplies in the long term from Africa,
 suggest a moderate long-term downtrend. See object m3o1 on map
 PEP for further information.
Notice how the continuation lines for each of the two headlines are indented with leading " " characters. Notice how each headline starts with a number of ">" characters to indicate the saliency of it and its continuation lines. Notice how the keywords RAY_GOLD_TACTICS and RAY_GOLD_STRATEGY are chosen to facilitate search in an orderly way.

4.)Bitter experience has shown us that larding the text-blocks in data-boxes with blank lines with the intention of aiding readability is actually quite COUNTERPRODUCTIVE.

Add some set of comments like the above to object m1o3. When done, close the data entry box. This requires using the mouse (or and arrowkeys) to navigate to the "okay" button and click a mouse button (or hit ). The visual screen should return and now, when you click the mouse on object m1o3, you should see (when the saliency level has been chosen correctly) the information which you just added, apparently placed democratically at the end of the visual text-block. Further annotations of this kind will accumulate with their own date stamps, etc.
right below the ones you have just added.

In fact, in the present release of PC-GenoGraphics, the annotations added as above are not eligible to be found by "Select Contents" commands. This limitation will, no doubt, be overcome in later releases, but at present you will need to "re-compile" your updates to create a new *.ALL file in which your annotations are completely democratically included. To perform this procedure, you need to exit PC-GenoGraphics, returning to the DOS prompt by hitting when no windows obtrude on the graphic map display or by navigating the cursor into the box in the upper right corner and clicking the cursor. Compiling the corrections is a two-step process requiring invoking "GGALLZWD first_name_of_ALL_file first_name_of_UPD_file", followed by invoking "GGTRANS first_name_of_UPD_file". In our tutorial example this is precisely:
    GGALLZWD AATEST MYADDS
    GGTRANS MYADDS
After this computation, which ranges from non-trivial to truly massive in size, two new files will emerge: MYADDS.ZWD from the first step, and MYADDS.ALL from the second. ONE THING TO KEEP IN MIND IS THAT THE LARGE COMPUTATIONS IMPLIED BY THIS ACTIVITY REQUIRE TRULY SIGNIFICANT AMOUNTS OF DISK SPACE. You can erase the no longer needed intermediate file MYADDS.ZWD without any loss at this point if disk space is a problem.

The next invocation of GG will offer a new file to select from on just the same footing as our original AATEST, namely, MYADDS. Invoke GG now, and at the presentation of the input file name menu, navigate the cursor to point at the name MYADDS.ALL in the scrollable list, and click the cursor. The name MYADDS.ALL should appear in the small box at the top, which indicates that it will be loaded when the "Go" button is pressed. Before the visual display comes up, another menu is offered, this time allowing you to select one of the *.UPD update files from which to draw addenda. You will see MYADDS.UPD among this list.
Of course, that is not even an eligible selection at this point because MYADDS.UPD applies updates to AATEST.ALL, and you have selected the file MYADDS.ALL to display. In fact no *.UPD file is at present eligible, although you could always open a totally new update file (using the button "Anew") for upcoming annotations to MYADDS.ALL. Whether or not you choose to open another update file, the new comments which we appended above are now accessible by Select on Contents just like any other comment. Thus the commands:

    S C RAY_GOLD A

will select the object m1o3, and if you open its data-box, the "Find" button will function as normal for the new comments:

    > > F AFRICA

will locate the strategic addendum entered above.

CHAPTER 8 USING PC-GenoGraphics AS A VISUAL REASONING TOOL FOR SEQUENCE

In this chapter, the user will learn how to create new maps based on the results of sequence queries. This may be the most elaborate capability of PC-GenoGraphics and will require study from any user who cares to use it.

In Chapter 4 we discussed how to search for subsequences within the sequence data attached to objects' data-boxes. At that point these searches seemed to be rather dead-end. The vertical stripes at the positions of the sequence matches grayed out the correct part of the screen, but (apart from being able to zoom to the next selected object, etc.) no further recourse to these sequence matches was evident in that presentation. In this chapter we will learn how to promote those sequence matches to fully viewable and queryable objects on a new map.

This method of computing new maps has been rather well automated, but you will notice that considerable resources are required from your PC. If you intend to do much of this class of action with our larger datasets you will probably need a fast 386 or 486 machine with much disk space free in a single partition (about 30 MB for E.coli and about 150 MB for the whole human genome).
On the other hand, working fluently with viruses and plasmids ought to be possible on pretty much any platform.

The procedure is simple enough. Let us re-load AATEST:

1.) Re-load AATEST with a new update file:

    CD C:\GG     (or whatever)
    C:
    GG
    G A ABTEST G

At this point a screen looking like Fig. 8 should appear.

2.) Open a new map to take the computed locations:

    Q M

3.) Perform a sequence query:

    Q Q UUUUGUUCAG C A

At this point a couple of vertical gray stripes should appear.

4.) Save these positions into ABTEST.UPD and exit PC-GenoGraphics:

    Q S

At this point we have included the description of a totally new map containing two new objects which are the two sequence matches. The other information entered above will be included strategically into the new map as well.

5.) Create the new files ABTEST.ZWD and ABTEST.ALL:

    GGALLZWD AATEST ABTEST
    GGTRANS ABTEST

6.) And, if desired, erase the unnecessary file:

    ERASE ABTEST.ZWD

Now when you activate PC-GenoGraphics, you will be offered a new file to view, ABTEST.ALL. This file will contain the new map.

CHAPTER 9 WORKING REPETITIVE PROCEDURES FROM SCRIPT FILES

In this chapter, the user will learn how to invoke pre-prepared sets of queries or any other class of GenoGraphics commands, for that matter. In addition, explicit examples will be given for the most frequent application of this capability, namely, launching large batteries of sequence queries and saving their results.

In many cases, the user may wish to perform a long set of PC-GenoGraphics commands which require a good deal of computation time and which accumulate their consequences in a way that the user need not be present except to issue the commands. An ideal example of this would be for users who generate various pieces of new sequence data and then want to perform the same set of sequence queries on each one of these. The first step, of course, would be to cast the new data into a form suitable for use by PC-GenoGraphics (See Chapters 11 and 12).
Then one could, in principle, memorize the long list of queries and enter each one successively and have PC-GenoGraphics show the results interactively. But, in fact, this job is best done by a script of PC-GenoGraphics commands stored once in a file. This chapter will teach you to use such files and to make new ones of your own.

Let us consider a precise set of commands which would create a new map called "EcXY" in the currently active update file, then search the whole genome for all occurrences of the DNA subsequence "GATTCGATTC" on either strand, without overlaps, save the answers as blue arrows on two different submaps in the presently open map in the active update file, and finally close that open map. In general, this takes the commands:

    Q T D B Q M MINE My GAATC query, like my advisor taught me. Q Q GAATCGAATC C A Q S rightarrow solid blue FWD Q Q GATTCGATTC C A Q S leftarrow solid blue REV Q E

This whole procedure is probably within reason for human entry once or twice, but if you want to do this a thousand times, you should use your word-processor to prepare a file, *.INP, perhaps MINE.INP in this case, which contains the following:

    Q T D B Q M "MINE" "My GAATC query, like my advisor taught me." Q Q "GAATCGAATC" C A Q S "rightarrow" "solid" "blue" "FWD" Q Q "GATTCGATTC" C A Q S "leftarrow" "solid" "blue" "REV" Q E

Now, whenever you want to do this query, you issue:

    F I MINE.INP

and your PC will writhe around doing your glorious query for as long as it takes. One query like the above takes a minute or two on a modern PC, maybe ten minutes or more on an old one.

We have provided a few sample queries in our distribution. Thousands of such queries have been compiled by David Ghosh of NCBI and translated by ourselves and are available separately from us. A thousand queries is an overnight job, but the whole task runs unattended.
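The two query strings in the script above are reverse complements of one another, which is how one subsequence gets searched on both strands. As a minimal sketch, assuming the script file holds one menu command per line (that line division, and the function names, are assumptions made for illustration), a short Python program could generate the body of such a file for any query string:

```python
# Sketch only: builds the text of a *.INP script for one two-strand query.
# The one-command-per-line layout is an assumption, not a documented format.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def build_inp_lines(map_name, description, forward_query, color="blue"):
    """Return script lines mirroring the MINE.INP example in the text."""
    reverse_query = reverse_complement(forward_query)
    return [
        "Q T D B",
        f'Q M "{map_name}" "{description}"',
        f'Q Q "{forward_query}" C A',
        f'Q S "rightarrow" "solid" "{color}" "FWD"',
        f'Q Q "{reverse_query}" C A',
        f'Q S "leftarrow" "solid" "{color}" "REV"',
        "Q E",
    ]


def write_inp(path, lines):
    """Write the script lines to a file using DOS line endings."""
    with open(path, "w", newline="\r\n") as handle:
        for line in lines:
            handle.write(line + "\n")
```

Calling write_inp("MINE.INP", build_inp_lines("MINE", "My GAATC query, like my advisor taught me.", "GAATCGAATC")) would reproduce the tutorial example; generating one file per query string is a convenient way to prepare a large battery of queries.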
CHAPTER 10 PRINTING THINGS AND DOING OTHER THINGS

In this chapter we will learn how to make hardcopy of PC-GenoGraphics displays and how to exit PC-GenoGraphics temporarily to do simple DOS tasks like getting directory lists, printing documents, viewing or deleting files, etc.

If your PC has a printer attached you may be able to get hardcopy images of PC-GenoGraphics screens. We support only the two most common standards of printers, namely the Epson standard for dot-matrix printers and the HP-LJ2 standard for laser printers. These two standards cover most but not all PC printers. We regret that supporting other printers is not cost-effective. As of this writing, the street price for an excellent laser printer is about $600.

When you have a PC-GenoGraphics screen which you want to save, the commands are in the "Files" menubar:

    F P

At this point, you will be offered a large pushbutton panel to describe your printer. IF YOU DESCRIBE IT INACCURATELY AND CONTINUE THE PRINT ACTIVITY, IT IS POSSIBLE TO HANG YOUR PC, REQUIRING A RE-BOOT. You can exit this menu without printing by hitting . You need to know:

1.) Whether your printer is attached to a serial port or to a parallel port: The parallel port connector to your PC is nearly two inches wide, while the serial port connector is about 1 inch wide.

2.) Which port your printer is connected to. LPT1 is the most common for parallel ports; COM1 or COM2 are the most common for serial ports.

3.) How large you want the image to be. For Epson printers "Low" is the largest, and "Hi" is the smallest. For HP-LJ2 printers, "75" is the largest and "300" is the smallest. The actual size of your image is determined by a combination of these choices and the resolution of your screen. Try the smallest sizes first and escalate while your images fit on a single sheet.

4.) The orientation of the image on the paper. Portrait mode places the image so that it would appear properly oriented in a normal book page. Landscape mode is sideways.
When these selections have been made, hit the button "Print"; the screen should re-paint (in monochrome) while the data are transferred to your printer. This process can take a few minutes. Completion is announced by your screen returning to normal.

The above is the only way PC-GenoGraphics supports hardcopy, and it applies only to graphic screen images into which no menus or information boxes obtrude. If you wish, for instance, to get a hardcopy of the contents of a data-box, the above procedure is no use. This sort of function is done with the DOS shell command in PC-GenoGraphics. First you must save the desired text-block with "dumP" while the data-box is open; you will be asked to name the destination file; say you called it MYFILE.TXT. Then quit the data-box ("Quit"), invoke the DOS shell, and issue the print command:

    F S
    PRINT MYFILE.TXT

IT IS ONLY FAIR TO WARN YOU THAT THE PERFORMANCE OF THIS DOS SHELL IS SOMEWHAT TRICKY AND MANY POSSIBILITIES WILL HANG YOUR MACHINE, REQUIRING A RE-BOOT. For instance, the above PRINT command will probably hang your machine unless you have enabled your printer by issuing the command:

    PRINT

to the DOS prompt BEFORE activating PC-GenoGraphics. The general rule in using the DOS shell from PC-GenoGraphics is to learn a repertoire of simple tasks that work, and if losing your session would be unacceptable, AVOID USING THE DOS SHELL altogether. It is always possible to exit PC-GenoGraphics totally, do your DOS task, and re-activate PC-GenoGraphics afterwards.

Last, and not least, is the question of setting the text labels on the PC-GenoGraphics graphics screen to their most legible size. PC-GenoGraphics allows three different fonts (Small, Medium, and Large) to be substituted. Obviously, using too large a font will make some of the labels so big that they are unable to fit within their natural location, so that they are omitted altogether from the display. Also, some PCs may have too little memory available to hold the larger font sizes.
To select the medium size font:

    S F M

If too much memory is required, PC-GenoGraphics will notify you and switch back to the smallest size.

CHAPTER 11 THE EASY WAY TO START YOUR OWN *.ALL FILE

If you are in the fortunate position of creating or curating completely sequenced objects, we have provided a shortcut for installing your data into PC-GenoGraphics. You first prepare your sequence data in some file, say MYSEQ.TXT, with a word-processor in the format

-----------------------------------------------------------(start)
Arbitrary line of information
ACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTAC
ACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTAC
ACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTAC
ACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTAC
ACGTACGTACACGTACGTACACGTACGTACACGTACGTACACGTACGTACA
-------------------------------------------------------------(end)

Notice, 60 characters of sequence per line. Notice, one and only one information line at the top. Notice, NOTHING ELSE ALLOWED. Now issue the DOS command:

    GGTXTZWD MYSEQ.TXT

and answer the questions about the type of sequence and its strand. Now you will have a new file, MYSEQ.ZWD, which is further processed with the DOS command:

    GGTRANS MYSEQ

which produces the new file MYSEQ.ALL, which is viewable and queryable on an equal footing with any other *.ALL file. Usually the next thing to do would be to compute some site maps with a set of canned query files as in Chapter 9.

CHAPTER 12 GORY DETAILS ABOUT *.ZWD FILES AND *.ALL FILES AND *.UPD FILES

In this chapter, the user will learn the maximum capabilities to manipulate and create totally new or customized maps for PC-GenoGraphics. Very few users will have need for this material.

So far, we have only considered the sorts of operations which can be performed starting with some *.ALL file which is included in your distribution. This is not a limitation of PC-GenoGraphics.
In fact, you can create arbitrary maps, submaps, and objects of your own, either to add into the distributed *.ALL files, or else to create your own totally independent *.ALL files. This chapter will outline how to do arbitrary compilations which are fully visualizable, annotatable, and queryable using PC-GenoGraphics. This aspect of PC-GenoGraphics use is what we call curatorship, and you will find that distributing your compiled *.ALL files (together with PC-GenoGraphics) allows your correspondents an unparalleled access to your intellectual work.

The structure which we have seen previously has concentrated upon the *.ALL file (AATEST, for example), perhaps as updated by one of its corresponding *.UPD files (MYADDS.UPD, for example). Recall that, if fully democratic querying and display capability is required, the combination can be promoted to a new *.ALL file by DOS commands as follows:

    GGALLZWD AATEST MYADDS
    GGTRANS MYADDS

Recall, further, that the first line above creates an intermediate file (MYADDS.ZWD in this example) and that the second line creates the final file (MYADDS.ALL in this example).

In fact, you could have done the above procedure even though no *.UPD file exists (you are first asked to confirm that this is what you want to do). Let us do this for AATEST:

    GGALLZWD AATEST BBTEST
    Y

This creates the intermediate file (BBTEST.ZWD in this example) which corresponds precisely to AATEST.ALL without any modifications at all. Of course one could, at this point, issue the GGTRANS BBTEST command and recover a new file, BBTEST.ALL, which contains precisely the same information as AATEST.ALL. In fact, what we wish to do is to examine the "intermediate" file, BBTEST.ZWD. If possible, print out a copy of BBTEST.ZWD (6 pages total) to facilitate this part of our tutorial.
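The two-step promotion described above follows a fixed naming pattern, which can be sketched in Python as follows. The function name and the 8-character check (the usual DOS limit on the first name of a file) are illustrative assumptions; the sketch only builds the command lines and does not run them:

```python
# Sketch only: spells out the command lines and file names of the
# GGALLZWD / GGTRANS promotion. Nothing here invokes the actual programs.

def compile_steps(all_name, upd_name):
    """Return (commands, intermediate_file, final_file) for one promotion.

    all_name and upd_name are first names without extensions,
    e.g. "AATEST" and "MYADDS".
    """
    for name in (all_name, upd_name):
        # Assumed check: DOS first names are 1 to 8 characters, no dot.
        if not 1 <= len(name) <= 8 or "." in name:
            raise ValueError(f"not a plain DOS first name: {name!r}")
    commands = [
        f"GGALLZWD {all_name} {upd_name}",  # step 1: creates upd_name.ZWD
        f"GGTRANS {upd_name}",              # step 2: creates upd_name.ALL
    ]
    return commands, f"{upd_name}.ZWD", f"{upd_name}.ALL"
```

For the tutorial example, compile_steps("AATEST", "MYADDS") yields the two commands shown above together with the file names MYADDS.ZWD and MYADDS.ALL.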
Describing Objects:

Skip down to the 10th and 11th lines, a somewhat inauspicious looking business:

    m1o1 DNA1|s1o1 1 12.000000 44.000000 rightarrow solid cerise
    0000000093

This specifies the visual aspects of the object labelled "m1o1", which is the furthest to the upper left in the dataset specified by BBTEST.ZWD, i.e. the right-pointing triangle in Fig 1. This entry breaks into 8 strings of characters (separated by spaces) on the first line and one more string on the next line. The first line specifies the visual aspect of this object:

    m1o1        The name which appears on the object when space allows.
    DNA1|s1o1   Another name used only for linking connections to it.
    1           Which submap this object lies in (NO DECIMAL POINT).
    12.000000   Left coordinate of the object.
    44.000000   Right coordinate.
    rightarrow  Object shape.
    solid       Object fill pattern.
    cerise      Object color.

Clearly we need to understand what this submap number means and what coordinate system is used, but the rest is clear. The second line specifies how many of the lines below define the data-box attached to this object:

    0000000093  93 lines of data follow (NO DECIMAL POINT).

Needless to say, the next 93 lines (which are dominated by a long DNA sequence) define what comes up in this object's data-box. After these lines, the next object description starts:

    m1o2 DNA1|s1o2 1 44.000000 64.000000 ldee solid blue

Let us examine the structure of the 93 lines attached to the first object:

----------------------------------------------------------(start)
>GOLD
>ACCESSION M69872
ref: EMBO J. 137:183-193(1998)
 Notice obvious Typographical error here
>SEQ_DNA start: 982632 length: 4246
 000982682 GATCCCTCGTTCCGTCTTGTCGGAACTGGATATGATGGTCGGGAAAATCC
 000982732 TCTGTTATCTCTATCTACGCCGGAACGGCTGGCGAATGAGGGGATTTTCA
 000982782 CCCAGCAGGAACTGTACGACGAACTGCTCACCCTGGCCGATGAAGCAAAA
     :
 000986832 GAAACTGGCGATCAGTTCCCGCAGCGTGGCGAACATCATTCGCAAAACCA
 000986882 TTCAGCGCGAGCAGAACCGTATCCGTATGCTCAACCAGGGGTTGCA
>>GOLD_PRICES
 Selected world gold prices, Monday:
 Hong Kong late: $355.65, off $1.50.
----------------------------------------------------------(end)

The first four lines are textual data. Notice that they are arranged into three logical comment lines (as discussed in CHAPTER 4) with successive keywords "GOLD", "ACCESSION", and "ref:". Notice that the last line is a continuation of the logical comment line headed by the keyword "ref:". Notice that all three logical comment lines are at the same level of saliency (the highest level). In general, textual data are quite unrestricted in format and content, with the following limitations:

1.) Lines are no more than 64 characters wide.

2.) Only honest text characters are allowed:

    A-Z a-z 0-9 ~!@#$%^&*()_+{}|:"<>?`-=[]\;',./

In particular , , , and characters are NOT allowed.

3.) New logical comment lines are headed by a left-justified line with some number (zero to three, inclusive) of ">" characters identifying the saliency, followed by a character and any desired legal text characters.

4.) Continuation lines within a logical comment line are always headed by at least one character.

If you obey these restrictions and choose your keywords and saliency levels intelligently, your data will be highly accessible to any interested scientist.
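The limitations on textual data listed above lend themselves to a mechanical check before a file is compiled. The following Python sketch tests single lines against the width limit, the list of legal characters, and the limit of three ">" characters. It assumes that the space character is legal and that a continuation line is one beginning with a space; the function name is likewise hypothetical:

```python
import string

# Legal characters as listed in the text; including the space is an assumption.
ALLOWED = set(
    string.ascii_letters
    + string.digits
    + "~!@#$%^&*()_+{}|:\"<>?`-=[]\\;',./"
    + " "
)
MAX_WIDTH = 64


def check_text_line(line):
    """Return a list of problems found in one textual data line (empty if none)."""
    problems = []
    if len(line) > MAX_WIDTH:
        problems.append("line is wider than 64 characters")
    bad = sorted({ch for ch in line if ch not in ALLOWED})
    if bad:
        problems.append("illegal characters: " + repr(bad))
    if not line.startswith(" "):
        # A headline: count the leading '>' characters that mark saliency.
        depth = len(line) - len(line.lstrip(">"))
        if depth > 3:
            problems.append("more than three '>' characters")
    return problems
```

Running every data-box line through check_text_line before invoking GGTRANS would catch tab characters and over-long lines, which are easy to introduce with a word-processor.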
Sequence data are much more restrictive in format:

    >SEQ_DNA start: 982632 length: 4246

Any given sequence fragment must lie within a single logical comment line with a header having one of the following keywords:

    CCW__DNA  CCW__PEP  CCW__RNA
    CCW_DNA   CCW_PEP   CCW_RNA
    CW__DNA   CW__PEP   CW__RNA
    CW_DNA    CW_PEP    CW_RNA
    FWD_DNA   FWD_PEP   FWD_RNA
    REV_DNA   REV_PEP   REV_RNA
    SEQ_DNA   SEQ_PEP   SEQ_RNA

These define the sequence type (DNA, RNA, or peptide) and its strand or its direction of expression. As usual, a character following the keyword can be followed (within the same line) by arbitrary text.

The very next line must be of very precisely determined format: It MUST be led by a character. This MUST be followed by a (multi-digit) integer (NO DECIMAL POINT) which identifies the sequence position of the FIRST element of sequence; if the sequence is "unplaced", this value MUST be -1. This integer MUST be followed by one or two characters. Then must follow EXACTLY 50 characters of sequence, NO characters, etc. are allowed. NO OTHER INFORMATION can follow at the end of a line.

This same format is followed for each subsequent line of sequence data until the last one which, of course, need not have the full 50 characters of sequence. NO BLANK LINES can intervene. After the last line of sequence, the next line must be the headline for a new logical comment line or else it must start the description of the next object; NO BLANK LINES can intervene. Last, and not least, the linecount (93 lines of data in this example) MUST be correct!

Notice that this sequence data entry is followed, in our example, by another logical comment line (with two continuation lines):

    >>GOLD_PRICES
     Selected world gold prices, Monday:
     Hong Kong late: $355.65, off $1.50.

Describing Maps and Submaps:

Now scan upwards to lines 5 through 9 of your file BBTEST.ZWD:

    MAPBEGIN
    DNA1 0.000000 100.000000 1 black FALSE
    Full name of map 1
    DNA
    0000000000

These lines describe the first map (of three) in the file BBTEST.ZWD.
A new map description is ALWAYS introduced by the MAPBEGIN line. The next line contains 6 strings (separated by characters); in this example, these are:

    DNA1        Map name (appears to the left when possible)
    0.000000    Lowest coordinate on this map
    100.000000  Highest coordinate
    1           Number of submaps (NO DECIMAL POINT)
    black       Outline color for all objects on this map (MUST BE "black")
    FALSE       Hint about display: TRUE means "This map can be vertically compressed", FALSE means "This map should not be vertically compressed"

The next line:

    Full name of map 1

which can contain no more than 60 characters, is an abbreviated description of the map which allows the end-user to decide whether this map is of interest or not. This is followed, in this example, by two more lines, one indicating the name of the (only) submap on this map, and the next telling how many lines follow giving the long description of the map (in this case, none).

    DNA
    0000000000

A more general example of map description is for the third map in the file BBTEST.ZWD:

    MAPBEGIN
    PEP 0 100 3 black FALSE
    Full name of map of peptide data
    PEP1
    PEP2
    PEP3
    1
    Miscellaneous map information

Notice here that three submaps are called out, that three lines follow the second line, one to name each of these submaps, and that there is one line in the longer description of the map. Another totally crucial line tells when the description of one map actually ends:

    MAPEND

Your *.ZWD file will never succeed until there is one MAPEND line for each MAPBEGIN line!

Describing Connections Between Objects:

Last, and not least, you have the capability to describe connections between objects on various maps and submaps. The section in BBTEST.ZWD which does this is:

    CONNBEGIN
    DNA1 DNA1|s1o1 PEP PEP|s2o1 0
    DNA1 DNA1|s1o1 PEP PEP|s1o2 0
    RNA1 RNA1|s1o3 DNA1 DNA1|s1o2 1
    PEP PEP|s3o3 DNA1 DNA1|s1o3 2
    RNA1 RNA1|s1o1 DNA1 DNA1|s1o4 3
    PEP PEP|s3o2 DNA1 DNA1|s1o5 0
    CONNEND

Notice that this section starts with CONNBEGIN and ends with CONNEND.
Your *.ZWD file CANNOT BE INTERPRETED IF IT DOES NOT HAVE THESE, even if no connections are described between. The present example describes 6 connections. Notice that a connection connects two objects and that the description has two strings for each object. These two strings are somewhat redundant, the first is the name of the map containing the object, such as: PEP while the second is made up from this same map name, a "|" character, an "s" character, an integer (NO DECIMAL POINT) identifying which of that map's submaps contains this object, an "o" character, and another integer identifying which object on the submap is desired: PEP|s2o1 In addition, each connection has another string which is an integer whose value does not matter at present. If you leave this out, however, your *.ZWD file will never succeed! This extremely rigid format definition reflects the original intent of the *.ZWD file, namely, to facilitate automatic transfer of data from orderly databases to PC-GenoGraphics. For small sized datasets, the motivated scientist who is willing to hew to this format will be able to create arbitrary maps, etc. using a standard word-processor or text editor to create the *.ZWD file. We are preparing visual tools which allow the user to create maps, etc. without any knowledge of the underlying file formats, etc.
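The connection format described under "Describing Connections Between Objects" is rigid enough to be checked mechanically before a hand-edited *.ZWD file is compiled. The following Python sketch parses a CONNBEGIN ... CONNEND section and verifies that each connection has five strings and that each object identifier repeats its map name. The function names and the shape of the returned dictionary are assumptions made for illustration:

```python
import re

# An object identifier: map name, "|", "s", submap number, "o", object number.
OBJECT_ID = re.compile(r"^(?P<map>[^|\s]+)\|s(?P<submap>\d+)o(?P<object>\d+)$")


def parse_connection(line):
    """Parse one connection line into a dict, or raise ValueError."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError("a connection line needs exactly 5 strings")
    map_a, id_a, map_b, id_b, trailing = fields
    ends = []
    for map_name, object_id in ((map_a, id_a), (map_b, id_b)):
        match = OBJECT_ID.match(object_id)
        if match is None or match.group("map") != map_name:
            raise ValueError(f"bad object identifier: {object_id}")
        ends.append((map_name, int(match.group("submap")), int(match.group("object"))))
    if not trailing.isdecimal():
        raise ValueError("the fifth string must be an integer")
    return {"from": ends[0], "to": ends[1], "value": int(trailing)}


def parse_section(lines):
    """Parse a whole section, including its CONNBEGIN and CONNEND lines."""
    if len(lines) < 2 or lines[0].strip() != "CONNBEGIN" or lines[-1].strip() != "CONNEND":
        raise ValueError("section must start with CONNBEGIN and end with CONNEND")
    return [parse_connection(line) for line in lines[1:-1]]
```

Applied to the section from BBTEST.ZWD, parse_section returns six connections; a line with the trailing integer left out is rejected, matching the warning in the text.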