27.5. Being Prepared
The incident response plan is not the only thing that you need to
have ready in advance. You need to set up a number of practices and
procedures so that you'll be able to respond quickly and
effectively when an incident occurs. Most of these procedures are
general good practice; some of them are aimed at letting you recover
from any kind of disaster; and a few are specific to security
incidents.
27.5.1. Backing Up Your Filesystems
Your filesystem backups are probably the
single most important part of your recovery plan. Before you do
anything else (including writing your response plan), make sure that
your site's backup plan is a solid one and that it works.
Don't assume that it's OK just because you haven't
had a problem yet. It is entirely possible to go for months without
noticing that you have no backups at all, and it may take you years
to notice that they're only partially broken. Unfortunately,
when you do notice, it's often when you need the backups most,
and the outcome is likely to be disastrous.
Backups are vital for two reasons:
- If your site suffers serious damage and you have to restore your
systems from scratch, you will need these backups.
- If you aren't sure of the extent of the damage, backups will
help you to determine what changes were made to a system and when.
Every organization needs a backup plan and not just for security
reasons. If you don't have one, that's probably a sign
that your current backup system is not OK. When
you are doing incident response planning, however, pay special
attention to your backup plan.
For your security-critical systems (e.g., bastion hosts and servers),
you might want to consider keeping your monthly or weekly backups
indefinitely, rather than recycling them as you would your regular
systems. If an incident does occur, you can use this archive of
backup tapes to recover a "snapshot" of the system as of
any of the dates of the backups. Snapshots of this kind can be
helpful in investigating security incidents. For example, if you find
that a program has been modified, going back through the snapshots
will tell you approximately when the modification took place. That
may tell you when the break-in occurred; if the modification happened
before the break-in, it may tell you that it was an accident and not
part of the incident at all.
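Going back through the snapshots can be partly automated. The following is a minimal sketch in Python, not a description of any particular backup product. It assumes that each archived backup has already been restored into its own directory, and that you supply the list of (label, directory) pairs sorted oldest first. The function names and the choice of SHA-256 as the digest are illustrative assumptions.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> Optional[str]:
    """Return the SHA-256 hex digest of a file, or None if it does not exist."""
    if not path.is_file():
        return None
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_changed_snapshot(snapshot_dirs, relative_path):
    """Walk snapshots oldest-first and report where the file first changes.

    snapshot_dirs: list of (label, directory) pairs, sorted oldest first
    (hypothetical layout: one restored snapshot per directory).
    Returns the label of the first snapshot whose copy differs from the
    oldest snapshot's copy, or None if every copy is identical.
    """
    if not snapshot_dirs:
        return None
    _, oldest_dir = snapshot_dirs[0]
    baseline = sha256_of(Path(oldest_dir) / relative_path)
    for label, directory in snapshot_dirs[1:]:
        if sha256_of(Path(directory) / relative_path) != baseline:
            return label
    return None
```

If the function returns a label, the modification took place somewhere between that snapshot and the one before it. A file that is missing from a snapshot counts as different from the baseline copy.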
If you're not sure whether or not you should be worried, try
testing your backup system. Play around and see what you can restore.
Ask these questions:
- Can you restore files from all of your tapes?
- Can you do a restore of an entire filesystem?
- If you pick a specific file, can you figure out how to restore it?
- If you have a corrupt file and want a version from before it was
corrupted, can you do that?
- If all of your disks died (or were trashed by an attacker)
simultaneously, would you be able to rebuild your computer facility?
Even the best backup system won't work if the backup images
aren't safeguarded. Don't rely on online backups, and keep
your media in a secure place separate from the data they're
backing up.
TIP:
The design of backup systems is outside the scope of this book. This
description, along with the description in Chapter 26, "Maintaining Firewalls", provides only a summary. If you're
uncertain about your backup system, you'll want to look at a
general system administration reference. See Appendix A, "Resources" for complete information on additional
resources.
27.5.2. Labeling and Diagramming Your System
As organizations grow, they acquire
hardware; they configure networking in different ways; and they add
or change equipment of various kinds. Usually only one or two people
really know what a site's systems look like in any detail.
Information about system configuration
may be crucial to investigating and controlling a security incident.
While you may know exactly how everything works and fits together at
your site, you may not be the person who has to respond to the
incident. What if you're on vacation? Think about what your
managers or coworkers would need to know about each system in order
to respond effectively to an incident involving that system.
Labels and diagrams are crucial in an emergency. System labels should
indicate what a system is, what it does, what its physical
configuration is (how much disk space, how much memory, etc.), and
who is responsible for it. They should be attached firmly to the
correct systems and easily legible. Use large type sizes and put at
least minimal labels on the back as well as the front (the front of a
machine may have more flat space, but you're probably going to
be looking at it from behind when you're trying to work on it).
Network diagrams should show how the various systems are connected,
both physically and logically, as well as things like what kind of
packet filtering is done where.
Be sure that labels are kept up to date as you move systems around;
wrong labels are worse than no labels at all. It's particularly
important to label racked equipment and equipment with widely
scattered pieces. There's nothing more frustrating than turning
off all the equipment in a rack, only to discover that some of it was
actually part of the computer in the next rack over, which you meant
to leave running.
Information that's easily available when machines are working
normally may be impossible to find if machines are not working. For
example, you'll need disk partition tables written down in
order to reformat and reinstall disks, and you may need a printed
copy of the host table in order to configure machines as
they're brought back up.
27.5.3. Keeping Secured Checksums
Once
you've had a break-in, you need to know what's been
changed on your systems. The standard tools that come with your
operating system won't tell you; intruders can fake
modification dates and match the trivial checksums most operating
systems provide. You will need to install a cryptographic
checksumming program (these are discussed in
Chapter 10, "Bastion Hosts"), make checksums of important files, and store
them where an intruder can't modify them (which generally means
somewhere offline). You may not need to checksum every system
separately if they're all running the same release of the same
operating system, although you should make sure that the checksum
program is available on all your systems.
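To illustrate what making checksums of important files involves, here is a minimal sketch in Python. It is not the checksumming program discussed in Chapter 10. It assumes SHA-256 as the cryptographic checksum and a JSON file as the storage format, and the function names are hypothetical. Copying the resulting file to offline media is a manual step the code does not attempt.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of one file (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each path (as a string) to the digest of its current contents."""
    return {str(p): file_digest(p) for p in paths}


def save_manifest(manifest, destination):
    """Write the manifest as sorted JSON so that two manifests diff cleanly."""
    Path(destination).write_text(
        json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8"
    )
```

A manifest kept on the same machine it describes is only as trustworthy as that machine, which is why the saved file belongs somewhere an intruder can't modify it.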
27.5.4. Keeping Activity Logs
An activity log is a
record of any changes that have been made to a system, both before an
incident and during the response to an incident. Normally,
you'll use an activity log to list programs you've
installed, configuration files you've modified, or peripherals
you've added. During an incident, you'll be doing a lot
more logging.
What is the purpose of an activity log? A log allows you to redo the
changes if you have to rebuild the system. It also lets you determine
whether any of the changes affect the incident or the response.
Without a log, you may find mystery programs; you don't know
where they came from and what they were supposed to do, so you
can't tell whether or not the intruder installed them, if they
still work the way they're supposed to, or how to rebuild them.
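An electronic routine log does not need to be elaborate. The sketch below, in Python, shows one possible shape for an entry: a timestamp, the host, the person making the change, and what was done. The field order, the function names, and the plain-text log file are all assumptions made for illustration, not a prescribed format.

```python
from datetime import datetime, timezone


def format_log_entry(who, host, action, when=None):
    """Build one log line: ISO timestamp, host, person, and the change made."""
    when = when or datetime.now(timezone.utc)
    return f"{when.isoformat(timespec='seconds')} {host} {who}: {action}"


def append_log_entry(log_path, who, host, action, when=None):
    """Append one entry to a plain-text log file (hypothetical location)."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(format_log_entry(who, host, action, when) + "\n")
```

Opening the file in append mode means existing entries are never rewritten by this code, which suits a record that is meant to be read back in order when you rebuild a system.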
Figure 27-4 shows a sampling of routine log entries
and incident log entries.
Figure 27-4. Activity logs
There are a variety of easy ways to keep activity logs, both
electronic and manual; email, notebooks, and tape recorders can all
be used. Some are better for routine logs (those that record your
activities before an incident occurs). Others may be more appropriate
for incident logs (those that keep track of your activities during an
incident).
Email to an appropriate staff alias that also keeps a record of all
messages is probably the simplest approach to keeping an activity
log. Not only will email keep a permanent record of system changes,
but it has the side benefit of letting everybody else know
what's going on as the changes are made. The email approach is
good for routine logs, whereas manual methods are likely to work more
reliably during an incident. During an actual security incident, your
email system may be down, so any messages generated during the
response may be lost. You may also be unable to reach existing online
logs during an incident, so keep a printed copy of these email
messages up to date in a binder somewhere.
Notebooks make a good incident log, but people must be disciplined
enough to use them. For routine logs, notebooks may not be convenient
because they may not be physically accessible when people actually
make changes to the system. Some sites use a combination of
electronic and paper logs for routine logs, with a paper logbook kept
in the machine room for notes. This works as long as it's clear
which things should be logged where; having two sets of logs to keep
track of can be confusing.
Pocket tape recorders make good incident logs, although they require
that somebody transcribe them later on. They're not reasonable
for routine logging.
27.5.5. Keeping a Cache of Tools and Supplies
Well before a security
incident, collect the tools and supplies that you are likely to need
during that incident. You don't want to be running around,
begging and borrowing, when the clock is ticking.
Here are some of the things you'll need in order to respond
appropriately to an incident. (Actually, you ought to have these
things around at all times; they come in handy in all sorts of
disasters.)
- Blank backup tapes and possibly spare disks as well.
- Basic tools; you'll need them if you disconnect your system
from the external network, or if you need to rewire the internal
network to disconnect compromised hosts. Make sure you have a ladder
if your site uses in-ceiling cabling or tall equipment racks.
- Spare networking equipment -- at least cables.
Set aside basic supplies (e.g., a full backup's worth of media,
networking cables, the most critical tools, notebooks or tape
recorders for incident logs) in a cache to be used only in case of
disaster. This cache should be separate from your normal stock of
spare parts and tools.
27.5.6. Testing the Reload of the Operating System
If a serious security
incident occurs, you may need to restore your system from backups. In
this case, you will need to load a minimal operating system before
you can load the backups. Are you equipped to do this?
Make sure that you:
- Understand your system's operating system installation
procedures
- Understand the procedures for restoring from backups
- Have all the materials (distribution media, manuals, etc.) available
to restore the system
- Test your reload plans and procedures before you really need them
Testing your ability to reload the operating system is a good idea,
and too few organizations ever do it. You can learn a lot by doing
this. The moment when you're trying to reload a dead system is not a good
time to discover that you've got a bad copy of the distribution
media. It's also not a good time to discover that the people
who have to do the reload can't figure out how to do it. The
best way to test is to designate the least experienced people who
might have to do the work, and let them try out the reload well ahead
of time.
Most organizations find that the first time they try to reinstall the
operating system and restore on a completely blank disk, the
operation fails. This can happen for a number of reasons, although
the usual reason is a failure in the design of the backup system. One
site found that people were doing their backups with a program that
wasn't distributed with the operating system, so they
couldn't restore from a fresh operating system installation.
(After that, they made a tape of the restore program using the
standard operating system tools; they could then load the standard
operating system, recover their custom restore program, and reload
their data from backups.)
27.5.7. Doing Drills
Don't assume that
responding to a security incident will come naturally. Like
everything else, such a response benefits from practice. Test your
own organization's ability to respond to an incident by running
occasional drills.
There are two basic types of drills:
- In a paper (or "tabletop") drill, you gather all the
relevant people in a conference room (or over pizza at your local
hangout), outline a hypothetical problem, and work through the
consequences and recovery procedures. It's important to go
through all the details, step by step, to expose any missing pieces
or misunderstandings.
- In a live drill, you actually carry out a response and recovery
procedure. A live drill can be performed, with appropriate notice to
users, during scheduled system downtimes.
You might also test only parts of your response. For example, before
configuring a new machine, use it to test your recovery procedures by
recovering an existing machine onto it. If you have down time
scheduled for your facility, you may be able to use it to test what
happens when you disconnect from the network. Run your checksum
comparison program before and after you install changes to the
operating system to see what changes it catches when you think
everything's the same, and what it does about the things you
know have changed. Coordinate with another site to see what messages
are logged when various types of attacks occur (pick someone you know
and trust and who'll reliably tell you exactly what they did,
or do it yourself). Try taking down all of your central machines at
the same time and see whether they'll all come back up in this
situation. (Do this when you have a few hours to spare; if it
doesn't work, it often takes a while to figure out how to coax
the machines past their interdependencies.)
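The before-and-after checksum comparison mentioned above comes down to comparing two sets of results. As a rough sketch in Python, assume each run of your checksum program has been reduced to a dictionary that maps a file path to its checksum. That representation and the function name are assumptions, and real checksum programs produce their own report formats.

```python
def compare_manifests(before, after):
    """Compare two {path: checksum} dictionaries taken at different times.

    Returns the paths that were added, removed, or changed between the
    two runs, each as a sorted list.
    """
    before_paths = set(before)
    after_paths = set(after)
    return {
        "added": sorted(after_paths - before_paths),
        "removed": sorted(before_paths - after_paths),
        "changed": sorted(
            path
            for path in before_paths & after_paths
            if before[path] != after[path]
        ),
    }
```

In a drill, the useful part is comparing this result with your expectations: files you know you changed should appear in it, and anything else that appears deserves an explanation.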
This is all a lot of trouble, but a certain amount of perverse
amusement can be had by playing around with fictitious disasters, and
it's much less stressful than having to improvise in a real
disaster.