One way to improve performance is to use a different data structure. For example, you can encode the state of a cell's neighbours using 8 bits (one bit for each neighbour).
You can have a look-up table with 256 elements, which decides what to do for any combination of neighbour-bits: that's a single table look-up.
If a cell-state changes (which happens very rarely) update the corresponding bits in its neighbours' neighbour-bits.
Your universe is something like:
boolean state;
byte neighbours;
Your recalculateUniverseState method is something like:
foreach (cell in universe)
{
boolean newState = (state) ? liveState[neighbours] : deadState[neighbours];
if (newState != state)
changedCells.add(cell);
}
foreach (cell in changedCells)
{
// change the state of this cell
// update the bits in this cell's neighbours
}
This is just an example from memory; there's a more efficient way to do it: Abrash's Zen of Code Optimization describes an structure which encode the cell's state and its neighbour's states in 8 bits (relying on the fact that at least one bit is redundent because it's encoded in a next-door neighbour).
There are algorithmic speedups suggested on Wikipedia and probably on this site too: for example, remember which cells or areas of the board aren't changing and don't recalculate those.