PCI power management

Jeff Garzik (jgarzik@mandrakesoft.com)
Thu, 19 Apr 2001 04:25:44 -0400



This was originally a private reply to Patrick Mochel, but the e-mail
kept getting longer and longer :)

The current state of PCI PM is this:

pci_enable_device (1) enables IO and mem decoding, (2) assigns/routes
the PCI IRQ, and (3) brings the device to D0 using pci_set_power_state.
Linus believes the power state transition should occur before (1) and
(2), and I agree.

pci_set_power_state brings a device to a new D state. If the D state
transition is D3->D0, then we (1) save key PCI config registers, (2) go
to D0, and (3) restore saved PCI config registers. This originally
comes from Donald Becker's acpi_wake function, which is used only for
the case of device enabling (where he had no problems), not for the case
of returning-from-suspend (where we see problems).
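
To make the save/transition/restore sequence concrete, here is a minimal userspace sketch. It is not kernel code. The fake_dev structure, the rd16/wr16 helpers, fake_set_state and wake_from_d3 are all hypothetical names, and the "hardware" is an assumed model in which leaving D3 wipes the BARs and the command register. Only the BARs and the command register are saved here, which is a simplification of the register set listed in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a device's 256-byte PCI config space. */
struct fake_dev {
	uint8_t cfg[256];
	int pm;			/* offset of the PM capability, 0 if absent */
};

#define CFG_COMMAND	0x04	/* PCI_COMMAND */
#define CFG_BAR0	0x10	/* PCI_BASE_ADDRESS_0 */
#define PM_CTRL		4	/* PCI_PM_CTRL, relative to the capability */
#define PM_STATE_MASK	0x0003	/* PCI_PM_CTRL_STATE_MASK */

uint16_t rd16(struct fake_dev *d, int off)
{
	return (uint16_t)(d->cfg[off] | (d->cfg[off + 1] << 8));
}

void wr16(struct fake_dev *d, int off, uint16_t v)
{
	d->cfg[off] = (uint8_t)(v & 0xff);
	d->cfg[off + 1] = (uint8_t)(v >> 8);
}

/*
 * Assumed hardware behaviour, for this sketch only: a D3 -> D0
 * transition makes the device forget its BARs and command register.
 */
void fake_set_state(struct fake_dev *d, int state)
{
	uint16_t ctrl = rd16(d, d->pm + PM_CTRL);

	if ((ctrl & PM_STATE_MASK) == 3 && state == 0) {
		memset(&d->cfg[CFG_BAR0], 0, 6 * 4);
		wr16(d, CFG_COMMAND, 0);
	}
	wr16(d, d->pm + PM_CTRL,
	     (uint16_t)((ctrl & ~PM_STATE_MASK) | state));
}

/* (1) save, (2) go to D0, (3) restore.  Returns the old D state. */
int wake_from_d3(struct fake_dev *d)
{
	uint8_t bars[6 * 4];
	uint16_t cmd;
	int old;

	if (!d->pm)
		return 0;	/* no PM capability: nothing to do */

	old = rd16(d, d->pm + PM_CTRL) & PM_STATE_MASK;
	if (old != 3) {
		fake_set_state(d, 0);
		return old;
	}

	cmd = rd16(d, CFG_COMMAND);
	memcpy(bars, &d->cfg[CFG_BAR0], sizeof(bars));
	fake_set_state(d, 0);
	memcpy(&d->cfg[CFG_BAR0], bars, sizeof(bars));
	wr16(d, CFG_COMMAND, cmd);
	return old;
}
```

In this model, the values read in step (1) are whatever the device holds while still in D3. If they are already lost by then, as may happen on return from suspend, step (3) restores garbage.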

"apm -s" causes the apm driver to map all suspends to the ACPI D3
state. An apm suspend triggers a pm_send_all call, which in turn
triggers pci_pm_suspend. This code [from Linus iirc] walks the root
buses, recursively suspending downstream buses and then attached
devices. The resume code does the exact opposite. The PCI core
suspend/resume code has this comment, and we note the current
requirement that -all- drivers should export suspend/resume somehow, in
order for a sane PM system to work here.

> * We do not touch devices that don't have a driver that exports
> * a suspend/resume function. That is just too dangerous. If the default
> * PCI suspend/resume functions work for a device, the driver can
> * easily implement them (ie just have a suspend function that calls
> * the pci_set_power_state() function).
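
The bus walk described above can be sketched in userspace as follows. This is a toy model, not the PCI core code. The toy_bus and toy_dev structures, the visit log and the has_pm_hooks flag are hypothetical, and the sibling ordering is an assumption. It shows only the ordering (downstream buses before devices on suspend, the reverse on resume) and the rule that devices without driver hooks are left untouched.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_KIDS 4
#define MAX_LOG  16

/* Hypothetical, simplified model of a device on a bus. */
struct toy_dev {
	const char *name;
	int has_pm_hooks;	/* does the driver export suspend/resume? */
};

/* A bus holds NULL-terminated lists of child buses and devices. */
struct toy_bus {
	struct toy_bus *children[MAX_KIDS];
	struct toy_dev *devices[MAX_KIDS];
};

/* Order in which devices were touched, for inspection. */
const char *visit_log[MAX_LOG];
int visit_count;

void toy_touch(struct toy_dev *dev)
{
	/* Devices whose driver exports no hooks are left alone. */
	if (dev->has_pm_hooks && visit_count < MAX_LOG)
		visit_log[visit_count++] = dev->name;
}

/* Suspend: downstream buses first, then the devices on this bus. */
void toy_suspend_bus(struct toy_bus *bus)
{
	int i;

	for (i = 0; i < MAX_KIDS && bus->children[i]; i++)
		toy_suspend_bus(bus->children[i]);
	for (i = 0; i < MAX_KIDS && bus->devices[i]; i++)
		toy_touch(bus->devices[i]);
}

/* Resume: devices on this bus first, then downstream buses. */
void toy_resume_bus(struct toy_bus *bus)
{
	int i;

	for (i = 0; i < MAX_KIDS && bus->devices[i]; i++)
		toy_touch(bus->devices[i]);
	for (i = 0; i < MAX_KIDS && bus->children[i]; i++)
		toy_resume_bus(bus->children[i]);
}
```

With a NIC on the root bus and an audio device behind a bridge, suspend touches the audio device first and resume touches the NIC first. A device with no hooks never appears in the log.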

It is up to the drivers to implement ::suspend() and ::resume(), and few
do. Of the few that do, even fewer work well in practice.

That's the current state of things. I do not think the system -- at the
PCI core level -- is poorly designed. I think it just takes a lot of
grunt work with drivers at this point, plus maybe a few new pci helper
functions.

So here's a random list of notes and issues on Linux PCI PM.

1) pci_enable_device needs to power up the device before enabling it.

2) AFAICT, it is safe to turn off a PCI device's bus-mastering bit and
take the device to D3, if it exports the PCI PM capability. My
previously-submitted pci_disable_function function turns off the
bus-mastering bit, and should probably take the device to D3 too.

3) The current pci_set_power_state implementation is non-spec, and even
though it works for some cases it does not appear to be the right thing
to do.

4) Because of #2, I have created pci_power_on and pci_power_off.
pci_power_off saves ALL the PCI config registers, turns off
busmastering, and goes to D3. pci_power_on takes the device to D0, then
blasts the stored PCI config register data back to the hardware.

5) In testing, this works sometimes, but other times it causes the
upstream bridge of the device being resumed to stop decoding the device.

6) One solution to #5 is to save and restore the PCI bridge registers
too. This comes partially from a Linus suggestion, and partially from
an end user who solved their eepro100 suspend/resume problems with a
setpci command to their PCI bridge (not to the eepro100 device). In my
own testing this solution works 100%, but (a) it might not be right, and
thus (b) it might cause problems. I am -very- interested in feedback on
this solution, or a better one.

7) Due to #5 an open issue is to re-read the bridge and PCI PM specs.
Some portions of the spec imply that the bridge should never be touched
during device suspend or resume :)

8) Who can predict what a laptop's AML tables want to do with the PCI
bus, whether Linux will be interfering with ACPI suspend, or whether
ACPI will be interfering with Linux resume, etc.?

9) A truly green driver should register itself then disable its
hardware. It is wasting power otherwise. That implies waking up
hardware on dev->open and sleeping on dev->release. Some net drivers do
this already. This further implies problems down the road with stuff
like char drivers, where applications often open and close the device
node very rapidly. This happens in OSS audio land when some audio apps
start up, for example. Maybe an inactivity timer would work here, to
power down the device after time passes with no open(2) calls.

10) We might wind up needing northbridge, southbridge, and/or PCI bridge
drivers. They will likely be small, but I think eventually they will
need to exist in order to provide complete power management coverage.

11) Hard drives. Our IDE and SCSI subsystems stink when it comes to
working with the PCI PM framework. Andre has spoken of plans to use
pci_driver in 2.5, and turn the IDE subsystem "inside out" so that PCI
drivers call out to registration functions, etc., instead of the current
system. The same thing needs to happen for SCSI.

12) Continuing #11, there needs to be a general notion of when the
system should -not- write stuff to disk. This is mainly a userspace
issue, ie. low-priority syslog messages should not prevent the system
from idling the hard drive and spinning it down. BUT.. the kernel may
need to be the central arbiter if only to have a single place which says
"hard drive is idle now"...
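
The inactivity timer floated in item 9 could look roughly like the sketch below. It is a userspace illustration with an explicit clock value passed in. The idle_pm structure and the three functions are hypothetical and do not correspond to any existing kernel interface. The hardware is powered up on the first open, kept powered across rapid open/close cycles, and powered down only after a full timeout with no opener.

```c
#include <assert.h>

/* Hypothetical per-device bookkeeping; not an existing kernel structure. */
struct idle_pm {
	int open_count;		/* current number of openers */
	int powered;		/* 1 while the hardware is in D0 */
	long last_close;	/* time of the most recent final release */
	long idle_timeout;	/* grace period before powering down */
};

/* open: wake the hardware only if it is actually asleep. */
void idle_pm_open(struct idle_pm *pm)
{
	if (!pm->powered)
		pm->powered = 1;	/* stands in for D3 -> D0 */
	pm->open_count++;
}

/* release: do not power down yet, just note when we became idle. */
void idle_pm_release(struct idle_pm *pm, long now)
{
	if (pm->open_count > 0)
		pm->open_count--;
	if (pm->open_count == 0)
		pm->last_close = now;
}

/* Periodic timer callback: power down after a full idle period. */
void idle_pm_tick(struct idle_pm *pm, long now)
{
	if (pm->powered && pm->open_count == 0 &&
	    now - pm->last_close >= pm->idle_timeout)
		pm->powered = 0;	/* stands in for D0 -> D3 */
}
```

An audio app that opens and closes the device node several times within the timeout never triggers a power cycle. The cost is that the device stays in D0 for up to idle_timeout after the last close.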

I have attached the pci_power_{on,off} implementation to this message.
Note that the current checked-in implementation does not suspend/resume
bridges; I only did that in local versions of the test laptop kernels...

-- 
Jeff Garzik       | "The universe is like a safe to which there is a
Building 1024     |  combination -- but the combination is locked up
MandrakeSoft      |  in the safe."    -- Peter DeVries

Index: drivers/pci/pci.c
===================================================================
RCS file: /cvsroot/gkernel/linux_2_4/drivers/pci/pci.c,v
retrieving revision 1.1.1.32
retrieving revision 1.1.1.32.2.1
diff -u -r1.1.1.32 -r1.1.1.32.2.1
--- drivers/pci/pci.c	2001/04/18 01:19:31	1.1.1.32
+++ drivers/pci/pci.c	2001/04/18 03:39:02	1.1.1.32.2.1
@@ -228,49 +228,157 @@
 }
 
 /**
- * pci_set_power_state - Set power management state of a device.
- * @dev: PCI device for which PM is set
- * @new_state: new power management statement (0 == D0, 3 == D3, etc.)
+ * pci_power_on - Wake up a PCI device
+ * @dev: PCI device to which power is to be applied
  *
- * Set power management state of a device. For transitions from state D3
- * it isn't as straightforward as one could assume since many devices forget
- * their configuration space during wakeup. Returns old power state.
+ * Bring the given PCI device @dev up to full power,
+ * using standard PCI PM techniques. Any saved context
+ * is restored after device power-up.
+ *
+ * RETURN VALUE: Zero is returned upon successful completion
+ * of the wake-up operation.
  */
+
 int
-pci_set_power_state(struct pci_dev *dev, int new_state)
+pci_power_on(struct pci_dev *dev)
 {
-	u32 base[5], romaddr;
-	u16 pci_command, pwr_command;
-	u8 pci_latency, pci_cacheline;
-	int i, old_state;
-	int pm = pci_find_capability(dev, PCI_CAP_ID_PM);
+	u16 pwr_command;
+	int pm_d_state, pm, i;
+
+	/* find PCI PM capability in list */
+	pm = pci_find_capability(dev, PCI_CAP_ID_PM);
+	if (!pm) return 0; /* assume no PM == poweron success */
 
-	if (!pm)
-		return 0;
+	/* make sure we aren't already in D0 state */
 	pci_read_config_word(dev, pm + PCI_PM_CTRL, &pwr_command);
-	old_state = pwr_command & PCI_PM_CTRL_STATE_MASK;
-	if (old_state == new_state)
-		return old_state;
-	DBG("PCI: %s goes from D%d to D%d\n", dev->slot_name, old_state, new_state);
-	if (old_state == 3) {
-		pci_read_config_word(dev, PCI_COMMAND, &pci_command);
-		pci_write_config_word(dev, PCI_COMMAND, pci_command & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY));
-		for (i = 0; i < 5; i++)
-			pci_read_config_dword(dev, PCI_BASE_ADDRESS_0 + i*4, &base[i]);
-		pci_read_config_dword(dev, PCI_ROM_ADDRESS, &romaddr);
-		pci_read_config_byte(dev, PCI_LATENCY_TIMER, &pci_latency);
-		pci_read_config_byte(dev, PCI_CACHE_LINE_SIZE, &pci_cacheline);
-		pci_write_config_word(dev, pm + PCI_PM_CTRL, new_state);
-		for (i = 0; i < 5; i++)
-			pci_write_config_dword(dev, PCI_BASE_ADDRESS_0 + i*4, base[i]);
-		pci_write_config_dword(dev, PCI_ROM_ADDRESS, romaddr);
+	pm_d_state = pwr_command & PCI_PM_CTRL_STATE_MASK;
+	if (pm_d_state == 0) return 0;
+
+	/* go to D0 */
+	/* XXX: should we enable function's ability to assert
+	 * PME# here (bit 8) too?
+	 */
+	pci_write_config_word(dev, pm + PCI_PM_CTRL, 0);
+
+	/*
+	 * restore context, if saved
+	 */
+	if (dev->saved_context) {
+		/* XXX: 100% dword access ok here? */
+		for (i = 0; i < dev->saved_context->n_dwords; i++)
+			pci_write_config_dword(dev, i * 4,
+				dev->saved_context->cfg_hdr[i]);
+
+		kfree(dev->saved_context);
+		dev->saved_context = NULL;
+	}
+
+	/*
+	 * otherwise, write the context information we know from bootup.
+	 * This works around a problem where warm-booting from Windows
+	 * combined with a D3(hot)->D0 transition causes PCI config
+	 * header data to be forgotten.
+	 */
+	else {
+		for (i = 0; i < 6; i ++)
+			pci_write_config_dword(dev,
+				PCI_BASE_ADDRESS_0 + (i * 4),
+				dev->resource[i].start);
 		pci_write_config_byte(dev, PCI_INTERRUPT_LINE, dev->irq);
-		pci_write_config_byte(dev, PCI_CACHE_LINE_SIZE, pci_cacheline);
-		pci_write_config_byte(dev, PCI_LATENCY_TIMER, pci_latency);
-		pci_write_config_word(dev, PCI_COMMAND, pci_command);
-	} else
-		pci_write_config_word(dev, pm + PCI_PM_CTRL, (pwr_command & ~PCI_PM_CTRL_STATE_MASK) | new_state);
-	return old_state;
+	}
+
+	return 0;
+}
+
+/**
+ * pci_power_off - Suspend a PCI device
+ * @dev: PCI device to be suspended
+ * @context_size: Number of PCI config bytes to save
+ *
+ * Remove power from a PCI device, saving PCI context
+ * before fully transitioning to the D3 state.
+ *
+ * The @context_size argument can be -1, which indicates
+ * that only the standard PCI 2.2 configuration header
+ * is to be saved. @context_size can be zero, which indicates
+ * no context is to be saved. Or, @context_size can be a
+ * specific length, indicating the number of bytes to be saved
+ * before poweroff. @context_size is always rounded up to the nearest
+ * dword boundary.
+ *
+ * RETURN VALUE: If the PCI device
+ * does not support PCI PM, %EIO is returned. If memory
+ * is not available to store the PCI context requested,
+ * %ENOMEM is returned. Otherwise, zero (success) is returned.
+ */
+
+int
+pci_power_off(struct pci_dev *dev, int context_size)
+{
+	u16 pwr_command, tmp, newtmp;
+	int pm_d_state, pm, i;
+	void *mem;
+
+	/* find PCI PM capability in list */
+	pm = pci_find_capability(dev, PCI_CAP_ID_PM);
+	if (!pm) return -EIO; /* this device cannot poweroff */
+
+	/* make sure we aren't already in D3 state */
+	/* XXX: reliable/superfluous test? */
+	pci_read_config_word(dev, pm + PCI_PM_CTRL, &pwr_command);
+	pm_d_state = pwr_command & PCI_PM_CTRL_STATE_MASK;
+	if (pm_d_state == 3) return 0;
+
+	/* programmer error... */
+	if (dev->saved_context)
+		BUG();
+
+	/*
+	 * save context
+	 */
+	if (context_size == -1) /* save only standard PCI config header */
+		context_size = 15 * sizeof(u32);
+	if (context_size > 0) {
+		/* convert bytes to dwords, with rounding */
+		if (context_size % 4 == 0)
+			context_size >>= 2;
+		else
+			context_size = (context_size >> 2) + 1;
+
+		mem = kmalloc(sizeof(struct pci_dev_context) +
+			(context_size * sizeof(u32)), GFP_KERNEL);
+		if (!mem)
+			return -ENOMEM;
+		dev->saved_context = mem;
+		dev->saved_context->n_dwords = context_size;
+		dev->saved_context->cfg_hdr = mem + sizeof(struct pci_dev_context);
+
+		/* XXX: 100% dword access ok here? */
+		for (i = 0; i < dev->saved_context->n_dwords; i++)
+			pci_read_config_dword(dev, i * 4,
+				&dev->saved_context->cfg_hdr[i]);
+	}
+
+	/* _PCI System Arch._ sez "disable device's ability to act as
+	 * a master and a target." Interpreted as clearing the
+	 * master, MEM decode and IO decode bits
+	 */
+	pci_read_config_word(dev, PCI_COMMAND, &tmp);
+	newtmp = tmp & ~(PCI_COMMAND_IO|PCI_COMMAND_MEMORY|PCI_COMMAND_MASTER);
+	if (tmp != newtmp)
+		pci_write_config_word(dev, PCI_COMMAND, newtmp);
+
+	/* just for the sake of sanity and pessimism, pause for a bit,
+	 * then clear any status conditions. PCI status register
+	 * is nicely designed so we can clear it thusly..
+	 */
+	pci_read_config_word(dev, PCI_STATUS, &tmp);
+	pci_write_config_word(dev, PCI_STATUS, tmp);
+
+	/* go to D3 */
+	pci_write_config_word(dev, pm + PCI_PM_CTRL, 3);
+
+	return 0;
 }
 
 /**
@@ -285,10 +393,13 @@
 pci_enable_device(struct pci_dev *dev)
 {
 	int err;
+
+	err = pci_power_on(dev);
+	if (err) return err;
+
+	err = pcibios_enable_device(dev);
+	if (err < 0) return err;
 
-	if ((err = pcibios_enable_device(dev)) < 0)
-		return err;
-	pci_set_power_state(dev, 0);
 	return 0;
 }
 
@@ -1390,7 +1501,8 @@
 EXPORT_SYMBOL(pci_find_subsys);
 EXPORT_SYMBOL(pci_set_master);
 EXPORT_SYMBOL(pci_set_dma_mask);
-EXPORT_SYMBOL(pci_set_power_state);
+EXPORT_SYMBOL(pci_power_on);
+EXPORT_SYMBOL(pci_power_off);
 EXPORT_SYMBOL(pci_assign_resource);
 EXPORT_SYMBOL(pci_register_driver);
 EXPORT_SYMBOL(pci_unregister_driver);
Index: include/linux/pci.h
===================================================================
RCS file: /cvsroot/gkernel/linux_2_4/include/linux/pci.h,v
retrieving revision 1.1.1.39
retrieving revision 1.1.1.39.2.1
diff -u -r1.1.1.39 -r1.1.1.39.2.1
--- include/linux/pci.h	2001/04/18 01:11:14	1.1.1.39
+++ include/linux/pci.h	2001/04/18 03:44:33	1.1.1.39.2.1
@@ -308,6 +308,11 @@
 #define pci_for_each_dev_reverse(dev) \
 	for(dev = pci_dev_g(pci_devices.prev); dev != pci_dev_g(&pci_devices); dev = pci_dev_g(dev->global_list.prev))
 
+struct pci_dev_context {
+	int n_dwords;
+	u32 *cfg_hdr;
+};
+
 /*
  * The pci_dev structure is used to describe both PCI and ISAPnP devices.
  */
@@ -330,6 +335,11 @@
 	u8		rom_base_reg;	/* which config register controls the ROM */
 
 	struct pci_driver *driver;	/* which driver has allocated this device */
+
+	struct pci_dev_context *saved_context;
+					/* PCI config header, when suspended.
+					   NULL when active */
+
+
 	void		*driver_data;	/* data private to the driver */
 	dma_addr_t	dma_mask;	/* Mask of the bits of bus address this
 					   device implements. Normally this is
@@ -528,7 +538,8 @@
 int pci_enable_device(struct pci_dev *dev);
 void pci_set_master(struct pci_dev *dev);
 int pci_set_dma_mask(struct pci_dev *dev, dma_addr_t mask);
-int pci_set_power_state(struct pci_dev *dev, int state);
+int pci_power_on(struct pci_dev *dev);
+int pci_power_off(struct pci_dev *dev, int context_size);
 int pci_assign_resource(struct pci_dev *dev, int i);
 
 /* Helper functions for low-level code (drivers/pci/setup-[bus,res].c) */
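
For readers following the @context_size handling in pci_power_off, the bytes-to-dwords conversion can be isolated as a small pure function. This is a userspace sketch; context_dwords is a hypothetical helper name and is not part of the patch. It keeps the patch's value of 15 dwords (60 bytes) for the -1 case unchanged, and treats zero or any other negative value as "save nothing", which matches the patch's `context_size > 0` test.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Translate a context_size argument (in bytes) into the number of
 * dwords to save.  Returns 0 when no context is to be saved.
 */
int context_dwords(int context_size)
{
	if (context_size == -1)		/* "standard header", as in the patch */
		context_size = 15 * (int)sizeof(uint32_t);
	if (context_size <= 0)
		return 0;

	/* round up to a whole number of dwords */
	return (context_size + 3) / 4;
}
```

The `(n + 3) / 4` form gives the same result as the patch's two-branch `% 4` test for every positive byte count; it is only a more compact way to write the rounding.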
