CROSS REFERENCE TO RELATED APPLICATIONS
This application is a continuation of International Application
No. PCT/JP2005/005352, filed on March 24, 2005, now pending, herein incorporated
by reference.
TECHNICAL FIELD
The present invention relates to an information processing
device using a memory controller, and in particular relates to an information processing
device which monitors memory anomalies regardless of the mounted memory capacity.
BACKGROUND ART
With increases in the scale of systems in recent years,
there has been an increase in the capacity of mounted memory, and high reliability
is also sought. Prompt detection of the location of malfunctions in memory is essential
for maintaining high reliability of large amounts of memory. To this end, memory
diagnostics and monitoring are indispensable.
Fig. 1 explains memory monitoring in the prior art. An
operating system (hereafter "OS") is running in the CPU 3. The CPU 3 is connected
to memory units 2i to 2l.
In memory anomaly monitoring of the prior art, the CPU
monitors all memory areas in the memory units 2i to 2l in response to instructions
from the OS. In this case, the OS, via the CPU 3, performs read processing
of all areas in the mounted memory units 2i to 2l. An area from which reading is
not possible is diagnosed as an error area, and degradation processing to remove
the error area from the usable area is performed.
The OS itself holds information on areas which it has
degraded, and itself secures the continuity of logical addresses.
Further, the OS ascertains in advance the mounted memory capacity and hardware configuration.
In such a method in which the CPU monitors all memory areas
under instruction from the OS, the load occurring at the time of operation is excessive
in a large-scale system with a huge memory capacity. Moreover, there is the problem
that too much time is required for monitoring processing. In order to alleviate
the load on the CPU, memory monitoring in which hardware other than the CPU performs
reading of memory areas is conceivable. By having hardware other than the CPU read
memory areas and confirm the presence or absence of errors in the read-out data,
the load on the CPU can be alleviated.
Fig. 2 is an example of memory monitoring performed by
hardware other than the CPU. The OS is running in the CPU 3. Controllers C1
to C3, which are hardware performing control and monitoring of memory, are connected
to the CPU 3. Controller C1 is connected to memory units 2m and 2n, controller C2
is connected to memory units 2o and 2p, and controller C3 is connected to memory
units 2q and 2r.
The controllers C1 to C3 control access to connected memory
units according to requests from the OS during normal access, but during memory
monitoring perform data reading from memory units, and upon detecting an error change
specific bits in a register of the controller and notify the OS.
In this case also, the OS has ascertained in advance the
amount of memory mounted and the hardware configuration. Further, the OS itself
holds information on previously degraded areas, and itself secures the continuity
of logical addresses.
The technology described in Japanese Patent Laid-open No. 2000-57016
is a hardware monitoring system which alleviates the load on the CPU.
This technology suppresses frequent interruptions of applications due to errors
and reduces the load on the CPU by causing error processing to be performed by firmware.
However, the technology of Japanese Patent Laid-open No. 2000-57016
relates to hardware in general, and does not perform monitoring of memory.
As shown in Fig. 2, even when hardware other than the CPU
is used to perform memory anomaly monitoring, there is the possibility that memory
addresses may be changed from the addresses of the previous architecture due to
memory expansion. In order to accommodate memory expansion, conversion into logical
addresses corresponding to each architecture must be performed; if the OS is caused
to execute this conversion, however, not all architectures can be accommodated by
a common OS. Moreover, if measures are taken to accommodate changes in architecture
due to hardware, then the need arises to install additional hardware for each architecture,
resulting in cost increases and increases in development processes.
DISCLOSURE OF THE INVENTION
Hence an object of this invention is to provide an information
processing device capable of memory monitoring by means other than the OS and hardware,
which accommodates different architectures, without direct memory monitoring by
a CPU.
In order to resolve the above problems, a first aspect of
the present invention is an information processing device having:
a CPU which executes an OS and firmware; and a plurality of memory controllers which
are connected to the CPU, control writing to and reading from a plurality of memory
units, and perform error monitoring, wherein the plurality of memory units are each
connected to the plurality of memory controllers; the memory controllers sequentially
read memory areas of the plurality of memory units connected to the memory controllers,
and perform error area monitoring; and the firmware converts addresses recognized
by the memory controllers corresponding to the error areas into logical addresses
recognized by the OS, and supplies the logical addresses to the OS.
In a preferred embodiment of the above first aspect of
the present invention, the firmware judges whether the error areas detected by the
memory controllers are areas which have been detected to be error areas by previous
reading and have been excluded from usage areas, and resumes reading of the memory
areas if the area has been excluded.
In a further preferred embodiment of the above first aspect
of the invention, the firmware judges whether data in the error areas can be restored,
and the memory controller detecting the error area performs rewriting of the error
area if the data in the error area is restorable.
In a further preferred embodiment of the above first aspect
of the present invention, the plurality of memory controllers each perform monitoring
of memory errors independently.
A second aspect of the present
invention is a memory anomaly monitoring method in an information processing device
having a CPU which executes an OS and firmware, a plurality of memory controllers
which are connected to the CPU, control writing to and reading from a plurality
of memory units, and perform error monitoring, and a plurality of memory units each
connected to the plurality of memory controllers, the method having the steps of
sequentially reading memory areas in the plurality of memory units connected to the
memory controllers and performing error area monitoring, by the memory controllers;
and converting addresses recognized by the memory controllers corresponding to the
error areas into logical addresses recognized by the OS, and supplying the logical
addresses to the OS, by the firmware.
In a preferred embodiment of the second aspect of the present
invention, the method further has a degradation judgment step of judging, by the
firmware, whether the error area detected by the memory controller is an area which
has been detected as an error area by previous reading and has been excluded from
usable areas, and resuming memory area reading if the area was previously excluded.
In a preferred embodiment of the second aspect of the invention,
the method further has a restoration judgment step of judging, by the firmware, whether
data of the error areas can be restored, and performing rewriting of the error data,
by the memory controller which has detected the error area, if the data of
an error area can be restored.
By using firmware to modify logical addresses accompanying
changes in architecture, an information processing device of the present invention
can enable introduction of additional hardware without resulting in cost increases
or increases in development processes, and enables a common OS to be applied to
all architectures.
BRIEF DESCRIPTION OF THE DRAWINGS
- Fig. 1 explains memory monitoring in the prior art;
- Fig. 2 is an example of memory monitoring performed by hardware other than the
CPU in which memory areas are accessed;
- Fig. 3 shows the configuration of an information processing device of an aspect
of this invention;
- Fig. 4 shows the configuration of a memory controller and operation during normal
access;
- Fig. 5 shows the configuration of a memory controller and operation during memory
monitoring;
- Fig. 6 shows briefly the flow of operation for memory monitoring in an aspect
of the invention;
- Fig. 7 shows in detail the flow of operation for memory monitoring in an aspect
of the invention;
- Fig. 8 shows the flow of operation for memory monitoring halting in an aspect
of the invention; and,
- Fig. 9 shows the flow of operation for OS error monitoring during memory monitoring
in an aspect of the invention.
BEST MODE FOR CARRYING OUT THE INVENTION
Below, aspects of the invention are explained referring
to the drawings. However, the technical scope of the present invention is not limited
to these aspects, but extends to the inventions described in the scope of
claims, and to inventions equivalent thereto.
Fig. 3 shows the configuration of the information processing
device of an aspect of the invention. The information processing device of this
aspect has a CPU 3, which executes commands of the OS and firmware (indicated by
"Firm" in the figure). The CPU 3 is connected to a plurality of memory controllers
("MAC" in the figure) 1a to 1d via a system controller 4. The system controller
4 converts logical addresses received from the CPU 3 into memory controller addresses
used within the respective memory controllers 1a to 1d. The memory controllers 1a
to 1d are hardware which performs management of writing to and reading from, and
memory monitoring of, memory units 2a to 2h.
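The conversion performed by the system controller 4 can be pictured with a minimal
sketch. The layout below is an assumption made only for illustration: the source does
not state how logical addresses are divided among the memory controllers 1a to 1d, so
the equal-sized contiguous ranges, the size constant, and the name to_mac_address are
all hypothetical.

```python
from typing import NamedTuple


class MacAddress(NamedTuple):
    controller: str  # which memory controller serves the address, e.g. "1a"
    offset: int      # memory controller address used inside that controller


# Hypothetical layout: four controllers, each serving an equal contiguous range.
CONTROLLERS = ("1a", "1b", "1c", "1d")
BYTES_PER_CONTROLLER = 0x1000


def to_mac_address(logical: int) -> MacAddress:
    """Convert a logical address into a (controller, offset) pair."""
    index, offset = divmod(logical, BYTES_PER_CONTROLLER)
    if not 0 <= index < len(CONTROLLERS):
        raise ValueError(f"logical address {logical:#x} is not mapped")
    return MacAddress(CONTROLLERS[index], offset)
```

Under this assumed layout, logical address 0x1004 would be served by memory controller
1b at memory controller address 0x4.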
Fig. 4 shows the memory controller configuration and operation
during normal access. The memory controller 1 primarily has a memory monitoring
control portion 11, registers 12, error diagnosis portion 13, error correction portion
14, and memory management portion 15. During normal access, when the OS accesses
memory 2 via the CPU 3, first the logical address of the area to be accessed is
provided to the system controller 4 from the CPU 3. The system controller 4 receives
the logical address, and converts the address into a corresponding memory controller
address ("MAC address" in the figure) in the memory controller 1. On receiving the
supplied memory controller address, the memory management portion 15 in the memory
controller 1 accesses the data in the corresponding area in memory 2. The memory
2 supplies the data of the corresponding area to the error diagnosis portion 13
and error correction portion 14 in the memory controller 1.
When no errors exist in the data supplied from the memory
2, the data is output from the memory controller 1 and is received by the OS via
the CPU 3.
When an error exists in the data supplied from the memory
2, the error diagnosis portion 13 detects the error and judges whether the error
can be corrected. If the detected error cannot be corrected, the error correction
portion 14 appends information indicating that the data includes an uncorrectable
error, and transmits the data to the OS. At this time, the error diagnosis portion
13 records, in the registers 12, whether the error has been corrected, address information
for the error area, whether the error occurred during normal access or occurred
during memory diagnostics, and other information.
If the error in the supplied data can be corrected, the
error correction portion 14 outputs the corrected data from the memory controller
1, to supply the data to the OS via the CPU 3. At this time, the error diagnosis
portion 13 records, in the registers 12, whether the error has been corrected, address
information for the error area, whether the error occurred during normal access
or occurred during memory diagnostics, and other information.
During normal operation, the memory monitoring control
portion 11 is not used.
Fig. 5 shows the configuration of the memory controller
and operation during memory monitoring. The OS issues an instruction to the firmware,
via the CPU 3, to begin memory monitoring. The firmware writes to the registers
12 within the memory controller 1 via the CPU 3, and causes memory monitoring to
begin. Upon confirming writing to the registers 12 from the firmware, the memory
monitoring control portion 11 performs sequential reading of data from the memory
2. The memory 2 supplies data corresponding to memory controller addresses supplied
from the memory monitoring control portion 11 to the error diagnosis portion 13
and error correction portion 14 within the memory controller 1.
When no errors exist in the data supplied to the error
diagnosis portion 13, the error diagnosis portion 13 notifies the memory monitoring
control portion 11 of the fact that no errors exist. Upon receiving this information,
the memory monitoring control portion 11 accesses memory 2 to read from the next
area.
When an error exists in the data supplied to the error
diagnosis portion 13, the error diagnosis portion 13 judges whether the error can
be corrected. Then, the error diagnosis portion 13 notifies the memory monitoring
control portion 11 of the fact that an error exists, whether the error can be corrected,
address information for the error area, whether the error occurred during normal
access or during memory monitoring, and other information. Upon receiving this information,
the memory monitoring control portion 11 temporarily interrupts memory monitoring.
Then, the memory monitoring control portion 11 writes the information obtained from
the error diagnosis portion 13 to the registers 12.
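The sequential reading described above can be sketched as follows. This is a simplified
software model, not the hardware itself: memory 2 is represented by a hypothetical list
of per-area verdicts, and the returned record stands in for the information written to
the registers 12. The names ErrorRecord and scan are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ErrorRecord:
    address: int             # memory controller address of the error area
    correctable: bool        # verdict of the error diagnosis
    during_monitoring: bool  # False would mean the error arose in normal access


def scan(areas: List[str], start: int = 0) -> Optional[ErrorRecord]:
    """Read areas in order from `start` and stop at the first error.

    Each entry of `areas` is "ok", "correctable" or "uncorrectable" and
    plays the role of the verdict of the error diagnosis portion 13.
    """
    for address in range(start, len(areas)):
        state = areas[address]
        if state != "ok":
            # Monitoring is interrupted; the record stands in for the
            # information written to the registers 12.
            return ErrorRecord(address, state == "correctable", True)
    return None  # every area from `start` onward was read without error
```

Calling scan again with start set to the address after the reported error corresponds
to resuming monitoring from the next area.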
The memory controller 1 has registers 12 to exchange information
with the firmware and OS. There are three types of control registers which start
and stop monitoring and perform other control; these are the monitoring control
register RG1, restart address register RG2, and rewrite address register RG3.
The monitoring control register RG1 contains a monitoring
start bit B1, restart address bit B2, monitoring stop bit B3, monitoring state bit
B4, rewrite bit B5, rewrite reset bit B6, correctable error bit B7, uncorrectable
error bit B8, comparison error bit B9, and other bits.
Further, log registers which hold error information and
similar exist among the registers 12 within the memory controller 1. There are primarily
four types of log registers, which are an error address register RG4, error log
register RG5, permanent fault address register RG6, and permanent fault log register
RG7.
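The bits of the monitoring control register RG1 named above can be modelled as a flag
set. The bit names follow the text, but the bit positions are an assumption: the source
lists the bits B1 to B9 without giving their positions within the register, so the
ordering below is illustrative only.

```python
from enum import IntFlag


class MonitoringControl(IntFlag):
    """Bits of the monitoring control register RG1 (bit positions assumed)."""
    MONITORING_START = 1 << 0     # B1
    RESTART_ADDRESS = 1 << 1      # B2
    MONITORING_STOP = 1 << 2      # B3
    MONITORING_STATE = 1 << 3     # B4
    REWRITE = 1 << 4              # B5
    REWRITE_RESET = 1 << 5        # B6
    CORRECTABLE_ERROR = 1 << 6    # B7
    UNCORRECTABLE_ERROR = 1 << 7  # B8
    COMPARISON_ERROR = 1 << 8     # B9


ERROR_BITS = (MonitoringControl.CORRECTABLE_ERROR
              | MonitoringControl.UNCORRECTABLE_ERROR)


def has_error(rg1: int) -> bool:
    """Return True if B7 or B8 is set in a raw RG1 value."""
    return bool(rg1 & int(ERROR_BITS))
```

A firmware routine polling RG1 could use a helper such as has_error to decide whether
the log registers RG4 to RG7 need to be read.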
Fig. 6 shows briefly the flow of operation for memory monitoring
in an aspect of the invention. In this figure, steps are explained in a time series
from top to bottom; columns separated by dashed lines indicate steps performed by
the same hardware or the same software. Upon receiving an instruction from the OS,
the firmware writes to the registers 12 of all the memory controllers 1a to 1d,
issuing an instruction to start memory monitoring (step W1). On receiving this memory
monitoring start instruction, the memory controllers 1a to 1d start reading from
areas of memory 2 to which they are connected (steps W2a to W2d). When an error
is detected in memory controller 1b (step W3b), the memory monitoring control portion
11 writes, to a register 12 in the memory controller 1b, information indicating
whether the error can be corrected, address information for the error area, whether
the error occurred during normal access or during memory monitoring, and other information
(step W4b). The information written to the register 12 is accessed by the firmware,
degradation information is checked, and after a rewrite instruction and other error
processing has been performed (step W5b), memory monitoring is resumed (step W6b).
When an error is detected by another memory controller
also (step W3c), information is similarly written to the register 12 in the memory
controller 1c by the memory monitoring control portion 11 (step W4c), indicating
whether the error can be corrected, the error area address information, whether
the error occurred during normal access or during memory monitoring, and similar.
The information written to the register 12 is accessed by the firmware, degradation
information is checked, a rewrite instruction is issued, and other error processing
is performed (step W5c), after which memory monitoring is resumed (step W6c).
The OS accesses the registers 12 of all memory controllers
1a to 1d at fixed intervals to check whether errors have occurred (step W7). When
errors are confirmed to have occurred in memory controllers 1b and 1c, information
relating to these errors is requested of the firmware (step W8). Upon receiving
the request for information relating to the errors, the firmware accesses the memory
controllers 1b and 1c for which errors occurred, and provides the OS with information
relating to the errors (step W9). Upon receiving this information, the OS performs
processing to cause degradation and similar (step W10).
Here, a rare case in which two errors are simultaneously
detected in the same memory controller is explained. When, prior to accessing of
error information by the OS in step W7, another error is detected by the memory
controller 1b, the information written to the register 12 in step W4b is overwritten,
and the OS obtains only the information for the error which occurred later.
Fig. 7 shows the detailed flow of operation for memory
monitoring in an aspect of the invention. First, a decision is made by the OS to
start memory monitoring (step S1). At this time, the OS sends a memory monitoring
start instruction I1 to the firmware via the CPU. Upon receiving the memory monitoring
start instruction I1, the firmware sets the monitoring start bit B1 in the monitoring
control register RG1 within the memory controller 1 to "1" (step S2).
After deciding to start memory monitoring, the OS starts
checking the error state (step U1), and ends memory monitoring (step T1); this processing
is explained later using Fig. 8 and Fig. 9.
The memory controller 1, in response to the fact that the
monitoring start bit B1 of the monitoring control register RG1 has become "1", starts
memory monitoring (step S3). After being started, memory monitoring continues until
the OS sends a memory monitoring stop instruction I2 to the firmware; during this
interval, the memory controller 1 reads areas of memory (step S4), and when reading
of all areas has ended, waits for a fixed interval of time, and then begins reading
again (step S3).
At this time, the error diagnosis portion 13 within the
memory controller 1 checks whether errors have occurred in memory 2 (step S5), and
when an error occurs, the memory monitoring control portion 11 stops memory monitoring
(step S6). Then, the memory monitoring control portion 11 sets either the correctable
error bit B7 or the uncorrectable error bit B8 of the monitoring control register
RG1 to "1", according to the type of error (step S7). Error position information
is recorded in the error address register RG4, error log register RG5, and similar.
Next, in response to the fact that the correctable error
bit B7 or the uncorrectable error bit B8 of the monitoring control register RG1 has
become "1", the firmware performs degradation confirmation. Degradation means that the
error area in memory 2 is excluded from the usable area. The firmware judges whether
the area is an area which has already been degraded, based on information recorded
in the monitoring control register RG1 of the memory controller 1 (step S8).
When the error area is a degraded area, the firmware avoids
the area and resumes memory monitoring (step S9). At this time, the address information
for the area for resumption is set in the restart address register RG2 of the memory
controller 1, and the restart address bit B2 of the monitoring control register
RG1 is set to "1". In response to this register information, the memory monitoring
control portion 11 resumes memory monitoring.
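The degradation check of steps S8 and S9 can be sketched as below. The sketch assumes
that the firmware can consult a list of degraded address ranges; how that information
is actually held is not detailed in the source, so the DegradedRange type and the
function restart_address are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical record of a degraded area: (first address, last address), inclusive.
DegradedRange = Tuple[int, int]


def restart_address(error_address: int,
                    degraded: List[DegradedRange]) -> Optional[int]:
    """Return where to resume reading if the error lies in a degraded area.

    Returns None when the error area has not been degraded, in which case
    the flow would continue with the correctability check (step S10).
    """
    for first, last in degraded:
        if first <= error_address <= last:
            # Value that would be set in the restart address register RG2.
            return last + 1
    return None
```

When a value is returned, the firmware would set it in the restart address register RG2
and set the restart address bit B2, so that the degraded area is skipped.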
When the error area is not a degraded area, a check is
performed via the registers 12 as to whether the error can be corrected (step S10).
Here correctable errors are explained. In this aspect, ECC (Error Check and Correct)
memory is used to accomplish error detection. A correctable error is a soft error
which occurs irregularly due to changes in data stored in memory. The soft error
is a data error which does not occur because of a problem with the circuitry, and
which does not reoccur when the data is corrected using data correction functions.
Using error correction functions, error correction of a detected correctable error
is performed based on the correction code. The correction code is code generated
within the MAC when processing data between the MAC and memory.
If the error is a correctable error, then the data which
should have been written can be determined, and so the firmware issues an instruction
to the memory controller 1 to write this data one more time (step S11). At this
time, the address of the area for rewriting is set in the rewrite address register
RG3, and the rewrite bit B5 of the monitoring control register RG1 is set to "1". Upon
confirming writing to these registers, the memory monitoring control portion 11
within the memory controller 1 begins
rewriting of the data that should have been written. At this time, monitoring for
another error occurrence is performed (step S13), and if an error occurs, the error
is judged to be a permanent fault arising in the hardware (step S14), and the error
is recorded in the permanent fault address register RG6 or the permanent fault log
register RG7 (step S15). When no error occurs in step S13, the original error is judged
to have been a soft error, and this information is recorded in the error address
register RG4 and error log register RG5 (step S15). After recording in the registers
12 in step S15, the firmware issues an instruction to the memory controller 1 to resume
memory monitoring (step S16).
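The distinction between a soft error and a permanent fault can be illustrated with a
small model. The sketch assumes that recurrence of the error is observed by reading
back the rewritten area and comparing it with the corrected data; the source states
only that monitoring for another error is performed, so the read-back comparison, the
Cell class, and its stuck_at parameter are illustrative assumptions.

```python
from typing import Optional


class Cell:
    """Hypothetical one-word memory area; `stuck_at` models a hardware fault."""

    def __init__(self, value: int, stuck_at: Optional[int] = None) -> None:
        self.value = value
        self.stuck_at = stuck_at

    def write(self, value: int) -> None:
        # A faulty cell ignores the written data and keeps its stuck value.
        self.value = value if self.stuck_at is None else self.stuck_at

    def read(self) -> int:
        return self.value


def classify_after_rewrite(cell: Cell, corrected: int) -> str:
    """Rewrite the corrected data once, then classify the original error."""
    cell.write(corrected)            # rewriting instructed in step S11
    if cell.read() != corrected:     # error observed again (step S13)
        return "permanent fault"     # would be logged in RG6 or RG7
    return "soft error"              # would be logged in RG4 and RG5
```

A cell whose content was merely disturbed accepts the rewrite and is classified as a
soft error, while a cell that cannot hold the rewritten data is classified as a
permanent fault.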
When in step S10 the error is judged to be an uncorrectable
error, rewriting of the error area is not performed, and the firmware issues an
instruction to the memory controller 1 to resume memory monitoring (step S16).
The memory controller 1 resumes memory monitoring (step
S17), and returns to detection of error occurrences (step S5). This flow of operation
of memory monitoring is repeated until memory monitoring stop processing is performed.
Fig. 8 shows the flow of operation for memory monitoring
stoppage in an aspect of the present invention. First, the OS decides to stop memory
monitoring (step T1). At this time, the OS sends the memory monitoring stop instruction
I2 to the firmware via the CPU. Upon receiving the memory monitoring stop instruction
I2, the firmware sets the monitoring stop bit B3 of the monitoring control register
RG1 in the memory controller 1 to "1" (step T2). In response to the fact that the
monitoring stop bit B3 of the monitoring control register RG1 has changed to "1",
the memory monitoring control portion 11 within the memory controller 1 stops memory
monitoring (step T3).
Fig. 9 shows the flow of operations for OS error monitoring
in memory monitoring of an aspect of the present invention. The OS starts monitoring
of the error detection state after the start of memory monitoring (step U1). At
this time, the OS sends a memory monitoring confirmation instruction I3 to the firmware
via the CPU. Upon receiving the memory monitoring confirmation instruction I3, the
firmware monitors each bit of the monitoring control register RG1 in the memory
controller 1 (step U2). At this time, if no error is detected, processing returns
to step U1, and after a fixed length of time, confirmation of the error detection
state is again begun.
When an error is detected in the memory controller 1, the
OS issues an error information request to the firmware (step U3). Upon receiving
this request, the firmware creates error information to be sent to the OS from information
stored in the registers 12 of the memory controller 1, and sends the information
(step U4). Here, error information includes the logical address which can be recognized
by the OS, whether the error is a permanent fault or a soft error, and similar.
The error information is sent from the firmware to the OS, and the OS performs logical
address and other processing based on this (step U5). After step U5, processing
returns to step U1, and after the fixed time has elapsed, confirmation of the error
detection state is again begun.
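The periodic check of step U2 can be sketched as a single polling pass over the
monitoring control registers of all memory controllers. The bit positions of B7 and B8
are assumptions, as is the representation of the register values as a dictionary keyed
by controller name.

```python
from typing import Dict, List

CORRECTABLE_ERROR = 1 << 6    # B7, bit position assumed
UNCORRECTABLE_ERROR = 1 << 7  # B8, bit position assumed


def controllers_with_errors(rg1_values: Dict[str, int]) -> List[str]:
    """Return the controllers whose RG1 value has B7 or B8 set.

    `rg1_values` maps a controller name to the raw value read from its
    monitoring control register RG1 during one polling pass.
    """
    mask = CORRECTABLE_ERROR | UNCORRECTABLE_ERROR
    return [name for name, rg1 in rg1_values.items() if rg1 & mask]
```

An empty result corresponds to returning to step U1 and waiting for the fixed time; a
non-empty result corresponds to the error information request of step U3.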
The firmware integrates information from all the memory
controllers 1, performs conversion into logical addresses and passes the error information
to the OS, so that the OS need not perform conversions into logical addresses. Further,
the firmware performs address conversion of the positions of errors detected by
the memory controller 1 according to the architecture, and provides the logical
addresses after processing to the OS. The OS executes error processing based on
logical addresses received from the firmware.
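The conversion performed by the firmware can be sketched as a table lookup keyed by
architecture. The table contents, the architecture names, and the function
to_error_info are hypothetical; the source states only that the firmware converts
error positions into logical addresses according to the architecture.

```python
from typing import Dict, NamedTuple


class ErrorInfo(NamedTuple):
    logical_address: int  # address as recognized by the OS
    permanent: bool       # True for a permanent fault, False for a soft error


# Hypothetical per-architecture tables: base logical address of each controller.
ARCHITECTURES: Dict[str, Dict[str, int]] = {
    "small": {"1a": 0x0000, "1b": 0x1000},
    "expanded": {"1a": 0x0000, "1b": 0x4000, "1c": 0x8000, "1d": 0xC000},
}


def to_error_info(architecture: str, controller: str,
                  mac_address: int, permanent: bool) -> ErrorInfo:
    """Convert an error position reported by a controller for the OS."""
    base = ARCHITECTURES[architecture][controller]
    return ErrorInfo(base + mac_address, permanent)
```

Because only the table differs between architectures in this sketch, the OS side
receives logical addresses in the same form in either case, which is the property that
allows a common OS to be used.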
In this way, by using firmware to modify logical addresses
accompanying changes in architecture, additional hardware can be introduced without
resulting in cost increases or increases in the number of development processes,
and a common OS can be applied to all architectures.
INDUSTRIAL APPLICABILITY
In a large-scale system, high reliability is demanded given
the large number of memory units installed. Rapid detection of fault locations in
memory is essential for maintaining high reliability of large quantities of memory,
and to this end memory diagnosis and monitoring are indispensable. This invention
enables memory monitoring using a common OS, regardless of differences in hardware
configuration.