Debugging Emacs Issues

When working with Emacs, you can use M-x report-emacs-bug RET to report any issues you notice.

The Emacs community is very responsive and corrects many of the issues that are filed. Therefore, reporting issues is extremely worthwhile and will enhance your productivity and also that of others.

The file etc/DEBUG that is contained in the Emacs source explains how to build Emacs with support for debugging in the sense of using the GNU Debugger (GDB). This is a low-level tool that can be quite useful for developers, but it is not needed to apply the methods I explain in this text.

Instead of wrestling with low-level tools, I explain a few general techniques that only require Emacs itself to find and demonstrate issues, and to report them in such a way that they can be easily corrected.

General advice

Here are a few general tips for detecting and debugging Emacs issues: These methods also carry over to other projects. For example, I found crashes, memory leaks, starvation issues, rare mistakes and race conditions by applying the same ideas in other contexts.

Case studies

Here is a short account describing how I found a series of issues in Emacs, starting from 3 initial issues. I am telling you this to give you a few ideas for locating and reproducing issues. I would also like to encourage you to report any issues you find, and never to lose hope even if you find many additional problems along the way. In fact, you will come to regard that as the typical case.

In 2007, I was working on a complex document, and I regularly encountered the following 3 issues:
  1. Emacs occasionally hung.
  2. Emacs occasionally crashed.
  3. flyspell-mode was sometimes silently disabled.
Each of these issues was extremely annoying to me. However, I was so busy with writing the document and related papers that I did not take the time to look into any of them. I briefly filed the first issue as a GNUS problem, but never heard back and did not have the time to follow up on it. In retrospect, I would likely even have saved time if I had taken a day off to reproduce and file at least one of these issues.

In March 2008, the document was finished, and I finally had the time to look into all these issues for real.

I first filed the issue that had annoyed me the most: Emacs simply hanging.

#84: 23.0.60; Occasional hangs in flyspell-mode and ispell-word

A few months later, I filed another, slightly different case that seemed at least related:

#425: 23.0.60; Hang in wait_reading_process_output

After this, I finally started to construct systematic test cases. The idea was to simply repeatedly invoke something via a program, and see what happens. The design of Emacs makes such tests very simple and pleasant: Most Emacs features can be triggered by very simple Elisp programs.

The first test case I constructed repeatedly triggers one of the situations in which I had noticed that Emacs would sometimes hang: spellchecking a word in the current buffer. The idea is easy to implement in Emacs Lisp, using ispell-word:
;; Spellcheck the inserted word forever, reporting progress every 100 runs.
(let ((n 0))
  (insert "test")
  (while t
    (setq n (1+ n))
    (when (= (mod n 100) 0)
      (message "n: %s -- %s" n (emacs-uptime)))
    (ispell-word nil t)))
With this test case, I could not reproduce the hang, but I found an issue in the underlying spellchecking program with it!

#496: 23.0.60; ispell-word becomes increasingly slower
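
A slowdown of this kind is easier to see when each batch of iterations is timed separately. The following is a minimal sketch of that idea, not the code used for the report above; my-time-batches is a hypothetical helper name, and the batch sizes are arbitrary.

```elisp
;; Hypothetical helper, not part of Emacs or of the original test case.
(defun my-time-batches (fn batches batch-size)
  "Call FN BATCH-SIZE times per batch, for BATCHES batches.
Return a list with the seconds spent in each batch, in order."
  (let (times)
    (dotimes (_ batches)
      (let ((start (float-time)))
        (dotimes (_ batch-size)
          (funcall fn))
        ;; Record the wall-clock time this batch took.
        (push (- (float-time) start) times)))
    (nreverse times)))
```

For example, evaluating (my-time-batches (lambda () (ispell-word nil t)) 10 100) in a buffer with point on a word yields ten timings. If they grow steadily, something is getting slower with each call.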

With Aspell up to and including 0.60.0, the following invocation uses increasingly more memory:
while true; do echo "-"; done | aspell -a
At this point, I began to doubt some of the beliefs I had hitherto held: If even the spellchecker contains such mistakes, maybe Emacs is also not as robust as I had thought.

For the time being, the slowdown in the spellchecker prevented me from running more complex cases around the clock. So I decided to find the cause of the crash mentioned above. I remembered that the crash had once happened when working with SVG files, so I concentrated on SVG-related workflows. One of the first issues I found was:

#501: 23.0.60; Viewing SVG files: Error when pressing C-v, M-v

Very soon after that, I encountered the crash again, but had not yet found a reliable test case:

#502: 23.0.60; Occasional crash when viewing SVG files

While working on all this, the hang that was the most annoying issue also kept reappearing. I filed an additional issue since it looked different from the case I had already reported:

#532: 23.0.60; hang, then crash

For the time being, I could not do more about this.

Now, regarding the flyspell issue: Picture yourself working on a long and complex text, with flyspell-mode enabled. Then, after several hours of working on the text, you realize that flyspell no longer underlines spelling mistakes because it was silently disabled. If you rely on the spellchecker and work under the assumption that it is running, this can cause a lot of additional work because you have to re-check parts you have already written. In addition, having to worry about the spellchecker is a major distraction from your actual work. This is completely unacceptable, and so I wanted to find the cause of this problem.

When the spell checker was silently disabled, I knew (from running ps on a terminal) that the underlying aspell process was also no longer running. Thus, I decided to pinpoint the exact moment the aspell process stopped running, and make Emacs alert me if that happened.

A simple way to do this is:
  1. invoke ps from within Emacs
  2. write its output into a temporary buffer
  3. use automated text search to see whether "aspell" appears in that buffer. If it doesn't, then aspell has stopped running.
In Emacs Lisp, I wrote this as follows:
(defun aspell-alive-p ()
  ;; Run ps, wait for its output, then search it for the aspell process.
  (with-temp-buffer
    (let ((p (start-process "ps" (current-buffer) "ps" "-A")))
      (while (eq (process-status p) 'run)
        (accept-process-output p nil nil t))
      (goto-char (point-min))
      (re-search-forward
       (format "%s.*aspell" (process-id ispell-process)) nil t))))
This sounds like one of the simplest approaches. Yet, it exposed an underlying additional issue of Emacs: When I first did it like this, the aspell process unexpectedly stopped running just by applying this simple recipe! In fact, I found out that just invoking the following form destroys the aspell process that is used by flyspell-mode:
(with-temp-buffer (start-process "ps" (current-buffer) "ps"))
Thus I filed the following issue, trimmed down to the essence, and using bc as an example process:

#554: OSX: with-temp-buffer kills unrelated processes
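
The liveness question can also be asked without starting a second process at all. The following is a minimal sketch, assuming an Emacs that provides process-live-p, which newer versions than the 23.0.60 discussed here do. The function name my-ispell-process-live-p is hypothetical, and this is not the check I used at the time.

```elisp
;; Hypothetical helper; `ispell-process' is the variable used in the text.
(defun my-ispell-process-live-p ()
  "Return non-nil if ispell.el currently holds a live spellchecker process."
  (and (boundp 'ispell-process)          ; ispell.el may not be loaded yet
       (processp ispell-process)         ; nil until a checker was started
       (process-live-p ispell-process)))
```

Unlike the ps-based check, this only asks Emacs what it believes about the process, so it cannot detect a case where Emacs itself has lost track of it.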

Meanwhile, I continued with the following definition, which checks twice per second whether flyspell-post-command-hook is still enabled:
(defun my-flyspell-check ()
  (unless (memq 'flyspell-post-command-hook post-command-hook)
    (when flyspell-mode
      (with-current-buffer (get-buffer-create "flywarn")
        (insert "Flyspell no longer active!\n"))
      (display-buffer "flywarn"))))

(setq flycheck-timer (run-with-timer 0 0.5 'my-flyspell-check))
For safety, I am still, to this day, running Emacs with this background check constantly enabled! It's simply awesome that you can write Emacs definitions that check integrity constraints of Emacs itself.
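
The same pattern extends to several constraints at once. The following is a minimal sketch under my own naming: my-integrity-checks, my-run-integrity-checks and my-report-integrity-failures are hypothetical and not part of Emacs. Each check is a function that returns non-nil while its constraint holds.

```elisp
;; Hypothetical registry: alist of (NAME . PREDICATE).
(defvar my-integrity-checks nil
  "Alist of (NAME . PREDICATE); PREDICATE returns non-nil if all is well.")

(defun my-run-integrity-checks ()
  "Return the names of all checks whose predicate returned nil or failed."
  (let (failed)
    (dolist (check my-integrity-checks (nreverse failed))
      ;; A predicate that signals an error counts as a failed check.
      (unless (ignore-errors (funcall (cdr check)))
        (push (car check) failed)))))

(defun my-report-integrity-failures ()
  "Show a message naming every failed integrity check."
  (let ((failed (my-run-integrity-checks)))
    (when failed
      (message "Integrity checks failed: %S" failed))))

;; To run the checks twice per second, as in the definition above:
;; (run-with-timer 0 0.5 #'my-report-integrity-failures)
```

A check is registered by pushing a pair such as (cons "flyspell" #'some-predicate) onto my-integrity-checks.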

It was time to look again at possible causes of the hang. By then, I had already found a combination of actions that always produced the hang. It involved three invocations of GNUS, and also a spellchecker. The recipe is described in f.el, comprising the following definitions and instructions:
;; 1) emacs -Q f.el -f eval-buffer
;; 2) M-x gnus RET q y
;;    M-x gnus RET q y
;; 3) M-! killall -9 aspell RET
;; 4) M-x gnus RET q y

(defun reactivate-flyspell ()
  (unless (memq 'flyspell-post-command-hook post-command-hook)
    (flyspell-mode 1)))

(setq my-idle (run-with-idle-timer 0.1 t 'reactivate-flyspell))
Before reporting such a complex recipe, I plastered the networking code with debugging information and tracked the actual sequence of low-level events within Emacs. I then wrote a simple self-contained test case that clearly exhibited the same issue, using three file descriptors:

#562: 23.0.60; OSX: make-network-process reuses existing file descriptors

Two days later, I applied textual bisection and a global variable to pinpoint the code that caused the problematic behaviour only on the second invocation of GNUS or a network connection. The solution of this problem, which had accompanied me for months, was to apply the following single-line patch:
diff --git a/src/process.c b/src/process.c
index b0bebeb..b5aebdc 100644
--- a/src/process.c
+++ b/src/process.c
@@ -3366,7 +3374,7 @@ usage: (make-network-process &rest ARGS)  */)
       hints.ai_protocol = 0;
-      res_init ();
+      /* res_init (); */
       ret = getaddrinfo (SDATA (host), portstring, &hints, &res);
It turned out that this was correctly diagnosed by YAMAMOTO Mituharu months before I reported this issue, but the solution suggested by Chong Yidong had unfortunately not been applied. It also turned out that upgrading my OS to the then latest version would have solved the issue as well.
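
Bisection itself is easy to automate when the candidates can be put in a list. The following is a minimal sketch of the general idea, not what I did by hand in the C source: my-bisect is a hypothetical name, and the sketch assumes exactly one culprit and a deterministic test.

```elisp
;; Hypothetical helper illustrating bisection over a list of candidates.
(defun my-bisect (items fails-p)
  "Return the element of ITEMS responsible for a failure.
FAILS-P is called with a sublist of ITEMS and must return non-nil
if the problem reproduces with only that sublist enabled.
Assumes exactly one culprit and a deterministic FAILS-P."
  (while (cdr items)
    (let* ((half (/ (length items) 2))
           (head (butlast items (- (length items) half)))
           (tail (nthcdr half items)))
      ;; Keep whichever half still reproduces the problem.
      (setq items (if (funcall fails-p head) head tail))))
  (car items))
```

With n candidates this needs about log2(n) test runs, which is what makes bisection practical even when each run requires a manual recipe.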

This still left two of my main issues unresolved. I wanted to track down the cause of the crash next. Having successfully solved the previous issue, I became increasingly relentless in the ways I tested Emacs. To look for causes of the crash, I again applied search by brute force: I knew that the crash had once happened when viewing an SVG file, so repeatedly displaying such a file might trigger it again. Therefore, I wrote an Emacs Lisp program that simply repeatedly displays an SVG file:
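  (find-file "~/emacs/etc/images/splash.svg")
  (while t
    ;; The loop body is missing from the source; re-rendering is an assumption.
    (clear-image-cache) (redisplay t))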
Again, I could not elicit a crash in this way, but I found yet another issue in Emacs:

#576: 23.0.60; displaying SVG leaks memory

Then, I also encountered the crash when working with other files, and I eventually constructed the following test case, which sufficed to correct the issue:

#580: 23.0.60; OSX: Crash in show-paren-mode

Thus, two of the three primary issues were now fixed.

In the following weeks, I celebrated and recapitulated what I had found out. In doing so, I remembered an additional issue I had encountered, and had hurriedly passed over, when constructing test cases for the hang: Sometimes, when exiting Emacs, it would ask me to quit fewer processes than I knew I had started. Thus, I filed one more issue:

#723: 23.0.60; query-on-exit-flag sometimes unexpectedly nil

It is understandable if you do not take the time to pursue additional issues that are seemingly unrelated to what you actually want to accomplish. Still, my recommendation is to work very diligently, and if necessary, quite slowly and carefully when debugging programs. In very many cases, you will find several additional issues in this process. These issues may also help to improve robustness and even allow additional methods of stressing the core features you "actually" care about.

Once more, a good approach is to simply let a program perform the work for you. In the above case, I eventually triggered a hard crash of the entire operating system by repeatedly starting and stopping a process:
(let ((n 0))
  (while t
    (setq n (1+ n))
    (message "iteration: %s" n)
    (delete-process (start-process "bc" nil "bc"))))
I thus reported yet another issue:

#726: 23.0.60; OSX: Complete OS crash

Also in this case, it turned out that upgrading the OS to the then latest version would have solved the issue.

At this point, I was already running Emacs with the background check shown above, and I had already seen the warning being triggered that flyspell was no longer active. I thus filed the following issue:

#728: 23.0.60; flyspell checking is sometimes silently disabled

If you have read through the above, you can appreciate how good it feels to receive the response "You need to try and track down the porigian of this message." when filing this issue.

One month later, I found a reliable test case and submitted a patch that corrected this issue. Thus, at last, all 3 issues I initially mentioned were fixed!

Opening words

Ulrich Neumerkel told me two powerful metaphors about programs: First, in all programs there is a path that is well-trodden and unlikely to contain mistakes. Once you leave this path, you will immediately run into mistake after mistake. Of the cases above, a few arose only because I was using a different operating system than most other Emacs users had at that time. Most of the issues arose because I was using the available functionality in different ways than it had been used in the past, or because I was the first to notice and report them.

Second, the functionality of a program is in a way like an organic muscle: If you stress it in a systematic way, it tends to get stronger over time.

With these opening words, happy M-x report-emacs-bug RETting!
