I had an interesting day. Long story short:
I am a doctor and a patient came to me. I quickly realized that he is dehydrated and needs to drink something, but obviously there is some problem. I could quickly sort him out with infusion, but I was just too curious to do so. I found out that he did not drink for 10 days. What a horrible state! He even tried to drink every 5 minutes, but he just couldn't. After a while it turned out that his pharynx is clogged. Fortunately the patient has three more pharynges, but he did not try to use them. I could teach him to drink with some other pharynx, I could clean his clogged pharynx, but... later. I just wanted to know the cause of pharynx clogging.
I looked into the drawers and found some tools, which I have never used before. I just heard that some other doctors use them regularly. But I am out of luck - other doctors have no time for me. So I took the tools and tried to use them on the patient while reading manuals. It turns out that the pharynx sometimes gets clogged when the tongue is in a special position. It is very rare case and the impact is negligible as the cleaning of the pharynx suffices to solve this rare issue. So I cleaned the pharynx and ordered him a transplantation of the old tongue branded Akka 2.4.2 with a new one Akka 2.4.20. The patient drank a lot of water and is satisfied. He will never get his pharynx clogged because of his tongue again.
... and now the real story:
I have found a CentOS server with high CPU load (100%), so right after I started investigating what is going on. I am just a simple developer, not an admin. Running top
showed that really my java application (pid 1112) is causing the issue. But I have not touched it for years, so why now?
At first, I wanted to solve it by restarting the application. But instead, the curious part of me stops me from doing so. So I asked some admins in our company for help. I just wanted to know the cause. But admins are often busy, so I started googling.
vmstat
is a nice utility showing, what is going on with the system. The output was like this:
$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 24288844 252676 2857556 0 0 0 1 1222 275 6 19 75 0 0
1 0 0 24288844 252676 2857556 0 0 0 12 1224 276 6 19 75 0 0
1 0 0 24288844 252676 2857556 0 0 0 6 1205 285 6 19 75 0 0
...
The column id
displays "75", so 75% of the time the CPU is idle. Out of four CPU cores, one is at it's max and others are just sitting there looking at the first core carrying all the burden for 10 days.
To show which thread is causing the load, I ran top -H
. It shows that from all threads, one with id 1138 is causing all the load.
With jstack 1138
I tried to find out what the thread is doing, with no success:
# jstack 1138
1138: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
Running jstack -F 1138
turns out to be better result, although the result contained all the threads of the process 1112, not just 1138:
# jstack -F 1138
Attaching to process ID 1138, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.60-b23
Deadlock Detection:
No deadlocks found.
...
Thread 1138: (state = IN_JAVA)
Thread 1137: (state = BLOCKED)
...
Then I tried kill -3 1138
, which appended something to the standard error log. It was the full thread dump. Thread ids are in hexa (1138 = 0x472)
2017-12-13 14:08:40
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode):
...
"ActorSystem-scheduler-1" #14 prio=5 os_prio=0 tid=0x00007fddb4f58800 nid=0x472 runnable [0x00007fdda4124000]
java.lang.Thread.State: RUNNABLE
at akka.dispatch.AbstractNodeQueue.pollNode(AbstractNodeQueue.java:98)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:266)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:278)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:230)
at java.lang.Thread.run(Thread.java:745)
...
Peeking into the source code of akka.dispatch.AbstractNodeQueue
I found out it caused by some consumer of some queue:
/**
* Lock-free MPSC linked queue implementation based on Dmitriy Vyukov's non-intrusive MPSC queue:
* http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue
*/
@SuppressWarnings("serial")
public abstract class AbstractNodeQueue<T> extends AtomicReference<AbstractNodeQueue.Node<T>> {
...
/*
* Use this method only from the consumer thread!
*/
@SuppressWarnings("unchecked")
public final Node<T> pollNode() {
Node<T> tail;
Node<T> next;
for(;;) {
tail = ((Node<T>)Unsafe.instance.getObjectVolatile(this, tailOffset));
next = tail.next();
if (next != null || get() == tail)
break;
}
if (next == null) return null;
else {
tail.value = next.value;
next.value = null;
Unsafe.instance.putOrderedObject(this, tailOffset, next);
tail.setNext(null);
return tail;
}
}
Dmitriy Vyukov here states the following:
Advantages:
- Waitfree and fast producers. One XCHG is maximum what one can get with multi-producer non-distributed queue.
- Extremely fast consumer. On fast-path it's atomic-free, XCHG executed per node batch, in order to grab 'last item'.
- No need for node order reversion. So pop operation is always O(1).
- ABA-free.
- No need for PDR. That is, one can use this algorithm out-of-the-box. No need for thread registration/deregistration, periodic activity, deferred garbage etc.
Disadvantages:
- Push function is blocking wrt consumer. I.e. if producer blocked in (), then consumer is blocked too. Fortunately 'window of inconsistency' is extremely small - producer must be blocked exactly in (). Actually it's disadvantage only as compared with totally lockfree algorithm. It's still much better lock-based algorithm.
- The algorithm is not linearizable.
Notes:
MPSC = multiple producers, single consumer
XCHG = instruction to swap values in registers/memory
ABA problem = John is waiting in his car at a red traffic light with his children. His children start fighting with each other while waiting, and he leans back to scold them. Once their fighting stops, John checks the light again and notices that it's still red. However, while he was focusing on his children, the light had changed to green, and then back again. John doesn't think the light ever changed, but the people waiting behind him are very mad and honking their horns now.
From the source code I think that the thread is in an infinite cycle:
for(;;) {
tail = ((Node<T>)Unsafe.instance.getObjectVolatile(this, tailOffset));
next = tail.next();
if (next != null || get() == tail)
break;
}
In every cycle the next == null and get() != tail and the for cycle never ends.
Fortunately, I have found this commit: https://github.com/akka/akka/commit/f9d3789bfca6eb08fb2dcf9cc7b1dc1643e099e6#diff-1aa2e912a1b540c9cf0ffdb046e38bfd, where the author of the previous implementation Roland Kuhn (Leader of Akka 2012-2016) fixed the method to:
/**
* Pull one item from the queue, returning it within a queue node.
*
* Use this method only from the consumer thread!
*
* @return queue node with element inside if there was one, or null if there was none
*/
@SuppressWarnings("unchecked")
public final Node<T> pollNode() {
final Node<T> tail = (Node<T>) Unsafe.instance.getObjectVolatile(this, tailOffset);
Node<T> next = tail.next();
if (next == null && get() != tail) {
// if tail != head this is not going to change until producer makes progress
// we can avoid reading the head and just spin on next until it shows up
do {
next = tail.next();
} while (next == null);
}
if (next == null) return null;
else {
tail.value = next.value;
next.value = null;
Unsafe.instance.putOrderedObject(this, tailOffset, next);
tail.setNext(null);
return tail;
}
}
It is further discussed in https://github.com/akka/akka/issues/19949. The commit is included in Akka 2.4.3. After upgrading to newest Akka 2.4.* the issue never showed up again.
Congratulations @mirelon! You have completed some achievement on Steemit and have been rewarded with new badge(s) :
You published your First Post
You made your First Comment
You got a First Vote
Click on any badge to view your own Board of Honor on SteemitBoard.
For more information about SteemitBoard, click here
If you no longer want to receive notifications, reply to this comment with the word
STOP
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
Resteemed by @resteembot! Good Luck!
The resteem was payed by @greetbot
Curious?
The @resteembot's introduction post
Get more from @resteembot with the #resteembotsentme initiative
Check out the great posts I already resteemed.
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit