scaling programs

| No Comments | No TrackBacks
Theodore Ts'o began saying how we used to write code so it could scale up to a lot of CPUs, and then we stopped worrying about it. Back in 2001 the linux kernel started having SMP, so support for multiple CPUs. Used to have computers with a lot of sockets for a lot of CPUs, and those systems were very expensive.

SMP is difficult, because if computers share memory, they need to make sure they coordinate about the memory (ie, make sure a cache in one CPU doesn't have old information about whats in memory because another CPU wrote to it). Then they tried to implement NUMA (so CPUs prefer memory that is closer to them), but that's complicated, and can cause performance hits.

So a machine with 4 sockets costs a lot more than a machine with 1 socket. There are a bunch of benchmarks for single CPU systems to work out how fast a system works. But all the different benchmarks test all sorts of different things. To mention SMP scalability S = (score on N cpus / score on 1 CPU). Ie, Linux 2.6 scales to 12 out of 16 CPUs on the blah benchmark. 12 out of 16 is considered pretty good. But far short of the 100% scalling up of processing that you'd desire.

In early 2000's theere was a linux scalability effort. They had weekly conference calls. There were a lot of companies involved. They had weekly/monthly bnchmark measurements by a performance team. So this occurred for 2 to 3 years, then decided that it was good enough. At the start of 2001 we had really poor benchmarks upto 4 CPUs. After this we got upto 6-7 out of 8, or 12 out of 16, and an acceptable number of CPU's on 32 CPU systems.

But this effort died down for several reasons. One of those reasons was that people spending big money on high end gear preferred to run the high end legacy OS's. And linux succeeded wildly on x86, which had very few servers that had more than 8 to 16 CPUs. Linux was used for scale-out computing (running lots of separate systems). And for approximately the next 4 to 5 years nothing was really done for scalability. During this time we saw the rise of Linux on embedded and mobile equipment.

But then CPU frequencies haev stopped doubleing every 12 months, and now we are seeing in mainstream CPUs that have multiple cores. So scalability is starting to matter again. So Theodore suggests that it's time for kernel programmers to think about scalability testing, and for application programmers to think about multiprocessor programming.

He moved on to talking about ext3. Historically most workloads don't really stress the file system, other bottlenecks are hit first. In Enterprise databases are tending to use Direct I/O to preallocated files. And ext3 was pretty good for these direct I/O. But ext3 doesn't perform well in benchmark competitions. But as it wasn't the bottle neck, so system administrators didn't care because it worked, and it was easy to service if things went wrong.

For ext4, in April 2010 IBM's real time team was trying to make it better. They found they were spending 90% of their time on spinlocks. This was in the journal start and stop functions. Theodore said you need to document which lock is used to protect a variable, and what order the locks need to be applied.

As transactions are expensive, in jbd2 (the journalling code for April 2010 ext4) they gruop multiple file system operations into a single transaction. Transaction commits happen every 5 seconds, or when the journal is full. So each file system opration is bracketed by some jbd2_journal start or stop function call. And those calls need to grab some locks to manage some meta data. He found the j_state_lock spinlock was apparently not being used to protect any data. So removing that lock immediately improved performance.

He showed some graphs of how the patch improved performance. The graphs included xfs, which showed xfs was heaps better. He moved onto a benchmark/tool that shows information about how locks were used. But I didn't understand which locks were being referred to. He made some changes to the jbd2 locks, and performance got better.

Then the next problem was that the ext4 layer was submitting writes to the block layer in a 4k block at a time. The block layer would merge them together, but that work meant there was more CPU work doing that. To fix this required quite a large overhald and cleanup.

He finished up by mentioning a bug that needs fixing, and that we need to start thinking about multithreaded programming again. A lot of what h talked about can be used in application programming. He suggested having a look at valgrind's drd tool to find data races, and Lennart Poettering's mutrace tool.

No TrackBacks

TrackBack URL:

Leave a comment

About this Entry

This page contains a single entry by Geoff Crompton published on January 26, 2011 12:30 PM.

Behaviour Driven Infrastructure was the previous entry in this blog.

node.js is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.



Powered by Movable Type 4.23-en