The other day I was working on some sample code to test out an idea that involved an object with an internal nested array. This is a pretty common pattern in PHP: You have some simple one-off internal data structure so you make an informal struct using PHP associative arrays. Maybe you document it in a docblock, or maybe you're a lazy jerk and you don't. (Fight me!) But really, who bothers with defining a class for something that simple?
But that got me wondering, is that common pattern really, you know, good? Are objects actually more expensive or harder to work with than arrays? Or, more to the point, is that true today on PHP 7 given all the optimizations that have happened over the years compared with the bad old days of PHP 4?
So like any good scientist I decided to test it: What I found will shock you!
Benchmark environment
My test system is a Lenovo X1 Carbon 2017 Edition, i5-7300U CPU @ 2.60GHz, 16 GB of RAM, running Kubuntu 18.04. The PHP version is 7.2.5-0ubuntu0.18.04.1. XDebug is disabled. (Always do that before running benchmarks!) I have as much background processing turned off as I could manage, though on modern systems runtime optimizations mean there will always be some variation and jitter.
You will almost certainly get different absolute numbers than I do but the relative values should be about the same.
Associative arrays (Baseline)
The baseline test looks like this:
<?php
declare(strict_types=1);
error_reporting(E_ALL | E_STRICT);
const TEST_SIZE = 1000000;
$list = [];
$start = $stop = 0;
$start = microtime(true);
for ($i = 0; $i < TEST_SIZE; ++$i) {
    $list[$i] = [
        'a' => random_int(1, 500),
        'b' => base64_encode(random_bytes(16)),
    ];
}
ksort($list);
usort($list, function($first, $second) {
    return [$first['a'], $first['b']] <=> [$second['a'], $second['b']];
});
$stop = microtime(true);
$memory = memory_get_peak_usage();
printf("Runtime: %s\nMemory: %s\n", $stop - $start, $memory);
That is, we build an array of 1 million items, where each item is an associative array containing an int and a short string. This "anonymous struct" is very typical of the type of data structure I'm talking about, which is often assigned to a private property within an object and only accessed within it. (Although some systems like to expose these anonymous structs as though they were an API, which is one of the most developer-hostile API designs I have ever seen. You know who you are.) 1 million items is somewhat larger than a typical use case but we want to stress test it, so go big or go home.
The goal is to measure the memory used by all of those nested arrays as well as the time it takes to process them. For that, we're sorting the array twice, once by the key (which should be a no-op) and once by the array itself, using a custom sort function.
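The custom sort function leans on how PHP's spaceship operator compares two arrays of equal length: element by element, stopping at the first difference. A minimal sketch of that behavior, using a few made-up rows rather than the benchmark's random data:

```php
<?php
declare(strict_types=1);

// Illustrative rows only; the benchmark generates random values instead.
$rows = [
    ['a' => 2, 'b' => 'x'],
    ['a' => 1, 'b' => 'z'],
    ['a' => 1, 'b' => 'y'],
];

// Same comparator shape as the benchmark: order by 'a', then by 'b'.
usort($rows, function (array $first, array $second): int {
    return [$first['a'], $first['b']] <=> [$second['a'], $second['b']];
});

foreach ($rows as $row) {
    printf("%d %s\n", $row['a'], $row['b']);
}
```

The rows come out ordered by `a` first, with ties broken by `b`.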
As a second test, I also want to check the serialization size. These giant lookup tables are often built once and serialized to a database for cache lookup, so knowing the trade off there is also useful. For that we use this slightly different script:
<?php
declare(strict_types=1);
error_reporting(E_ALL | E_STRICT);
const TEST_SIZE = 1000000;
$list = [];
$start = $stop = 0;
$start = microtime(true);
for ($i = 0; $i < TEST_SIZE; ++$i) {
    $list[$i] = [
        'a' => random_int(1, 500),
        'b' => base64_encode(random_bytes(16)),
    ];
}
$ser = serialize($list);
unserialize($ser);
$stop = microtime(true);
$memory = memory_get_peak_usage();
printf("Runtime: %s\nMemory: %s\nSize: %s\n", $stop - $start, $memory, strlen($ser));
To account for natural jitter in the process, I ran each test once to prime it (although on the CLI that shouldn't matter, but it doesn't hurt). Then I ran it three more times in a row and averaged the results. Here are the results for our baseline test:
Associative array (Sorting)
Run | Runtime (s) | Memory (bytes) |
---|---|---|
1 | 9.4488079547882 | 541450384 |
2 | 9.8389720916748 | 541450384 |
3 | 9.0056548118591 | 541450384 |
Avg | 9.4311 | 541450384 |
Associative array (Serialize)
Run | Runtime (s) | Memory (bytes) | Size (bytes) |
---|---|---|---|
1 | 1.8638360500336 | 1100384368 | 68673068 |
2 | 1.8579361438751 | 1100384368 | 68672734 |
3 | 1.8860640525818 | 1100388464 | 68673514 |
Avg | 1.8692 | 1100385733 | 68673105 |
So about 9.4 seconds and a half GB of memory to work with associative arrays. The serialized form is 68 MB. The runtime is pretty stable and the memory usage is constant, as expected. (The slight variation is most likely due to randomly generated numbers of different length.) Those are the values to beat.
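The prime-once, run-three-times-and-average routine could also be scripted. This is a rough sketch under a few assumptions: the `averageRuntime()` helper is hypothetical, the workload is much smaller than the article's, and only runtime is measured, because peak memory is tracked per process and would need a fresh process for each run.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: run $test once to prime it, then $runs more times,
 * and return the mean wall-clock runtime in seconds.
 */
function averageRuntime(callable $test, int $runs = 3): float
{
    $test(); // Priming run, not counted.

    $total = 0.0;
    for ($i = 0; $i < $runs; ++$i) {
        $start = microtime(true);
        $test();
        $total += microtime(true) - $start;
    }

    return $total / $runs;
}

// A much smaller workload than the article's, just to show the shape.
$average = averageRuntime(function (): void {
    $list = [];
    for ($i = 0; $i < 10000; ++$i) {
        $list[$i] = [
            'a' => random_int(1, 500),
            'b' => base64_encode(random_bytes(16)),
        ];
    }
    usort($list, function (array $first, array $second): int {
        return [$first['a'], $first['b']] <=> [$second['a'], $second['b']];
    });
});

printf("Average runtime: %.4f s\n", $average);
```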
stdClass
For completeness let's switch to a stdClass object. I predicted this would be about the same, as structurally stdClass objects are basically associative arrays that pass by handle instead of by value. Here are the new tests (the boilerplate start and end parts omitted):
for ($i = 0; $i < TEST_SIZE; ++$i) {
    $o = new stdClass();
    $o->a = random_int(1, 500);
    $o->b = base64_encode(random_bytes(16));
    $list[$i] = $o;
}
ksort($list);
usort($list, function($first, $second) {
    return [$first->a, $first->b] <=> [$second->a, $second->b];
});
And here's the data:
stdClass (Sorting)
Run | Runtime (s) | Memory (bytes) |
---|---|---|
1 | 10.945838928223 | 589831120 |
2 | 11.50714302063 | 589831120 |
3 | 11.199006080627 | 589831120 |
Avg | 11.2173 | 589831120 |
stdClass (Serialize)
Run | Runtime (s) | Memory (bytes) | Size (bytes) |
---|---|---|---|
1 | 3.1958901882172 | 1210154464 | 81672386 |
2 | 3.3245379924774 | 1210154464 | 81673031 |
3 | 3.2109470367432 | 1210154464 | 81673730 |
Avg | 3.2437 | 1210154464 | 81673049 |
Huh. I expected the serialized version to be a bit bigger as it needs to store the string "stdClass" over and over again. I didn't expect it to also be measurably slower and less memory efficient than associative arrays. It's not a massive difference, and at smaller cardinality it probably wouldn't be measurable, but it's definitely there.
Why does anyone use stdClass again?
Object with public properties
Now let's get into the real test. In this case we'll predefine a class to use for our list and use two public properties on it. PHP doesn't support typed properties in PHP 7.2 (although it looks like it probably will in an upcoming version), but it does still do various optimizations to object structures when it knows the properties in advance. Let's see if those optimizations pan out in practice.
Here's our test code:
class Item
{
    public $a;
    public $b;
}
for ($i = 0; $i < TEST_SIZE; ++$i) {
    $o = new Item();
    $o->a = random_int(1, 500);
    $o->b = base64_encode(random_bytes(16));
    $list[$i] = $o;
}
ksort($list);
usort($list, function($first, $second) {
    return [$first->a, $first->b] <=> [$second->a, $second->b];
});
And the data:
Public properties (Sorting)
Run | Runtime (s) | Memory (bytes) |
---|---|---|
1 | 8.1981730461121 | 253831584 |
2 | 8.0346500873566 | 253831584 |
3 | 8.4190359115601 | 253831584 |
Avg | 8.2172 | 253831584 |
Public properties (Serialize)
Run | Runtime (s) | Memory (bytes) | Size (bytes) |
---|---|---|---|
1 | 3.096804857254 | 1326154736 | 77673599 |
2 | 3.0712831020355 | 1326154736 | 77672792 |
3 | 3.0746259689331 | 1326154736 | 77672696 |
Avg | 3.081 | 1326154736 | 77673029 |
BOOM! For sorting, a proper classed object is measurably faster than an array but the big difference is on memory. It uses half as much memory as the array version did. Half.
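The memory gap can be reproduced at a smaller scale without the sorting step. This is a rough sketch, not the original benchmark: the `memoryFor()` helper and the sample size are assumptions, and it reads `memory_get_usage()` before and after building each list instead of using the peak figure.

```php
<?php
declare(strict_types=1);

const SAMPLE_SIZE = 10000;

class Item
{
    public $a;
    public $b;
}

/**
 * Hypothetical helper: bytes allocated while the list returned
 * by $build() is still alive.
 */
function memoryFor(callable $build): int
{
    $before = memory_get_usage();
    $list = $build();
    $after = memory_get_usage();

    return $after - $before;
}

$arrayBytes = memoryFor(function (): array {
    $list = [];
    for ($i = 0; $i < SAMPLE_SIZE; ++$i) {
        $list[$i] = [
            'a' => random_int(1, 500),
            'b' => base64_encode(random_bytes(16)),
        ];
    }
    return $list;
});

$objectBytes = memoryFor(function (): array {
    $list = [];
    for ($i = 0; $i < SAMPLE_SIZE; ++$i) {
        $o = new Item();
        $o->a = random_int(1, 500);
        $o->b = base64_encode(random_bytes(16));
        $list[$i] = $o;
    }
    return $list;
});

printf("Arrays:  %d bytes\nObjects: %d bytes\n", $arrayBytes, $objectBytes);
```

Absolute numbers will differ by PHP version and platform; the point is only that the object list should come out noticeably smaller.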
Serialization didn't fare quite so well. It's about on par with stdClass time-wise but a bit more efficient space-wise. I strongly suspect that's because the string "Item" is shorter than "stdClass", which gets repeated over and over in the serialized value. That's something to note if dealing with a namespaced class, as then the serialized class name can be quite long.
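The class-name overhead is easy to see by serializing one small item in each form. A minimal sketch with fixed values in place of the benchmark's random data:

```php
<?php
declare(strict_types=1);

class Item
{
    public $a;
    public $b;
}

$array = ['a' => 1, 'b' => 'x'];

$std = new stdClass();
$std->a = 1;
$std->b = 'x';

$item = new Item();
$item->a = 1;
$item->b = 'x';

$forms = [
    'array'    => serialize($array),
    'stdClass' => serialize($std),
    'Item'     => serialize($item),
];

// Print each serialized string with its length in bytes.
foreach ($forms as $label => $serialized) {
    printf("%-8s %2d bytes  %s\n", $label, strlen($serialized), $serialized);
}
```

The array header is `a:2:{`, while the object headers are `O:8:"stdClass":2:{` and `O:4:"Item":2:{`. That is 13 and 9 extra bytes per item respectively, which lines up with the roughly 13 MB and 9 MB gaps in the tables for a million items.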
Object with private properties
A lot of people (like yours truly) preach against using public properties, though, in favor of protected properties and methods. That does introduce more method calls into our test, though. How will that fare?
Here's the new test code:
class Item
{
    protected $a;
    protected $b;
    public function __construct(int $a, string $b)
    {
        $this->a = $a;
        $this->b = $b;
    }
    public function a() : int { return $this->a; }
    public function b() : string { return $this->b; }
}
for ($i = 0; $i < TEST_SIZE; ++$i) {
    $list[$i] = new Item(random_int(1, 500), base64_encode(random_bytes(16)));
}
ksort($list);
usort($list, function(Item $first, Item $second) {
    return [$first->a(), $first->b()] <=> [$second->a(), $second->b()];
});
And the data:
Private properties (Sorting)
Run | Runtime (s) | Memory (bytes) |
---|---|---|
1 | 11.160441160202 | 253833000 |
2 | 10.926701068878 | 253833000 |
3 | 11.177386045456 | 253833000 |
Avg | 11.0881 | 253833000 |
Private properties (Serialize)
Run | Runtime (s) | Memory (bytes) | Size (bytes) |
---|---|---|---|
1 | 3.2856619358063 | 1332152352 | 83672594 |
2 | 3.1651678085327 | 1332152352 | 83672048 |
3 | 3.2460420131683 | 1332152352 | 83672899 |
Avg | 3.2322 | 1332152352 | 83672513 |
As predicted, adding methods to the mix slows it down a bit. The memory usage is very close to the public property version. Somehow the serialized version got a little bit slower and larger, but not dramatically. Again, at lower cardinality it would probably not be measurable.
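One contributor to the larger serialized size is how PHP serializes non-public properties: a protected property's name is stored with a `\0*\0` prefix, which adds three bytes per property. With two properties that is about 6 bytes per object, in line with the roughly 6 MB gap between the two serialize tables. A small sketch; the class names here are hypothetical and deliberately the same length so that only visibility differs:

```php
<?php
declare(strict_types=1);

// Hypothetical class names, chosen to be the same length so that only
// the property visibility differs between the two serialized strings.
class PubItem
{
    public $a = 1;
    public $b = 'x';
}

class ProItem
{
    protected $a = 1;
    protected $b = 'x';
}

$public = serialize(new PubItem());
$protected = serialize(new ProItem());

// Make the NUL bytes visible before printing.
$visible = function (string $s): string {
    return str_replace("\0", '\0', $s);
};

printf("%d bytes  %s\n", strlen($public), $visible($public));
printf("%d bytes  %s\n", strlen($protected), $visible($protected));
```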
Anonymous classes
Of course, some people are allergic to defining classes. I don't know why but they still view it as a slow and expensive thing to do. Maybe they're concerned about file count (given that PHP by convention uses a file-per-class structure, although nothing in the language mandates that). For completeness, though, let's define an anonymous class inline and see how it measures up. We'll only do the public-property version as we know that adding methods will slow it down a tad.
One thing to note, however, is that anonymous classes cannot be serialized. If you need to serialize your data structure then anonymous classes are a no-go. We'll skip that test, of course.
Here's the code:
for ($i = 0; $i < TEST_SIZE; ++$i) {
    $o = new class(random_int(1, 500), base64_encode(random_bytes(16))) {
        public $a;
        public $b;
        public function __construct(int $a, string $b)
        {
            $this->a = $a;
            $this->b = $b;
        }
    };
    $list[$i] = $o;
}
And the data:
Anonymous class (Sorting)
Run | Runtime (s) | Memory (bytes) |
---|---|---|
1 | 8.0319430828094 | 253832368 |
2 | 7.9839849472046 | 253832368 |
3 | 8.3128731250763 | 253832368 |
Avg | 8.1095 | 253832368 |
Right in the same neighborhood as the named class, give or take. So for about the same performance and no ability to serialize it, you don't need to define a class by name. I'm sure someone will argue that is a good trade off but that someone would not be me.
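The serialization restriction is easy to confirm. In this sketch, calling `serialize()` on an anonymous-class instance throws; the exact message may vary by PHP version, so the code catches `Throwable` and simply reports what it got:

```php
<?php
declare(strict_types=1);

$o = new class(1, 'x') {
    public $a;
    public $b;

    public function __construct(int $a, string $b)
    {
        $this->a = $a;
        $this->b = $b;
    }
};

$failed = false;
try {
    serialize($o);
} catch (Throwable $e) {
    // Anonymous classes are rejected by serialize().
    $failed = true;
    printf("%s: %s\n", get_class($e), $e->getMessage());
}
```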
Summary
Here's our final data, showing the percent change relative to our baseline for each value (negative number means decrease, which is good):
Summary (Sorting)
Technique | Runtime (s) | Memory (bytes) |
---|---|---|
Associative array | 9.4311 (n/a) | 541450384 (n/a) |
stdClass | 11.2173 (+18.94%) | 589831120 (+8.94%) |
Public properties | 8.2172 (-12.87%) | 253831584 (-53.12%) |
Private properties | 11.0881 (+17.57%) | 253833000 (-53.12%) |
Anonymous class | 8.1095 (-14.01%) | 253832368 (-53.12%) |
Summary (Serialize)
Technique | Runtime (s) | Memory (bytes) | Size (bytes) |
---|---|---|---|
Associative array | 1.8692 (n/a) | 1100385733 (n/a) | 68673105 (n/a) |
stdClass | 3.2437 (+73.53%) | 1210154464 (+9.98%) | 81673049 (+18.93%) |
Public properties | 3.081 (+64.83%) | 1326154736 (+20.52%) | 77673029 (+13.11%) |
Private properties | 3.2322 (+72.92%) | 1332152352 (+21.06%) | 83672513 (+21.84%) |
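The percentages in these tables are plain relative change against the associative array baseline. A minimal sketch of the arithmetic, using two of the averages from the sorting tables (the `percentChange()` helper is hypothetical):

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: percent change of $value relative to $baseline,
 * rounded to two decimal places. Negative means a decrease.
 */
function percentChange(float $baseline, float $value): float
{
    return round(($value - $baseline) / $baseline * 100, 2);
}

// Averages taken from the sorting tables.
printf("stdClass runtime:         %+.2f%%\n", percentChange(9.4311, 11.2173));
printf("Public properties memory: %+.2f%%\n", percentChange(541450384, 253831584));
```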
What can we conclude from all of this?
First off, a reminder that we're dealing with a cardinality of 1 million here. That means if your cardinality is 4, odds are you won't notice an earth-shattering difference no matter what you do. However, it's still good to get into good habits in case your cardinality does grow considerably.
The first thing we can conclude is that if the one and only thing you care about is serialization/deserialization performance, associative arrays still win. In these tests the object versions took 65% to 74% longer to serialize and unserialize, and their serialized output was 13% to 22% larger.
The second thing we can conclude is that stdClass should be used basically never. It's slower and more memory intensive than arrays in every circumstance. Just don't go there.
In just about every other situation I can think of, named classes win. Their memory usage is half that of a corresponding array. The optimizations the engine can do when it knows up front what the structure of your data is going to be are massive and pay off huge dividends in memory consumption. With public properties they're also over 10% faster. The only downside is serialization, where there is an added cost in time, memory, and stored size. When we also consider that a classed object is far more self-documenting than an associative array, gives IDEs the ability to auto-complete for you, and gives you a place to include additional documentation (which you should include), it's one of the clearest wins I've seen in PHP.
In other words, if you're one of those people who claims that "good code is self-documenting, you don't need comments", and you're not using a classed object, then you're not just wrong, you're a hypocrite who's also wrong. Don't be that person.
The question of public properties vs methods is, I would argue, open. Methods do offer a more structured, self-documenting, more flexible approach but at the same time carry a hefty CPU penalty: roughly 18% slower than associative arrays in the sorting test. (They still destroy arrays on memory, though.) Whether that is a good trade off or not depends on your use case. My default recommendation would be, when we're talking about what is essentially a private class, use public properties for the main data but don't feel shy about adding additional methods to the object if you want to compute stuff off of it, or it makes sorting easier, or it is otherwise helpful for your use case. Putting a constructor on the class so you can initialize it in a single line is probably a good idea, and I expect it would be a wash performance-wise.
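That recommendation could look something like the following sketch: public properties for the data, a constructor for one-line setup, and a helper for sorting. The static `compare()` method is illustrative and was not part of the benchmark code.

```php
<?php
declare(strict_types=1);

final class Item
{
    /** @var int */
    public $a;

    /** @var string */
    public $b;

    public function __construct(int $a, string $b)
    {
        $this->a = $a;
        $this->b = $b;
    }

    // Illustrative helper: order by $a, then by $b.
    public static function compare(Item $first, Item $second): int
    {
        return [$first->a, $first->b] <=> [$second->a, $second->b];
    }
}

$list = [new Item(2, 'x'), new Item(1, 'z'), new Item(1, 'y')];
usort($list, [Item::class, 'compare']);

foreach ($list as $item) {
    printf("%d %s\n", $item->a, $item->b);
}
```

The comparator still reads the public properties directly, so the sort itself makes no accessor method calls.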
As another consideration, it's common these days for larger frameworks to generate code based on plugin information and store that on disk not as a serialized string but as a generated PHP class that can then be just loaded like any other. (Think Dependency Injection Containers, Event Dispatchers, theme systems where you can register template plugins, etc.) In that case the serialization point is moot and you have absolutely no excuse for not using a named class. Generating out a big nested associative array into your compiled code is just flat out inexcusably wasteful. Don't do that. Stop it.
Although I only ran the tests on PHP 7.2, I'm reasonably confident these results will hold for PHP 7.0 and later. It's possible they would be different on PHP 5, but since all versions of PHP 5 will be fully unsupported within 6 months I really don't care if they're applicable.
tl;dr: Use named classes with public properties for big internal data structures. If you're still using nested associative arrays for that, You're Doing It Wrong(tm).
I always start with arrays for quick prototyping, then I switch to objects for storing the same data. Not only because I suspected it would be faster (because of the class definition) but because the data then has its own methods that know how to deal with it. Here is an example where I moved array structures into their own class; the code is much nicer and it runs a bit faster if you measure over a few million runs.
Interesting article that confirms my theory :D, thanks for writing it.
Nice! Yeah, the ability to encapsulate behavior is one of the most obvious benefits of a class but there's been a general belief in PHP for years that doing so was more expensive than doing it "manually". That may have been true once, but it's definitely not true today. In fact quite the opposite.
An addendum, as a few people have pointed out to me on Twitter:
This applies to runtime behavior. PHP has another optimization where, if you define an array as a const, it gets placed in shared memory with the code, so the net memory cost to each process using that array is 0.
That's really only applicable if:
In that case, a const big nested array may indeed be better both for CPU and memory.
The runtime builder for that compiled code, though, is still better off using objects for memory efficiency so that you can produce that compiled code.
As always, context matters. :-)
Great write up, Larry. I won’t fight you. You made a good argument.
Oh good. We have enough things to fight about. I'd hate to add programming optimization to the list. :-)
Nice benchmark !
And what about implementing Serializable on the named class to still store it as an associative array? Is it the best win-win combo? Of course we need to ask if defining serialization for a simple data struct is relevant 😊.
My guess is it would be slower because it has to call serialize/deserialize in user-space for each class. It might end up being smaller but the performance cost is likely not worth it. That said, I haven't tried.
How about using the array_multisort() instead of usort():
Running on a mac book pro, 2.2GHz Intel Core i7, 16GB, listing Av. of 3 runs:
Associative array (Sorting)
Object with public properties (sorting):
a tradeoff between memory and runtime ...
Interesting observation! If you're sorting an array, yes, that would make a big difference. However, the purpose of usort() here was to provide a direct comparison between objects and arrays, so they had to be used in the same way. That meant usort() so that we could compare the property access in each. I didn't care as much about the sorting itself, as sorting was an easy way to call $array['a'] and $object->a a few zillion times. :-)
This is rather older, but here's a post from Nikita Popov explaining the difference in storage in PHP 5.4: https://gist.github.com/nikic/5015323
The structs have changed dramatically in PHP 7, but the basic optimization he describes is still with us, and is the reason for these results.
Some more recent posts on the topic, too:
https://nikic.github.io/2011/12/12/How-big-are-PHP-arrays-really-Hint-BIG.html
https://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html
Hi there:
I really tried to use model classes but it is impractical.
Let's say we want to json_serialize. OK, it is not a problem. But what if we have a field that is composed of another model?
Serializing it is not fun. However, deserializing (JSON) is a big challenge because the system doesn't know that the field $typeCustomer is an object and deserializes it as stdClass, so every method attached to TypeCustomer fails.
https://dev.to/jorgecc/php-is-bad-for-object-oriented-programming-oop-282a
Very good post. I often had these issues with associative arrays while writing the code for websites like https://www.receivesms.co