16 Minutes to Build a QList

It took CopperSpice almost 16 minutes to build the same QList Qt built in under a second. I have spent more than a day creating the test case and running the test data. It is currently 6:24pm local time. I started my callgrind test of this at about a quarter to 2pm. It’s still running.

Callgrind start time

Here are the current stats.

roland@roland-HP-EliteDesk-800-G2-SFF:~/sf_projects/ListTest_debug$ callgrind_control -e -b
 PID 81466: ./listtest
 sending command status internal to pid 81466
 Totals:               Ir 
    Th 1  2,138,722,737,075 
    Th 2         38,724,327 
 Frame:                Ir Backtrace for Thread 1
    [ 0]    180,647,087,012 __gnu_cxx::__normal_iterator > >::base() const (25806726717 x)
    [ 1]    903,232,403,521 bool __gnu_cxx::operator!= > >(__gnu_cxx::__normal_iterator > > const&, __gnu_cxx::__normal_iterator > > const&) (25806640101 x)
    [ 2]  1,935,503,204,513 long CsString::utf8::distance<__gnu_cxx::__normal_iterator > > >(__gnu_cxx::__normal_iterator > >, __gnu_cxx::__normal_iterator > >) (86616 x)
    [ 3]  1,935,515,065,506 CsString::CsBasicString >::size() const (86605 x)
    [ 4]  1,935,493,582,772 QString8::size() const (86604 x)
    [ 5]  2,137,931,512,290 MainWindow::tokenize(QString8) (1 x)
    [ 6]  2,138,197,350,792 MainWindow::startTest() (1 x)
    [ 7]  2,138,197,350,814 void CsSignal::Internal::cs_unpack_method_args_internal(MainWindow, void (MainWindow::)(), std::tuple<> const&, std::integer_sequence) (1 x)
    [ 8]  2,138,197,350,832 CsSignal::Internal::CSVoidReturn CsSignal::Internal::cs_unpack_method_args(MainWindow, void (MainWindow::)(), std::tuple<> const&) (1 x)
    [ 9]  2,138,197,351,622 CsSignal::Internal::Bento::invoke(CsSignal::SlotBase, CsSignal::Internal::TeaCupAbstract const) const (1 x)
    [10]  2,138,197,353,213 void CsSignal::activate(QAbstractButton&, void (QAbstractButton::)(bool), bool&) (1 x)    [11]  2,138,197,354,342 _dl_runtime_resolve_xsave (1 x)    [12]  2,138,197,354,349 0x00000000052764c0 (1 x)    [13]  2,138,197,354,436 QAbstractButtonPrivate::emitClicked() (1 x)    [14]  2,138,197,355,289 _dl_runtime_resolve_xsave (1 x)    [15]  2,138,197,355,296 0x000000000528e210 (1 x)    [16]  2,138,197,639,154 QAbstractButtonPrivate::click() (1 x)    [17]  2,138,197,639,951 _dl_runtime_resolve_xsave (1 x)    [18]  2,138,197,639,958 0x0000000005294760 (1 x)    [19]  2,138,197,640,057 QAbstractButton::mouseReleaseEvent(QMouseEvent) (1 x)
    [20]  2,138,208,742,219 QWidget::event(QEvent) (320 x)    [21]  2,138,198,960,804 0x000000000525e110 (64 x)    [22]  2,138,200,447,252 QAbstractButton::event(QEvent) (64 x)
    [23]  2,138,200,446,077 0x000000000526ef20 (60 x)
    [24]  2,138,200,431,643 QPushButton::event(QEvent) (52 x)    [25]  2,138,213,997,253 QApplicationPrivate::notify_helper(QObject, QEvent) (187 x)    [26]  2,138,198,174,253 0x000000000528fb40 (2 x)    [27]  2,138,214,176,491 QApplication::notify(QObject, QEvent) (144 x)    [28]  2,138,210,280,964 QCoreApplication::notifyInternal(QObject, QEvent) (115 x)    [29]  2,138,198,412,550 0x00000000052940b0 (46 x)    [30]  2,138,198,439,141 QApplicationPrivate::sendMouseEvent(QWidget, QMouseEvent, QWidget, QWidget, QWidget, QPointer&, bool) (45 x)    [31]  2,138,198,453,527 0x000000000528ece0 (46 x)    [32]  2,138,198,628,219 QWidgetWindow::handleMouseEvent(QMouseEvent) (45 x)
    [33]  2,138,198,654,192 0x00000000052574a0 (46 x)
    [34]  2,138,206,007,189 QWidgetWindow::event(QEvent) (59 x)    [35]  2,138,215,800,987 QApplicationPrivate::notify_helper(QObject, QEvent) (424 x)    [36]  2,138,215,810,580 0x000000000528fb40 (425 x)    [37]  2,138,215,920,651 QApplication::notify(QObject, QEvent) (425 x)    [38]  2,138,209,267,733 QCoreApplication::notifyInternal(QObject, QEvent) (294 x)    [39]  2,138,198,682,215 0x00000000052940b0 (46 x)    [40]  2,138,198,681,680 QApplicationPrivate::processMouseEvent(QWindowSystemInterfacePrivate::MouseEvent) (46 x)
    [41]  2,138,198,713,456 0x000000000525ce00 (47 x)
    [42]  2,138,207,133,489 QApplicationPrivate::processWindowSystemEvent(QWindowSystemInterfacePrivate::WindowSystemEvent*) (54 x)
    [43]  2,138,207,149,986 0x000000000527f9e0 (55 x)
    [44]  2,138,198,758,614 QWindowSystemInterface::sendWindowSystemEvents(QFlags) (98 x)
    [45]  2,138,207,170,943 0x0000000009387590 (99 x)
    [46]  2,138,214,384,853 QUnixEventDispatcher::processEvents(QFlags) (99 x)
    [47]  2,138,202,168,869 QEventLoop::processEvents(QFlags) (98 x)
    [48]  2,138,214,389,585 0x0000000007d8a820 (99 x)
    [49]  2,138,214,393,930 QEventLoop::exec(QFlags) (1 x)
    [50]  2,138,214,394,913 _dl_runtime_resolve_xsave (1 x)
    [51]  2,138,214,394,920 0x0000000007d89e50 (1 x)
    [52]  2,138,214,398,749 QCoreApplication::exec() (1 x)
    [53]  2,138,214,399,613 _dl_runtime_resolve_xsave (1 x)
    [54]  2,138,214,399,620 0x0000000005274480 (1 x)
    [55]  2,138,214,400,492 QApplication::exec() (1 x)
    [56]  2,138,214,400,494 0x0000000000112c80 (1 x)
    [57]  2,138,652,121,489 main (1 x)
    [58]  2,138,652,305,048 (below main) (1 x)
    [59]  2,138,652,305,060 _start (1 x)
    [60]                  . 0x0000000000001100
 Frame:                Ir Backtrace for Thread 2
    [ 0]            154,047 poll (3218 x)
    [ 1]            161,333 0x00000000094b1560 (3219 x)
    [ 2]          7,224,790 0x000000000000bb80 (3027 x)
    [ 3]          7,750,239 xcb_wait_for_event (2874 x)
    [ 4]          7,953,840 0x0000000009385040 (2875 x)
    [ 5]         38,695,270 QXcbEventReader::run() (1 x)
    [ 6]         38,722,486 QThreadPrivate::start(void*) (1 x)
    [ 7]         38,724,320 start_thread (1 x)
    [ 8]                  . clone

Here is the code:

38 Downloads

28 Downloads

The Qt version is a CMake project. I developed this on Ubuntu 20.04 LTS using Qt version from the repo. My CopperSpice build was current as about a week or so ago.

Under Qt I got what I expected. Sub-second list build time.

Qt test run

Under CopperSpice, not so much.

CopperSpice Test Run

Okay, so it was only 15 minutes and change, not 16 minutes. Are you going to quibble about that last minute when an application has already hung for more than 10?

This isn’t a random bogus test. I was porting the Gede debugger to CopperSpice and the text file included in the source directory for use in testing contains a gdb response. (Test application looks for text file in $HOME.)

(gdb) 
 -file-list-exec-source-files

In one giant formatted string gdb sends back all of the source files relevant for this debug session. Not the files themselves, but the full path and name. Any front end needs to be aware of what is in play. Gede (and I imagine most gdb front ends) tokenize this stuff so they can deal with it. You will see the tokenize() routine is basically the same. The CopperSpice version has some additional qDebug() calls so I could actually see something.

There appears to be something horribly inefficient in either QString8 or QList.

[ 0]    180,647,087,012 __gnu_cxx::__normal_iterator > >::base() const (25806726717 x)
    [ 1]    903,232,403,521 bool __gnu_cxx::operator!= > >(__gnu_cxx::__normal_iterator > > const&, __gnu_cxx::__normal_iterator > > const&) (25806640101 x)
    [ 2]  1,935,503,204,513 long CsString::utf8::distance<__gnu_cxx::__normal_iterator > > >(__gnu_cxx::__normal_iterator > >, __gnu_cxx::__normal_iterator > >) (86616 x)
    [ 3]  1,935,515,065,506 CsString::CsBasicString >::size() const (86605 x)

I haven’t noticed a huge problem with QString8 so I’m assuming the issue is in the QList code. When working on my Diamond port I noticed really sluggish behavior with syntax highlighting that was using lists of regular expressions.

Hopefully this somewhat compact example will help the team track it down. I will try to update this post tomorrow with the completion screen shot. Assuming it completes overnight.

You can read more about my CopperSpice experiments in this post thread.

Edit – 2021-02-24 – 4:30am

I love Callgrind and Valgrind. You run something under them and days later it gets done. As of 4:30am this morning this appears to be no closer to completion. I do have newer stats at the top to share.

roland@roland-HP-EliteDesk-800-G2-SFF:~/sf_projects/ListTest_debug$ callgrind_control -e -b
 PID 81466: ./listtest
 sending command status internal to pid 81466
 Totals:               Ir 
    Th 1  5,208,933,012,211 
    Th 2         47,402,648 
 Frame:                Ir Backtrace for Thread 1
    [ 0]  1,977,743,839,393 bool __gnu_cxx::operator!= > >(__gnu_cxx::__normal_iterator > > const&, __gnu_cxx::__normal_iterator > > const&) (56506966840 x)
    [ 1]  4,238,033,891,625 long CsString::utf8::distance<__gnu_cxx::__normal_iterator > > >(__gnu_cxx::__normal_iterator > >, __gnu_cxx::__normal_iterator > >) (189644 x)
    [ 2]  4,238,059,867,454 CsString::CsBasicString >::size() const (189633 x)
    [ 3]  4,238,039,415,000 QString8::size() const (189632 x)

Copy on Write

While I’m holding out hope this is just a horrible bug in QList, my gut tells me this is really do to a fundamental change between Qt and CopperSpice. Qt has a “copy on write” policy implemented under the hood for QObject based things. I haven’t looked at the actual code in quite some time but here is the gist of it.

When you assign QObject based entity A to entity B there is just some reference magic manipulated under the hood. There is no need to actually copy the values until something alters entity A or B. This makes things very fast.

You can also take a big hit if there happens to be 100+ things referring to this particular value instance when it needs to change. Think a working/scratch object you load a “default” value into from some external source then use to initialize a hundred element list. Not an integer, but a substantial object like a page of text or a QPixmap image. Later on in the code your working QPixmap needs to load a different image. That image pays the price. A hundred copies now have to be made before the first image change can happen. The second image change pays no such price so it is very fast. Beginners chase their tail trying to figure out why the first one takes so long. This is why.

Improve Battery Life

This minimal use of dynamic memory is incredibly important today in battery operated embedded systems. Many of them have lower powered slower everything inside them because they are trying to add minutes or hours to battery life. You have only two options when it comes to extending battery life.

  1. Use less juice
  2. Buy a bigger battery.

Option one is cheaper and doesn’t add pounds to the device. That’s the route most device manufacturers try to go first. The result of this is you can be using the exact same Linux kernel and exact same compiler for your target device and dynamic memory allocation will suck.

The XML Trap

A former coworker who became manager didn’t believe me when I said this. All of his embedded systems work had been gambling machines connected to grid power. Manufacturers spared no expense in speeding those devices up and they didn’t give a hoot about how much juice it took. The manager who would be retiring was just nodding his head in agreement with me as I spoke this.

We had to preload our images on that device. Multiple images for one screen came in a single SVG file and you loaded them via their tag. SVG is really XML so it isn’t such a big deal. What was a big deal was someone deciding that code needed to be re-ordered so it loaded images alphabetically instead of a screen at a time. I think the power-on boot time climbed to something like 45 minutes. You didn’t power off unless you had to.

Re-ordering to load all images from a single file before moving on to the next trimmed our boot time down to three minutes. We obviously still had a ways to go, but it was now obvious where to look. Remove all needless dynamic memory allocation. Every time we opened an SVG file a plethora of XML class objects had to be dynamically created and fully populated. When we closed the file these all had to be reclaimed.

Dynamic memory allocation in most/many battery operated embedded systems sucks. Know this. Plan for it. Your life will be happier.

Desktops Used to Suck

NCR PC4

The NCR PC4 was the first computer I ever purchased for myself. A starving college kid, but I saved up the $1700+ to buy one. A whopping 4.77 MHZ and dual floppy disk drives with a ghosting green phosphorus monitor. Tipping the scales around 50 pounds if memory serves.

I didn’t play with Qt until Qt3 was out and then it was near the end of Qt3 waiting for Qt4. I’m telling you this so you can understand why copy on write was implemented.

  • Intel released the Celeron 533 MHz with a 66 MHz bus processor on January 4, 2000.
  • AMD first released the Duron processor on June 19, 2000, with speeds of 600 MHz to 1.8 GHz and bus speeds of 200 MHz to 266 MHz. The Duron was built on the same K7 architecture as the Athlon processor.

None of you would consider buying one of these computers today, but they were smoking fast when they were released. All kinds of hardware and OS level techniques were developed over the years to avoid actually having to transfer significant amounts of memory because we had limited abilities.

Don’t try and tell me that isn’t how things are now.

Remote controls

Just how much processor and RAM do you think is in one of these? How often do you have to change the 2-3 AA or AAA batteries powering them?

You see, production programmers used to control the various language specifications. They were interested in solving real world problems not pursuing pie-in-the-sky-once-in-a-blue-moon topics.

Enter the Academics

More and more production coders had to devote more time to production coding. Academics wriggled into the controlling positions of language standards. I always picture them dressed in a white robe of some kind, sitting cross legged, palms skyward, head tilted back, chanting “Oooooohhhhhhhmmmmmm” as they try to become one with the object.

While some good things came out of this initially, lots of bad is the most recent contribution. You can read the discussion about MVVM in my latest book to understand one of the more recent tragedies. At some point the academics took it upon themselves to declare copy-on-write a violation of the C++ standard and forbid it.

Under the hood CopperSpice (per the documentation and Web site) adheres to the C++ standard. My gut tells me this is the root of the beyond sucky performance our test application exhibits. Under a second for Qt and 15+minutes for CopperSpice.

Let’s Look at a Few Things

QList<Token *> MainWindow::tokenize( QString str )

Officially, assuming my memory is accurate, this statement requires both the parameter and return value to be passed by value. I don’t read the standard on a daily basis, I’ve been working with Qt a long time. Anyone who has will have a large base of code looking much like this. If you are thinking of porting this is going to be a major issue.

input file size

According to the ls command, the size of the input txt file is 291K. That’s in raw UTF-8. Things are in QChar32 under the hood in CopperSpice so we can agree that it is more than 300k in memory.

 QChar c = str[i];

Every time we do this we have to make a copy of the character. (The alternative would be dereferencing each and every place you find c being used in the code and hoping the compiler could optimize.)

 cur->m_text += c;

Every one of those statements is going to cause some reallocation. I can’t find it now, but I swear to God I remember reading about a compile/build option when building Qt from source where you could set the default initial allocation of all QString objects. It would be like you called reserve() every time you declared a QString.

I can’t find the parameter/option now. I ran into it on one embedded system project. Really sucky dynamic memory allocation had us looking for solutions. We knew that 15 characters would handle about 80% of the strings in the system. We had the RAM, it was the allocation price that was killing us. Forcing a reserve(15) into every string under the hood breathed new life into the application.

Due to C++’s type system and the fact that QString is implicitly shared, QStrings may be treated like ints or other basic types. For example:

https://doc.qt.io/qt-5/qstring.html

I’m going to steal this example from that doc as well.

QString Widget::boolToString(bool b)
 {
     QString result;
     if (b)
         result = "True";
     else
         result = "False";
     return result;
 }

The result variable, is a normal variable allocated on the stack. When return is called, and because we’re returning by value, the copy constructor is called and a copy of the string is returned. No actual copying takes place thanks to the implicit sharing.

https://doc.qt.io/qt-5/qstring.html

Yes, you can find raving discussions against CoW (Copy on Write as they use it) on places like StackOverflow. But there is now nearly 30 years of code treating objects like integers when passing things around because of the performace CoW gives us.

If you are trying to port an existing codebase to this library you are going to have to redesign. It may be the least painful to make compile and run, but it won’t run well with gotchas like these.

From the CopperSpice Web site

Implicit sharing is being purged from CopperSpice. You will note that QString and QList aren’t in the list of classes still supporting implicit sharing if you click that link.

Killed off Callgrind

It was 8:50am and still not done. If I let it run 24 hours my elapsed time thing wasn’t going to work.

I really wanted to get to the bottom of this. Everybody emptying their colons on the CoW way of life claimed you could solve all performance problems via reference. Just how much truth is there to such a wild claim?

changed method

I changed the method to be void, accepting the list to be created and the string to create it from via reference. Got a clean ninja install and ran. Removing these two copies should dramatically improve the time if there is any truth to that argument.

After changes

Yeah, 32 seconds slower according to that run. Next I replaced every list.push_back() call with:

list.append( cur );

Just in case this wasn’t as direct a replacement as the documentation says.

after change to .append()

Possibly saved 5 seconds or there was just less going on.

Yes, you are correct. I should have looked for a better profiling tool and spent days learning how to use it if I wanted to dig all the way to the bottom of this problem.

Having said that, there isn’t much going on here. If the Zen Masters are correct and this Holier more Pioused C++ way of doing things really is better, I should be able to change just one thing and see this drop down to at least under a minute, right?

Next I changed all of the cur->m_text += statements to be:

 cur->m_text.append( c );

Naturally the ones with hard coded characters kept their hard coded characters. Why take these baby steps? I want to get to the root cause without going into the CopperSpice code itself. In this case it would be going all the way into GNU if these really are wrappers on the STL. I wondered if there weren’t a few instances of bad code decisions adding to the performance issues.

Overloads could have caused confusion
Shaved another 12 seconds

You would be correct in believing that should not have fixed anything. Obviously the appropriate method wasn’t always being called.

Next I wanted to try reserve() but that wasn’t pious and holy enough.

I really don’t know what would have been the difference to leave reserve() in and just change the meaning of the parameter from bytes to numOfChars. That appears to be what they did with resize().

token.h change

Changed the Token constructor to pre-extend m_text by 50 characters. If the needle wiggles enough we can poke around looking for “magic size.”

It should be painfully obvious by now that one cannot fix this problem from the outside. Obviously resize() isn’t well implemented.

After resize tweak

So, we take the resize() code out of Token and just for grins add it here:

Try pre-extending the QList
No meaningful difference

Summary

This is a substantial problem. Most likely it is why I hear from so many who have tried CopperSpice that it is slow. I certainly noticed it working on my fork of Diamond.

While CoW may be considered something only the great unwashed masses use, the great unwashed masses actually get something done. When reserve() was removed from QString, QList, and wherever else, I don’t think it was really understood how those benefitted the community. Reserve allocated a bunch of empty units that append(), insert(), etc. would use until they were exhausted. This let you stall off the overhead of dynamic memory allocation for every object added. You could increase performance by knowing the size which could handle roughly 80% of your needs.

Judging from the elapsed time of the tests, it is pretty obvious that resize() doesn’t create this nice big contiguous buffer you can use. While this system isn’t a blinding make-the-gamer-kids-drool machine, it’s not a slouch.

Test system

There has been talk for years about Qt just becoming a wrapper over STL containers. I’ve heard talk on the mailing list about “Why should we spend the resources to maintain that which is already there?’

This would be the argument!

The STL does not perform CoW. Until there is some exponential speed improvement in STL containers they are not viable solutions once you get beyond trivial accademic applications.

This wasn’t even that difficult of a task! Creating a list of tokens from a string that is probably 500K in memory. Those looking to leave Qt need to take this sample application and re-implement it with whatever cross platform framework they are thinking about using.