Live judging retrospective

It's been a few years since we started working on live judging for yoyo contests, together with Slusny and CYA. Since we haven't shared much about it publicly and there were some questions about it at the last CYA general meeting, I figured it'd be nice to summarize what has happened so far.

Early attempts 🔗

The idea is quite old. Displaying results live during the contest is quite common in sports, so naturally there was some interest in trying to do the same thing for yoyoing.

Some early attempts happened at fun contests where the stakes are not so high, e.g. at yoyocamp. This usually meant just filling in the sheet in real time and showing the results with a projector, or some "on-paper" method with an evaluation-only system, where someone would manually update the results on a board.

However, this already showed some problems with such an approach. The current system normalizes technical scores based on the highest click score: the highest score is given 60 points and all other scores are scaled by the same factor. If you show results from our system before the contest ends, they are quite confusing to look at, because the scores of all players keep changing during the whole contest and the results table keeps shuffling. More specifically, every time a player beats the technical maximum, all other technical scores have to be recomputed.
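To make the shuffling concrete, here is a minimal sketch of the normalization described above. It assumes a single judge and one raw click score per player; the function name, player names and numbers are made up for illustration and are not taken from the real system.

```python
MAX_TECH_POINTS = 60.0  # the highest click score is scaled to 60 points


def normalize_tech(raw_clicks):
    """Scale raw click scores so that the current best score gets 60 points."""
    best = max(raw_clicks.values())
    if best <= 0:
        # Nothing to scale by; treat everyone as having zero tech points.
        return {player: 0.0 for player in raw_clicks}
    return {
        player: clicks * MAX_TECH_POINTS / best
        for player, clicks in raw_clicks.items()
    }


# Hypothetical numbers: two players have finished their freestyles.
scores = {"player_a": 40, "player_b": 50}
before = normalize_tech(scores)  # player_a: 48.0, player_b: 60.0

# A third player beats the current maximum, so everyone is rescaled.
scores["player_c"] = 80
after = normalize_tech(scores)  # player_a: 30.0, player_b: 37.5, player_c: 60.0
```

In this toy run, player_a drops from 48 to 30 points without doing anything, which is exactly the kind of change that makes a live table confusing to follow.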

First serious attempt 🔗

We started working on this seriously for CYYN 2018, and we went all in. We wanted to do everything in real time, so Kuba Kroutil from Slusny fitted our mechanical clickers with electronic switches and connected them by cable to a Raspberry Pi (a tiny computer). I implemented a simple web application that ran on it and showed the results in a real-time web UI. We could even show the results changing in real time during the freestyle, which was pretty cool. Judges used the clickers and then confirmed the scores in the web UI on their phones. Kuba even 3D printed a case with the CYA logo on it!

Electronic clicking device

How to fix normalization? 🔗

The big question was how to solve the problem with normalization. This was something we wanted to address because early attempts showed that with the current system and results constantly shuffling, this just wouldn't work.

We debated this quite a lot. It was clear that we would somehow have to modify the current system so that it doesn't depend on future scores.

One option would be to not normalize at all. I think some form of that would be interesting, because normalization creates a bunch of other problems and makes the results more difficult to interpret for yoyo players. But for that, we would need to make sure that every judge clicks roughly the same way, which is an even more difficult problem than the one we started with.

We also didn't want to change the current system too much, because people practice for the old system, so we shouldn't just change it to something completely different out of the blue.

Normalization before the contest? 🔗

In the end, we decided to go with a slightly modified system which works around the limitation. The problem is that we can't normalize the results until we know the highest technical score for each judge. This highest score gives us the "coefficient" for each judge that we then use to multiply all their scores.

So what if we instead tried to measure this coefficient beforehand? We could let each judge click a few freestyles of similar level before the contest, compute the coefficient from that and then use the coefficient for the new contest. With this scheme, tech scores can be computed right after the freestyle, and then they never change again, which is what we wanted.
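A minimal sketch of this idea follows. The article doesn't say exactly how the coefficient was computed from the pre-contest freestyles, so this assumes the simplest option: scale the best calibration freestyle to 60 points. All names and numbers are hypothetical.

```python
TARGET_TOP_SCORE = 60.0  # assumed target for the best calibration freestyle


def measure_coefficient(calibration_clicks):
    """Coefficient for one judge, from freestyles clicked before the contest."""
    best = max(calibration_clicks)
    if best <= 0:
        raise ValueError("calibration needs at least one positive click score")
    return TARGET_TOP_SCORE / best


def tech_score(raw_clicks, coefficient):
    """Tech score that depends only on this freestyle and the fixed coefficient."""
    return raw_clicks * coefficient


# Hypothetical calibration: the judge clicked three freestyles before the contest.
coefficient = measure_coefficient([38, 45, 50])  # 60 / 50 = 1.2

first = tech_score(40, coefficient)   # about 48 points, final right away
second = tech_score(55, coefficient)  # about 66 points; `first` is unaffected
```

Note that the second score ends up above 60: with a fixed coefficient, the tech score is no longer capped by the best freestyle of the contest.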

There's a lot more to it, though. If you are interested in more details, I wrote more about it in the second appendix below.

Going live on CYYN 2018 🔗

We tried this system on a smaller contest (HYC 2018) and it worked pretty well, so we decided to polish everything and use it for CYYN 2018. This was a pretty stressful experience for me. I remember finishing the last bits of the system at almost 4AM the night before. It was pretty crazy. The whole system felt just incredibly hacked together, and I was worried the whole time that something critical would break and this would be a giant failure.

In the end it turned out really great. Live judging was nicely integrated into the whole contest. Updated results were shown on stage after each freestyle, and everybody knew their score right after the judges confirmed it. Vashek was a big proponent of the system, so as a speaker together with Shpek, they both made sure to show and discuss the results on stage after each freestyle. This made live judging a key part of the contest. I don't think we could have wished for a better result: this contest showed very well how live judging can work at its best. If you want to get a sense of the atmosphere, I recommend watching a few minutes of the finals live stream on fb.

In retrospect, I'm really glad we did this, because it gave us a huge amount of insight, even though I think we risked way too much, and I wouldn't do it the same way today. There were a lot of technical challenges and many things could have gone wrong at any time. We had problems with clickers and with the web app. Judges had to connect to a special Wi-Fi to access the system - sometimes that didn't work, and some phones couldn't connect to it at all. We had to manually kick people out to keep the Wi-Fi operable. There were a lot of potential failure points during the judging process - there was no access control, so people could accidentally overwrite each other's scores. Each clicker pair was also tied to a specific judge, so we had to match judges with their clickers. That was sometimes problematic, especially when some clickers stopped working and we had to replace them: you had to remember where to plug in the new one and where to set it up in the system. Really just a lot of little things you had to do correctly, otherwise something broke. Looking back, it feels like a miracle that nothing serious happened that day. We also didn't anticipate how much more time this would take, so the contest ended up running super late.

As for the normalization scheme, it was a mixed result. It turned out that our measurements weren't exactly perfect. A big part of it was that people just judge differently at the contest than on a video before the contest. In the end, the effect was small and the results were not that different from what we would get with the old system, but I still felt that we would need a lot more work to make the system reliable enough.

An epic failure 🔗

We took our lessons and used this system at the next small contest in Brno. But sadly, what didn't happen at CYYN happened here. So many things failed that we had to stop judging live, write scores on paper and somehow hack them together without the system. This also taught us that it's a really good idea to have a backup and not rely on the electronic system too much.

Shadow judging on EYYC 2019 🔗

EYYC 2019 happened in Ostrava. For this contest, I rewrote the system a bit to address some problems. Previously, we did everything on the Raspberry, which was a bit difficult to work with, because whenever something failed, I had to connect to it and dig the scores from it somehow.

To be quite honest, I rewrote the system so many times that I don't even remember when we used which version, but I remember that we either had an external Wi-Fi to which everybody connected (even the Raspberry, which served the webpage) or we had the Wi-Fi on the Raspberry itself. Either way, there was some kind of local private network which everybody connected to, and we gave everybody a link (an IP address) to the webpage where they filled in the scores.

The change at this contest was that I ran the system on my computer instead and let everybody connect to my hotspot. One problem with the previous setup was that judges accidentally stayed connected to the Wi-Fi for too long, even after their division ended, so new judges who wanted to connect couldn't. Here I had the freedom to limit the number of users and kick them out if they were not supposed to be connected. Phones connect to known Wi-Fis automatically, so we changed the Wi-Fi name and password to match the division, and gave the password only to judges who were supposed to be connected. This also limited the number of people who could interrupt the system or accidentally change scores.

Because our new normalization scheme wasn't such a big success, the electronic system wasn't so stable yet, and we didn't want to risk too much at EYYC, we decided not to use live judging here and only used the system to collect scores.

A big promise of the electronic system is that it reduces the amount of work needed after the scores are in - the scores are already in the computer, so we don't need to transcribe them manually. We didn't trust the system that much yet, so we also let judges write the scores on paper, to have a backup in case something went wrong. When the scores are in, we can just check them against the paper sheet, which is much faster than typing them into a computer manually in the first place.
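As a rough illustration of that check, here is a small sketch that lists every score where the system and the paper sheet disagree. The data layout (a dictionary keyed by judge, player and category) is an assumption made for this example, not how the real system stores scores.

```python
def find_mismatches(system_scores, paper_scores):
    """Return (key, system value, paper value) wherever the two sources differ.

    A value of None means the score is missing from that source.
    """
    mismatches = []
    for key in sorted(set(system_scores) | set(paper_scores)):
        in_system = system_scores.get(key)
        on_paper = paper_scores.get(key)
        if in_system != on_paper:
            mismatches.append((key, in_system, on_paper))
    return mismatches


# Hypothetical scores keyed by (judge, player, category).
system = {
    ("judge_1", "player_a", "clicks"): 52,
    ("judge_1", "player_b", "clicks"): 47,
}
paper = {
    ("judge_1", "player_a", "clicks"): 52,
    ("judge_1", "player_b", "clicks"): 41,
}

to_review = find_mismatches(system, paper)
# [(("judge_1", "player_b", "clicks"), 47, 41)]
```

Only the disagreeing entries need a human look, which is where the time saving comes from.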

Big shift 🔗

EYYC 2019 was the last contest before a pretty big shift in the way we did things. There were two things we struggled with constantly:

We had a ton of problems with electronic clickers. They were pretty unreliable, sometimes they overclicked or underclicked. Sometimes you finished a freestyle and there was a 200-click difference between the mechanical clicker and the value in the system. Even though we bought a bunch of high quality cables, they kept breaking. During this EYYC, almost all electronic clickers eventually stopped working.
Another frustrating problem was with the custom Wi-Fi network. The reason we initially did it this way was that we wanted the system to work even without the internet, because at contest venues, internet can be pretty flaky, and we didn't want to depend on it. It turned out, though, that this setup was problematic in a bunch of different ways.
1. As I mentioned, judges often overcrowded the Wi-Fi, and sometimes they couldn't connect or couldn't even find it.
2. It didn't help that this Wi-Fi had no internet, because judges kept switching between Wi-Fis to actually access the internet for other purposes, which increased the likelihood that they wouldn't be able to connect next time.
3. Phones try to be smart about Wi-Fi, so some of them don't even connect to access points without internet. Similarly, they keep switching to other access points automatically if they find them better (stronger signal or better internet).
4. Last but not least, running the system on my computer wasn't exactly ideal either. Instead of depending on the Raspberry being operational, we now relied on my computer running correctly and serving the hotspot whenever we needed it.

This system just created a ton of failure points that we were constantly fighting, instead of focusing on the contest.

Making a better system 🔗

With all these lessons in mind, we decided to rewrite the system again with the following properties:

No electronic clickers

This was the saddest part to me, but we had to do it. Clickers were unreliable, and we didn't need them, because we were already showing only the final score anyway. We tried to show scores during the freestyle at the first contest, but quickly abandoned it because it wasn't interesting. I wrote more about why in Appendix A.

Just put the thing on the internet

Without electronic clickers, we didn't need the Raspberry, and I had already rewritten the system to run on my computer, so we could just put it on a server on the internet. It turns out that instead of trying to make the no-internet setup work, it's much easier to ensure the internet is available - the worst case is that you provide a hotspot for judges.

Show only evaluations and major deductions live

To show the tech score live in a meaningful way, we would have to work more on the new judging system which avoids normalization. This adds overhead and a lot of potential problems. It also makes the system much more difficult to adopt for other contests. We decided to use the old system and show live only the categories which are not normalized. We then make the full results with the tech score public at the award ceremony. This gives us some time to check and fix potential problems in case something goes wrong. We didn't have this option with the fully live system.

This is also a good middle ground for people - some players didn't like that with live judging, we lose the surprise at the award ceremony. Even though I consider this critique to be on the same level as criticising the iPhone for not having buttons, we can have both now, which is nice. We have live evaluations, so we get a rough idea of the results, but we get the full results in the end.

I think we should have started with this in the first place. These options are much simpler than what we started with, and they would have allowed us to focus on more important things, like how to present the scores and results to people at the contest or how to make judging seamless and less error-prone.

Current state 🔗

Since I wrote this new system, we have been using it at almost every contest, and it has become a standard part of the process. Even if we don't show the results, we still use it to collect scores. There are still some rough edges, but with every contest I try to improve it a little. Overall I think this approach paid off and this system is way more reliable than the previous ones.

Technically, it's a pretty standard website now. It lives on yoyojudge.com, and many of you have probably seen it already, especially after CYYN 2021, where we started sharing links to live results online, so now you can look at the live results even on your phone.

That said, I don't think we've used the system to its full potential yet. Part of it is just Covid: a lot of big contests were cancelled. At most other contests it was more in a testing mode and either not visible, because we didn't have a projector, or it was somewhere a bit hidden and people didn't pay much attention to it. It only got a bit better after we made the results accessible online at CYYN 2021. We also had a TV on the side, where it was visible.

I am curious how much people paid attention to it or how much they accessed the online version. I've seen Ann mention it in her post-contest video, which was really nice to see, because we haven't got much feedback since CYYN 2018, except for feedback from judges who use it a lot. It's a good overview of how the system works and how people use it, so I recommend checking it out.

Overall I am happy with how it works now, but I'd like to use it in a more fundamental way, and make it a key part of the contest experience again. I believe we haven't done that properly since CYYN 2018.

This system has its own set of problems too, though.

We rely on judges having a usable phone. Judges need to access the website on the internet. This usually means that they have to have their phone with them, it needs to be charged, they probably need a charger, and we have to provide them with outlets. We already had a problem with this a few times, when someone forgot their phone at home or couldn't use it for other reasons. This is still much better than what we had before, but sometimes still problematic.
The system ended up too complicated for what we need. Too many things are configurable, and that makes it more difficult to use for admins who set it up (usually me). There's also some technical debt, which makes it hard to change now. This is not ideal, but fixable over time.
The current system is more difficult for judges. At the moment, we don't trust the live judging system enough to get rid of the paper sheets. This makes the process a bit tedious, because judging is already short on time, especially evaluations. As a judge, you now have to quickly decide on your numbers, unlock your phone, type in the numbers, confirm them and write them on paper, all that in a noisy environment while the speaker is bugging you with "judges ready?". There's really no time to make a mistake.
It doesn't interface well with the other things needed for the contest - for example, we can't print sheets and orderings from it, so we still need some manual steps and have to keep the data synchronized with a spreadsheet elsewhere. Again, fixable, but needs more work.

These are more like paper cuts at this point, though, and the system is generally in a pretty good shape in my opinion.

Conclusions 🔗

Some key takeaways from all of this:

If you want to do live judging, it really has to be well integrated into the contest. This is extremely important, way more than I first expected.

You can't expect people to care if you put some table on a screen on the side. It really needs attention. It must be clearly visible and make sense to people, and the speaker has to talk about it, bring some attention to it and explain what it means. This makes a huge difference. Even after 4 years, the best this has ever worked was at CYYN 2018, because we really made sure it was a big part of the whole thing.

A simpler judging system would work better for live results. Evaluations in the official IYYF system in particular are way too vague and there are too many of them, so they are hard to grasp in a few seconds after each freestyle. As a speaker or spectator, you either have to treat them as "just numbers" or focus on a single value, otherwise you don't get much out of it in such a short time. This is better if the results are somewhere on the side or on the internet and people can study them in detail asynchronously. Making evaluations simpler also helped; we did that a few years ago.
Results are done faster if you use an electronic system to collect scores. This is a big improvement and the reason why we use the system at every contest, even if we don't show the results live at all.
Fully live results are hard. Live means no time to fix problems, so everything has to work perfectly. On top of that, our judging system is problematic, so we can't use it as is.
A fully live system with electronic clickers and a no-internet setup is even harder. We should have started with something much simpler and focused on the presentation. We can always go "more live" later. I don't think we even needed to build a custom system for this at first; we could have started with a shared spreadsheet and gone from there.
Even though the system is automatic, it doesn't make administration much simpler. I have more thoughts on this in Appendix C.

Overall, this was a great experience, even though it was very difficult sometimes, and I am glad we went through this. We learned a lot, and I hope we will improve the contest experience with it even more in the future.

There's a lot more to talk about, but this is already too long, so I'll end it here. I have left some more detailed thoughts below, if you're interested. I'll probably write about related things in the future, too.

Appendix A: Are electronic clickers worth it? 🔗

Even though I was very excited about having all the data from electronic clickers, namely timestamp for each click, it turned out so problematic that we hardly ever used it. At first, we thought we could use this data to update the player's position in the results during the freestyle, but that didn't really work for a few reasons:

People don't look at the table, they look at the freestyle
It turns out that the player's position in the results is not that interesting during the freestyle. The position is usually very low, because the player doesn't have eval scores yet. But even if you don't count evals and only compare by tech, the player usually spends the vast majority of the time in some low spot that is nowhere near their final position, and there isn't really anything that interesting to look at. Only in the last 15 or so seconds does the player make a big leap to their final position.
Another part of it is that it's hard to keep track of both the freestyle and the table and somehow meaningfully connect them to get something out of it. You'd have to look at both at the same time and keep track of which move corresponds to which score change, which is delayed. And as mentioned in the point above, the vast majority of the time, you just see a number at the bottom of the table, and the "interesting jump" to a higher position almost never happens.
The clickers often didn't work reliably, so the data wasn't that useful anyway, because it contained a lot of noise. This could be fixed with more work, though.
And lastly, again, using it requires us to change our judging system in a pretty complicated way, and I don't think it's worth doing that.

The only glimpse of the data is in a few CYYN 2018 freestyles, but even there, it hardly provides any useful insight, even though it looks cool. See for example Michael Malík's winning freestyle.

To be honest, I was a bit disappointed with this result, because I thought it'd be amazing to get all this new knowledge about clicks. I still believe there's a lot of potential, but I think we would need to work hard to get something out of it and figure out how to present this data in an interesting way.

For a motivating example, have you ever watched a freestyle where the music was so quiet that you could hear the clickers? And when the player hits a banger, you suddenly hear this avalanche of click sounds. That's something really cool to me, and I think it proves that this data has some potential to enhance the perception of the freestyle.

One incredibly valuable thing about this is that you get immediate feedback on which tricks score, on a per-trick basis. This could make the contest more interesting for less experienced spectators, who otherwise have no way to understand how much each trick is worth. But it could also be valuable feedback for competitors, who usually have to guess this only from the final score. ...And yoyo players in freestyles sure do a lot of tricks that don't score these days, so they could use something like that.

By the way, one interesting problem was that it's kinda hard to match clicks with the freestyle. After the contest, you just have a bunch of timestamps saved, but you don't know when exactly each freestyle started - that is not saved anywhere. The correct timestamp for that is when music is started, but this is not connected to the system at all, so we always had to guess it somehow.
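Once you have a guess for when the music started, assigning clicks to a freestyle is simple. The sketch below assumes the click timestamps are stored as a sorted list of seconds and that the start time and freestyle length are supplied by hand; none of these names come from the real system.

```python
from bisect import bisect_left, bisect_right


def clicks_for_freestyle(click_times, music_start, duration):
    """Return click offsets (seconds since music start) within one freestyle.

    `click_times` must be sorted in ascending order. `music_start` is the
    guessed timestamp of the first beat, since the system doesn't record it.
    """
    first = bisect_left(click_times, music_start)
    last = bisect_right(click_times, music_start + duration)
    return [t - music_start for t in click_times[first:last]]


# Hypothetical timestamps in seconds; the freestyle is guessed to start at 120 s
# and lasts 180 s, so only clicks between 120 s and 300 s belong to it.
click_times = [100.0, 130.5, 131.0, 200.0, 400.0]
offsets = clicks_for_freestyle(click_times, music_start=120.0, duration=180.0)
# [10.5, 11.0, 80.0]
```

The hard part remains the one described above: `music_start` is a guess, and every offset is only as accurate as that guess.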

Appendix B: Does the new normalization scheme really work? 🔗

The scheme relies on the premise that judges don't change too much how they click and that the contest skill level will roughly be the same as the contest we used to measure the values.

The downside is that now everything depends on this coefficient - if the coefficient is somehow wrong, our results are also wrong. Another downside is that tech/eval ratio is no longer fixed, because tech is no longer normalized to have a max value - it's only roughly scaled to match the ratio we want, but in principle it's not bounded.

Another downside is that this method is not as easily portable between contests of different skill levels as the current system is - either you have to accept that the tech/eval ratio is different, or pick a different coefficient for each contest.

A big problem with picking a coefficient for the tech/eval ratio is that it's basically a judgement call - there's no clear method for picking the coefficient, so you have to eyeball it based on how big you expect the numbers to be. This is relatively easy for popular categories that don't change much over time, like Czech 1A, but it's much more difficult for small categories, like Junior 1A or 4A, because there isn't much data and the skill level changes by a big factor just based on who attends. Trying to tune this new system to match what the current normalization scheme does is very difficult. Combined categories like X division are borderline impossible.

For all this sea of problems, though, we get a very nice property in this system - once you get your score after the freestyle, it never changes during the whole contest. This makes the results much less confusing to look at and enables players to reason more clearly about their scores during the contest. Now your position in the final table isn't randomly shuffled after each freestyle.

As I mentioned in the main article, some coefficients turned out to be different at the contest, compared to what we measured before the contest. This meant that some judges had a higher weight on the final scores than others. Thankfully not that much. Another part was the judgement call I mentioned earlier. In retrospect, I think I picked too low a scaling factor for the tech/eval ratio, which resulted in evaluations having a bigger impact on the final score. I tried to run the numbers through the old system and didn't get much different results though. This was during Michael Malík's streak and his score was large in almost every category, so there was no doubt that he won in any system. People in lower places were a bit mixed up, but overall the difference was not huge.

Other categories felt ok, but sometimes the results were just a bit too weird - some too low, some too high. The sample size for pre-contest scores was so small that we can't expect to get a good coefficient for every specific combination of judges, players and styles. Thankfully, many of these divisions have big differences between players, so the results are not affected by these choices that much.

There's also an interesting effect in the system: changing the ratio of different result categories doesn't seem to affect results too much. Often you find out that even drastically changing the tech/eval ratio barely changes the results, and sometimes it doesn't change them at all. This gives a false impression that some categories are useless in the system, but I think this is a natural effect that happens over time as people optimize for the system. It'd be interesting to explore this effect a bit more.
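One way to explore this is to recompute the ranking under different weights and compare. The toy example below uses made-up numbers where the player with the higher tech score also has the higher evaluations; in that situation no positive weighting can change the order, which may be part of what is going on. The function and the data are illustrative assumptions only.

```python
def ranking(players, tech_weight, eval_weight):
    """Order players by weighted total, best first.

    `players` maps a name to a (tech, eval) pair of already scaled scores.
    """
    totals = {
        name: tech_weight * tech + eval_weight * evals
        for name, (tech, evals) in players.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical division where tech and evals agree on the order of players.
players = {
    "player_a": (60.0, 35.0),
    "player_b": (50.0, 30.0),
    "player_c": (40.0, 20.0),
}

tech_heavy = ranking(players, tech_weight=1.5, eval_weight=0.5)
eval_heavy = ranking(players, tech_weight=0.5, eval_weight=1.5)
# Both are ["player_a", "player_b", "player_c"].
```

Running the same comparison on real contest data, with a range of weights, would show how often the order actually changes.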

Appendix C: Live judging makes the administration more complicated in other ways 🔗

One promise of the electronic system was that it should make the administration on the contest easier. If you haven't had some experience with it, it might be pretty surprising to find out that there is a lot of judging/results related work that has to happen on the contest. So much so that there's usually at least one person who is doing this work almost non-stop for the whole duration of the contest (ever wondered why Sheda or Josefína were never seen much on Czech nationals or EYYC?).

A big part of it was always transferring numbers from sheets to the computer. This is now basically done immediately, so we only have to check if the numbers in the computer match the numbers on the paper sheet.

The problem is that live judging also adds a bunch of other steps to the process. While in the past, it was enough to just print the sheets for judges, now you also need to prepare the live judging system, set up players, rules and judges in it, send them invite links and then troubleshoot every problem they might have with it.

Also, with paper-only, it is easy to replace a judge if someone is missing (which happens surprisingly often). You just give the sheet to someone else. But with live judging, you need to set up the system again to add a new judge, or give the new judge the access that belongs to the missing judge. The latter is easier, but it creates a mess in the system and makes the data kind of worthless, because now there are some scores in the system under a different name.

This whole process used to be pretty well accessible to whoever could do it, but now it has become quite alien. To do it efficiently, you have to have access to a specific system as an admin, learn to set it up properly, and ideally you also need some of my custom tools to make administration easier. I am pretty much the only one who can do this effectively now. I'll have to take some time to make this process more accessible to other people (and automate more of it).

18 April 2022

#yoyo