Two weeks ago, I wrote a post to explain why it's fundamentally wrong to use the Average as a relevant parameter.
It's about 20-25 "screens" long and everything is explained in detail: link.
Surprisingly, I need to explain it even further.
Grab some coffee, it will be a long post. Again.
Data were taken from the Official Utopian Review Sheet
Date: 14. Feb - 21. Feb 2019
Number of translations: 25, not perfect for everything but more than enough to prove the point
How are the scores given?
The official Questionnaire consists of 6 questions:
- Accuracy
- Number of Mistakes
- How consistent the translation is
- Quality of post itself
- Legibility of translated text
- Number of translated words
What is the distribution of ranks for each category?
This is very much expected: a strong grouping around the recommended score, "Very Good".
This is the catch that I already explained in my previous post. A distribution can't be a normal distribution when there is "a wall" on one side, in this case No Errors and 1 Error. Teams GER, ITA and ESP had 3 translations (in total) with Ranks 3 or 4.
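A minimal sketch of why "the wall" breaks the normal assumption. The rank counts below are made up for illustration (they are not copied from the review sheet), and the names `error_ranks` and `BEST_RANK` are hypothetical. It only shows that a normal model built from the mean and deviation would expect values that cannot exist.

```python
from statistics import mean, median, pstdev

# Hypothetical Error ranks for 25 translations (1 = best possible rank).
# The counts are invented for illustration, not taken from the review sheet.
error_ranks = [1] * 13 + [2] * 9 + [3] * 2 + [4] * 1

BEST_RANK = 1  # "the wall": no rank can be lower than this

avg = mean(error_ranks)
mid = median(error_ranks)
sd = pstdev(error_ranks)

# A normal model would place about 2% of all cases below avg - 2 * sd.
lower_tail = avg - 2 * sd

print(f"mean={avg:.2f} median={mid} sd={sd:.2f}")
print(f"mean - 2*sd = {lower_tail:.2f}, below the wall: {lower_tail < BEST_RANK}")
```

With data piled up against the wall, the mean is pulled away from the median, and the "normal" lower tail lands on ranks better than Rank 1, which do not exist.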
How consistent were the translations? One may think that these are two separate questions, but this is a classic case of underlying factors. Accuracy is correlated with Errors and with Consistency. It's very intuitive, just as tall people tend to have long legs and long arms as well.
Of course, no surprise for Legibility. It's also correlated.
As you can see, we have 4 questions related to "quality" in our questionnaire. In reality, it's a single question.
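One way to check such a claim is a Spearman rank correlation between two questions. The sketch below uses only the standard library; the `accuracy` and `errors` lists are hypothetical ranks invented for the example, not the real review data, and the helper names are my own.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical ranks for 10 translations (made up, not from the review sheet).
accuracy = [1, 1, 2, 2, 2, 3, 1, 2, 3, 2]
errors = [1, 2, 2, 2, 3, 4, 1, 2, 3, 2]

rho = spearman(accuracy, errors)
print(f"Spearman rho between Accuracy and Errors: {rho:.2f}")
```

A value close to 1 for every pair of "quality" questions would support the reading that they measure one underlying factor.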
The only post that was not almost perfectly correlated was this post. Everything was excellent, but there were 6 mistakes (Rank 4).
Rank 3 or 4 in the Error Category usually means Rank 2 or 3 in Accuracy. I guess there were only typos in this post.
Maybe we should change something concerning the typos?
I personally don't like this system where everything is a mistake, and maybe this case is the perfect example of how to lose a lot of points for no real reason.
Quality of post tells us something interesting: there are two grouping points, Very Good and Sufficient.
Sufficient was only given by two teams, Dutch and Serbian. Maybe we are too harsh on our teams?
Word Count: only the Serbian team occasionally translates 2000+ words, due to Cyrillic/Latin translations. Nothing unusual here.
Short conclusion: there is nothing wrong with the Ranks (not Points, Ranks).
Ranks vs Points
The system of translating Ranks into Points is a bit odd:
- 0 negative points for Rank 1 (Excellent)
- a lot of negative points for Ranks 2-3
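To make the oddity concrete, here is a small sketch with a hypothetical penalty table. The numbers in `PENALTY` are assumptions chosen only to be non-linear; they are not the official Utopian values. Two review sheets with the same sum of Ranks end up with different Points.

```python
# Hypothetical negative points per rank (NOT the official values).
# The steps are uneven: 10, then 5, then 3.
PENALTY = {1: 0, 2: 10, 3: 15, 4: 18}

def total_penalty(ranks):
    """Sum of negative points for the ranks given on one review sheet."""
    return sum(PENALTY[rank] for rank in ranks)

# Two made-up sheets for the 4 "quality" questions, with the same rank sum.
sheet_a = [2, 2, 2, 2]
sheet_b = [1, 1, 3, 3]

print("rank sums:", sum(sheet_a), sum(sheet_b))
print("penalties:", total_penalty(sheet_a), total_penalty(sheet_b))
```

With a linear table the two penalties would be identical, so any difference here comes purely from the shape of the Rank-to-Points mapping.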
This is what the 25 scores look like when we calculate the sum of all 4 "quality parameters":
Pay attention! This is not a histogram, this is a bar plot!
I know it's a bit stupid to make a histogram with only 25 cases, but anyway...
There is "a wall" on the left side of the distribution for several questions, so this result was expected:
The lowest 5 scores were given in: Spanish Team (2x, different translations, different moderators), Italian, German and Arabic.
How word count is distorting the reality
The average score is irrelevant to begin with, and without normalization to the word count it's doubly irrelevant.
Here is why:
And the same scores normalized to 2000+ words:
Why is this important? Because the differences become smaller, of course:
- Not-Normalized: Average 75.6, Deviation 8.0, Median 77.0
- Normalized: Average 83.7, Deviation 7.5, Median 85.0
*I know it's wrong to do Average and Deviation, but people like those two parameters for some reason
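A rough sketch of one possible reading of "normalized to 2000+ words": give back the points a post lost only because of its word-count bracket. The brackets, the values in `WORD_PENALTY` and the `reviews` list are all assumptions invented for this example; they are not the official scoring table or the real data.

```python
from statistics import mean, median

# Hypothetical points lost per word-count bracket (NOT the official values).
WORD_PENALTY = {"2000+": 0, "1500-1999": 5, "1000-1499": 10, "<1000": 15}

# Made-up (raw_score, word_bracket) pairs, not taken from the review sheet.
reviews = [
    (80, "1000-1499"),
    (70, "<1000"),
    (90, "2000+"),
    (75, "1500-1999"),
    (65, "1000-1499"),
]

def normalize(raw_score, bracket):
    """Score the post would have had in the 2000+ bracket (assumed model)."""
    return raw_score + WORD_PENALTY[bracket]

raw = [score for score, _ in reviews]
normalized = [normalize(score, bracket) for score, bracket in reviews]

print("raw:       ", raw, "mean", mean(raw), "median", median(raw))
print("normalized:", normalized, "mean", mean(normalized), "median", median(normalized))
print("spread raw vs normalized:", max(raw) - min(raw), max(normalized) - min(normalized))
```

In this toy data the scores move up and the spread between the best and the worst post shrinks, which is the same direction of change as in the two rows above.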
(Not) surprisingly, everything is fine once the points are normalized.
Let's do average scores for teams (*there are at least 3 reasons why this is pointless, but anyway, it's the norm to do it)
Let's make it more dramatic. Such an inequality :D
Now, let's normalize the scores to the word count:
As you can see, the perfect equality.
Several translations contained a lot of errors and that's basically it...
In Conclusion
- Always normalize the data
- Don't consider points if the relation to ranks is not linear-ish
- Don't use the average if the distribution is not a normal distribution
- There is no need to have multiple questions if the answers are highly correlated
The only question that is "controversial" is the distribution of ranks for "quality of post".
Dutch and Serbs, maybe we are too harsh:
The difference is only 3 points, so - who cares.
There was also 1 post with all Excellent scores, except the Rank 4 in Errors, a bit unusual.