I agree with the points you made about reliability, but one also needs to consider validity.
VALIDITY
Do the questions really measure what they are supposed to measure? If you ask irrelevant questions, you will get irrelevant results.
I once proctored a standardized, government-approved exam used to select Taiwanese students for study abroad based on their ability to understand spoken French. I'm a native speaker and a highly experienced teacher, so I was shocked to hear some of the recorded multiple-choice questions:
(1) What is a rocket?
(2) What is a satellite?
Since many of the students who studied French at the time were heavily into literature and art, I knew that such ridiculous questions would doom many of them to failure. Why on earth should an inability to understand technical vocabulary be used to assess students' listening skills?
Unfortunately, my opinion was not solicited, and I was made to understand that criticism was not welcome. I was treated as hired help.
Judging by some of the complaints made by test takers in the Maltese crypto exam, it is obvious that there was indeed a problem with validity there too:
“However, the questions barely tested our substantive knowledge on the subject. They were mostly questions about procedure that you wouldn’t bother studying because the information can be easily looked up.”
Maltese Professionals Lament Unfair Treatment In Crypto Agents Licensing Exam
However, as you point out, this is a beginning (it's better than nothing) and they will probably improve their techniques as time goes by.
RE: Failing A Crypto Exam: An Educator's Perspective