What's wrong with measuring and evaluating its outputs directly? If it can file taxes more accurately than us, does it matter whether it does so in a human manner?
If your definition of AGI is filing taxes, then it's fine.
Once we step into any other problem, we need to measure that problem as well. Many problems are concerned with how an intelligent being can fail, and much of our society is built on assumptions about those failure modes.
Birds and planes both fly, after all.