How would you know if the box can be subjected to controlled tests? That assumes that the box will respond in the same way every time we act upon it in some way. That is true for most boxes we encounter, but maybe not all.
As you yourself admitted, the only justification you have for this statement is that you are assessing other methods by the standards of science, which could be completely irrelevant if the matter being studied is not amenable to science. In other words, your statement is not a rational conclusion, but an axiom of your epistemological system. It would be begging the question to use this as an argument for why other methods are less reliable than science.
Firstly, your assumptions do inherently rule out certain types of objects, behaviors of objects, or scenarios which are not amenable to controlled scientific testing.
Secondly, that “proven track record” counts as proven only under your starting assumptions.
Thirdly, there could be boxes or other objects which are very different in nature from what science has been successful at studying. For such objects, science’s track record means little.
Usefulness here is again defined with respect to what science is interested in. So again, your definitions of reliability and usefulness are conveniently chosen to satisfy your starting assumption of science as the standard by which all methods of knowing must be judged.
I agree with you that there are certain scenarios where you can directly compare the results of knowing via the scientific method with those of other methods. But there are other scenarios where it doesn’t make sense to compare them (such as knowing whether it is morally right or wrong to do X),