Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000