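Author: 安平博 (An Pingbo), senior engineer at Xilinx. Reposted from the WeChat official account AI加速.
The lower step converts high-level operators (Relay) into low-level operators (TOPI). Lowering starts in the following code (src/relay/backend/graph_runtime_codegen.cc):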
    LoweredOutput Codegen(relay::Function func) {
      auto pf = GetPackedFunc("relay.backend.GraphPlanMemory");
      storage_device_map_ = (*pf)(func);
      // First we convert all the parameters into input nodes.
      for (auto param : func->params) {
        auto node_ptr = GraphInputNode::make_node_ptr(param->name_hint(), GraphAttrs());
        var_map_[param.get()] = AddNode(node_ptr, param);
      }
      heads_ = VisitExpr(func->body);
      std::ostringstream os;
      dmlc::JSONWriter writer(&os);
      GetJSON(&writer);
      LoweredOutput ret;
      ret.graph_json = os.str();
      ret.params = params_;

      for (auto& kv : lowered_funcs_) {
        if (ret.lowered_funcs.count(kv.first) == 0) {
          ret.lowered_funcs.Set(kv.first, IRModule());
        }
        auto& mod = ret.lowered_funcs[kv.first];
        mod->Update(kv.second);
        ret.lowered_funcs.Set(kv.first, mod);
      }
      ret.external_mods = compile_engine_->LowerExternalFunctions();
      return ret;
    }
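The overall flow of Codegen (parameters become input nodes, the body is visited, the node list is serialized to JSON) can be sketched in plain Python. This is a toy model and not TVM code: the function `toy_codegen`, the nested-tuple expression format, and the simplified JSON layout are all assumptions made for illustration.

```python
import json

def toy_codegen(params, body):
    """Toy model of Codegen; not the TVM implementation.

    params: list of parameter names.
    body:   a parameter name, or a tuple (op_name, arg, arg, ...).
    """
    nodes = []     # plays the role of the graph node list
    var_map = {}   # parameter name -> node index, like var_map_

    # First convert all parameters into input nodes.
    for name in params:
        var_map[name] = len(nodes)
        nodes.append({"op": "null", "name": name, "inputs": []})

    def visit(expr):
        # A parameter reference resolves to its input node.
        if isinstance(expr, str):
            return var_map[expr]
        # A call: visit the arguments first, then add an operator node.
        op_name, *args = expr
        inputs = [visit(arg) for arg in args]
        nodes.append({"op": "tvm_op", "name": op_name, "inputs": inputs})
        return len(nodes) - 1

    heads = [visit(body)]
    return json.dumps({"nodes": nodes, "heads": heads})

# add(mul(x, w), x) with two parameters x and w
graph_json = toy_codegen(["x", "w"], ("add", ("mul", "x", "w"), "x"))
graph = json.loads(graph_json)
```

The point of the sketch is the ordering: inputs are registered before the traversal, so every operator node can refer to its arguments by index.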
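After memory planning (GraphPlanMemory) completes, VisitExpr traverses the graph and lowers each Relay operator. Let us look at how a CallNode is handled. The main code is: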
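    auto pf0 = GetPackedFunc("relay.backend._make_CCacheKey");
    auto pf1 = GetPackedFunc("relay.backend._CompileEngineLower");
    Target target;
    // Handle external function
    if (func->GetAttr<String>(attr::kCompiler).defined()) {
      target = tvm::target::ext_dev();
      CCacheKey key = (*pf0)(func, target);
      CachedFunc ext_func = (*pf1)(compile_engine_, key);
      // ... (rest of the external branch omitted in this excerpt)

When an external compiler is specified, this step lowers the function with it. CCacheKey packs the function and the target together, probably to make later compiler calls easier. The lower call ends up in the LowerInternal function of the CompileEngineImpl class in src/relay/backend/compile_engine.cc, which implements both external-compiler lowering and internal lowering. If an external compiler is involved, LowerInternal packs the function, the target, and related information into a CCacheValue and returns it, leaving the actual work to the external compiler, which handles all such functions together later.

If there is no external compiler, TVM converts the Relay operator into an operator from the TOPI library.

    // key: a CCacheKey built from func and target, as above
    CachedFunc lowered_func = (*pf1)(compile_engine_, key);
    if (!lowered_funcs_.count(target->str())) {
      lowered_funcs_[target->str()] = IRModule();
    }
    lowered_funcs_[target->str()]->Update(lowered_func->funcs);
    return GraphAddCallNode(op, _GetUniqueName(lowered_func->func_name), lowered_func->func_name);

The internal path also goes through LowerInternal, which first creates a schedule: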
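One plausible reason for packing the function and the target together is that the pair can serve as a lookup key, so that the same function is lowered only once per target. The sketch below illustrates that idea only. `ToyCacheKey`, `ToyCompileEngine`, and the shape of the cached value are hypothetical names and structures, not the CompileEngineImpl code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyCacheKey:
    """Hypothetical stand-in for CCacheKey: a hashable (function, target) pair."""
    func_name: str
    target: str

class ToyCompileEngine:
    """Toy model of a lowering cache; not the CompileEngineImpl code."""

    def __init__(self):
        self.cache = {}
        self.lower_count = 0  # how many times lowering work was actually done

    def lower(self, key, external=False):
        # Same function and same target: reuse the cached result.
        if key in self.cache:
            return self.cache[key]
        self.lower_count += 1
        if external:
            # External compiler: only record what has to be compiled later.
            value = {"key": key, "lowered": None}
        else:
            # Internal path: pretend to lower to a TOPI-level function.
            value = {"key": key, "lowered": "lowered_" + key.func_name}
        self.cache[key] = value
        return value

engine = ToyCompileEngine()
first = engine.lower(ToyCacheKey("conv2d_fn", "llvm"))
second = engine.lower(ToyCacheKey("conv2d_fn", "llvm"))   # cache hit
other = engine.lower(ToyCacheKey("conv2d_fn", "cuda"))    # different target
ext = engine.lower(ToyCacheKey("dnnl_fn", "ext_dev"), external=True)
```

Because the key is frozen and compares by value, two keys built from the same function name and target hit the same cache entry.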
    CachedFunc CreateSchedule(const Function& source_func, const Target& target) {
      return ScheduleGetter(target).Create(source_func);
    }
In the Create function, the inputs are first converted into their te representation (placeholder tensors):
    for (Var param : prim_func->params) {
      Array<tvm::te::Tensor> inputs;
      if (const auto* ttype = param->checked_type().as<TensorTypeNode>()) {
        tvm::te::Tensor tensor = tvm::te::placeholder(GetShape(ttype->shape), ttype->dtype);
        cache_node->inputs.push_back(tensor);
        inputs.push_back(tensor);
      } else {
        // flatten tuple of tensor type.
        const auto* tuple_type = param->type_as<TupleTypeNode>();
        for (Type field : tuple_type->fields) {
          const auto* ttype = field.as<TensorTypeNode>();
          // TODO(@icemelon): Allow recursive tuple
          CHECK(ttype != nullptr);
          tvm::te::Tensor tensor = tvm::te::placeholder(GetShape(ttype->shape), ttype->dtype);
          cache_node->inputs.push_back(tensor);
          inputs.push_back(tensor);
        }
      }
      memo_[param] = inputs;
    }
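The input-conversion loop above does two things at once: it builds a flat list of placeholders for the whole function, and it remembers which placeholders belong to which parameter. A minimal Python sketch of that bookkeeping follows. It is a toy model: `flatten_params`, the `(shape, dtype)` encoding of a tensor type, and the dict-based placeholder are assumptions for illustration, not TVM API.

```python
def flatten_params(params):
    """Toy model of the input-conversion loop; not TVM code.

    params maps a parameter name to its type. A tensor type is written as a
    (shape, dtype) tuple; a tuple type is a list of tensor types.
    """
    all_inputs = []   # like cache_node->inputs
    memo = {}         # like memo_: parameter -> its placeholder tensors

    def placeholder(shape, dtype):
        return {"shape": tuple(shape), "dtype": dtype}

    for name, ty in params.items():
        inputs = []
        # A single tensor is treated as a one-field tuple.
        fields = [ty] if isinstance(ty, tuple) else ty
        for field in fields:
            # Nested tuples are rejected, matching the CHECK in the source.
            assert isinstance(field, tuple), "recursive tuples are not allowed"
            tensor = placeholder(*field)
            all_inputs.append(tensor)
            inputs.append(tensor)
        memo[name] = inputs
    return all_inputs, memo

all_inputs, memo = flatten_params({
    "data": ((1, 3, 224, 224), "float32"),
    "pair": [((4,), "int32"), ((4, 4), "float32")],
})
```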
Then the remaining nodes are traversed to perform the lowering.
Again, let us look at how a CallNode is visited.
    Array<te::Tensor> VisitExpr_(const CallNode* call_node) final {
      static auto fpattern = Op::GetAttrMap<TOpPattern>("TOpPattern");
      static auto flower_call = tvm::runtime::Registry::Get("relay.backend.lower_call");
      CHECK(flower_call) << "relay.backend.lower_call is not registered.";

      Array<te::Tensor> inputs;
      int count_tuple = 0;
      for (Expr arg : call_node->args) {
        if (arg->checked_type().as<TupleTypeNode>()) {
          ++count_tuple;
        }
        for (te::Tensor tensor : VisitExpr(arg)) {
          inputs.push_back(tensor);
        }
      }
      if (count_tuple) {
        CHECK_EQ(call_node->args.size(), 1U) << "Only allow function with a single tuple input";
      }

      CHECK(call_node->op.as<OpNode>()) << "Primitive function only allows call into primitive ops";
      Op op = Downcast<Op>(call_node->op);

      Array<te::Tensor> outputs;
      OpImplementation impl;
      // Skip fcompute for device copy operators as it is not registered.
      if (op == device_copy_op_) {
        const auto* copy_input = inputs[0].operator->();
        outputs.push_back(te::Tensor(copy_input->shape, copy_input->dtype, te::Operation(), 0));
      } else {
        LoweredOutput lowered_out = (*flower_call)(GetRef<Call>(call_node), inputs, target_);
        outputs = lowered_out->outputs;
        // ...
Here the lowering calls the lower_call function registered from Python, which lives in python/tvm/relay/backend/compile_engine.py. The most important part of this function is select_implementation.
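The lookup by string name is what lets one side of the code base call a function defined on the other side without linking against it directly. The pattern can be sketched with a plain dictionary. This is an illustration of the registry idea only: `register_func`, `get_func`, and the name `toy.backend.lower_call` are hypothetical and are not the TVM FFI.

```python
_REGISTRY = {}

def register_func(name):
    """Decorator that stores a function under a global string name."""
    def decorator(func):
        _REGISTRY[name] = func
        return func
    return decorator

def get_func(name):
    """Look a function up by name; returns None if it was never registered."""
    return _REGISTRY.get(name)

@register_func("toy.backend.lower_call")
def lower_call(call, inputs, target):
    # Stand-in body: a real lower_call would select an implementation here.
    return {"call": call, "num_inputs": len(inputs), "target": target}

# The caller only knows the name, similar to Registry::Get in the C++ code.
flower_call = get_func("toy.backend.lower_call")
assert flower_call is not None, "toy.backend.lower_call is not registered."
result = flower_call("conv2d", ["data", "kernel"], "llvm")
```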
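select_implementation chooses a TOPI-level implementation for a Relay operator. The same Relay operator has different implementations on different targets, and which one is used depends on the properties of the target. select_implementation first calls get_valid_implementations to collect all registered implementations that are valid for the current case: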
    fstrategy = op.get_attr("FTVMStrategy")
    assert fstrategy is not None, "%s doesn't have FTVMStrategy registered" % op.name
    with target:
        strategy = fstrategy(attrs, inputs, out_type, target)
    analyzer = tvm.arith.Analyzer()
    ret = []
    for spec in strategy.specializations:
        if spec.condition:
            # check if all the clauses in the specialized condition are true
            flag = True
            for clause in spec.condition.clauses:
                clause = analyzer.canonical_simplify(clause)
                if isinstance(clause, tvm.tir.IntImm) and clause.value:
                    continue
                flag = False
                break
            if flag:
                for impl in spec.implementations:
                    ret.append(impl)
        else:
            for impl in spec.implementations:
                ret.append(impl)
    return ret
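The filtering rule in this code is: a specialization without a condition always contributes its implementations, and a specialization with a condition contributes them only if every clause simplifies to a true constant. A toy version of that rule is sketched below. It assumes the clauses have already been simplified, so each clause is either `True`, `False`, or a string standing for an expression that could not be resolved. The function name and the implementation names are made up for the example.

```python
def toy_valid_implementations(specializations):
    """Toy version of the filtering rule; clauses are pre-simplified."""
    ret = []
    for spec in specializations:
        condition = spec["condition"]
        if condition:
            # Every clause must be the constant True; an unresolved
            # (symbolic) clause counts as not satisfied.
            if all(clause is True for clause in condition):
                ret.extend(spec["implementations"])
        else:
            # No condition: the implementations are always valid.
            ret.extend(spec["implementations"])
    return ret

specs = [
    {"condition": None, "implementations": ["conv2d_nchw.generic"]},
    {"condition": [True, True], "implementations": ["conv2d_nchw.special_a"]},
    {"condition": [True, "n > 8"], "implementations": ["conv2d_nchw.special_b"]},
    {"condition": [False], "implementations": ["conv2d_nchw.special_c"]},
]
valid = toy_valid_implementations(specs)
```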
fstrategy points to the function registered under the operator attribute "FTVMStrategy". For example, the strategy registered for conv2d includes:
    def conv2d_strategy(attrs, inputs, out_type, target):
        """conv2d generic strategy"""
        logger.warning("conv2d is not optimized for this platform.")
        strategy = _op.OpStrategy()
        data, kernel = inputs
        dilation = get_const_tuple(attrs.dilation)
        groups = attrs.groups
        layout = attrs.data_layout
        kernel_layout = attrs.kernel_layout
        (dilation_h, dilation_w) = dilation
        if dilation_h < 1 or dilation_w < 1:
            raise ValueError("dilation should be positive value")

        if groups == 1:
            if layout == "NCHW":
                assert kernel_layout == "OIHW"
                strategy.add_implementation(
                    wrap_compute_conv2d(topi.nn.conv2d_nchw),
                    wrap_topi_schedule(topi.generic.schedule_conv2d_nchw),
                    name="conv2d_nchw.generic")
        # ...
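As this shows, conv2d can register several implementations even for a single target. add_implementation registers the concrete compute and schedule functions in the strategy. A strategy is the data structure that holds the ways a Relay operator can be implemented. It contains a number of OpSpecialization objects, each OpSpecialization contains a series of OpImplementation objects, and each OpImplementation holds a concrete schedule and compute: the schedule describes how the operator's computation is arranged, and the compute corresponds to a TOPI library operator.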
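The three-level nesting (strategy, specialization, implementation) can be made concrete with a few dataclasses. This is a sketch of the shape of the data only. The `Toy*` classes, the explicit `condition` argument, and the default `plevel` value are assumptions for this example and do not reproduce the TVM classes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class ToyImplementation:
    compute: Callable    # how to compute the operator
    schedule: Callable   # how to arrange that computation
    name: str
    plevel: int          # priority level

@dataclass
class ToySpecialization:
    condition: Optional[str]   # None means "always applies"
    implementations: List[ToyImplementation] = field(default_factory=list)

@dataclass
class ToyStrategy:
    specializations: List[ToySpecialization] = field(default_factory=list)

    def add_implementation(self, compute, schedule, name, plevel=10, condition=None):
        # Reuse the specialization with the same condition, or create one.
        for spec in self.specializations:
            if spec.condition == condition:
                break
        else:
            spec = ToySpecialization(condition)
            self.specializations.append(spec)
        spec.implementations.append(ToyImplementation(compute, schedule, name, plevel))

def compute_add(a, b):
    return [x + y for x, y in zip(a, b)]

def schedule_default(outs):
    return "default schedule for " + str(len(outs)) + " output(s)"

strategy = ToyStrategy()
strategy.add_implementation(compute_add, schedule_default, name="add.generic")
strategy.add_implementation(compute_add, schedule_default, name="add.fast", plevel=15)
strategy.add_implementation(compute_add, schedule_default, name="add.small", condition="n <= 8")
impl = strategy.specializations[0].implementations[1]
```

Keeping compute and schedule as separate callables mirrors the separation the text describes: the same compute can be paired with different schedules.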
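Once all valid implementations have been collected, one of them is selected in one of two ways. If AutoTVM is used, the best implementation is found by automated search. If AutoTVM is not used, the implementation with the highest plevel is selected. After an implementation has been chosen, an instance of the LoweredOutput class from src/relay/backend/compile_engine.cc is created. In this way lower_call re-expresses each Relay operator in a uniform, lower-level abstraction that contains the Relay operator together with its compute and schedule information, which makes the later schedule optimization easier.
These LoweredOutput objects are then packed into a CachedFuncNode, which serves as the input to the later schedule optimization.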
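The two selection modes can be sketched as follows. This is a simplified model under stated assumptions: a table of measured costs stands in for the results of an AutoTVM search, implementations without a record in that table are ignored in the tuned mode, and the function name and implementation names are hypothetical.

```python
def toy_select_implementation(impls, tuned_costs=None):
    """Pick one implementation from a list of (name, plevel) pairs.

    tuned_costs: optional dict mapping implementation name to a measured
    cost (lower is better). It stands in for tuning results.
    """
    if tuned_costs:
        tuned = [impl for impl in impls if impl[0] in tuned_costs]
        if tuned:
            # Tuned mode: the implementation with the lowest cost wins.
            return min(tuned, key=lambda impl: tuned_costs[impl[0]])
    # Fallback: the implementation with the highest plevel wins.
    return max(impls, key=lambda impl: impl[1])

impls = [
    ("conv2d_nchw.generic", 10),
    ("conv2d_nchw.winograd", 5),
    ("conv2d_nchw.direct", 15),
]

by_plevel = toy_select_implementation(impls)
by_cost = toy_select_implementation(
    impls, {"conv2d_nchw.generic": 0.004, "conv2d_nchw.winograd": 0.002})
no_match = toy_select_implementation(impls, {"some_other_op": 0.001})
```

Note that in the tuned mode a low-plevel implementation can still win if it measured fastest, which is the reason for searching instead of relying on the static priority alone.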