Analysis of the Cisco QuantumFlow Processor Architecture: 弯曲评论

Posted: 2010-03-02 17:10

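I'd like to offer my own analysis of the Cisco QFP architecture as well. I hope colleagues at Cisco will correct me where I'm wrong.

I won't repeat the points the Chief has already covered.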

0. QFP is an architecture, not a single chip. Building a routing system (ASR1000) takes four core Cisco ASICs:
multi-core packet processor chip
Traffic Manager (BQS)
Crypto
SPA aggregation ASIC

1. On the QFP ISA and microarchitecture, Will has already made things clear: the Xtensa ISA combined with Cisco's own custom microarchitecture.

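2. Cache & on-chip packet memory
Each PPE has its own:
L1 D-cache: 4KB, 8-way, with 2 ways dedicated to each thread
L1 I-cache: 16KB, 8-way
The 40 PPEs share two 256KB L2 I-caches. There is little reason to use the L2 as a D-cache: how much locality does packet data have?
Will's report states this clearly; the Chief simply missed it.
How are the two 256KB L2 I-caches organized? Could it be that 20 PPEs use one L2 I-cache and the other 20 use the second?
The on-chip packet memory should be about 1.2MB. Why? Will's report gives 20Mb of SRAM in total; subtracting the L1 and L2 caches leaves roughly 1.2MB.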
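
The 1.2MB estimate can be checked with a little arithmetic. The sketch below uses only the figures quoted above (40 PPEs, 4KB L1 D-cache, 16KB L1 I-cache, two 256KB L2 I-caches, 20Mb of SRAM). It assumes that "Mb" means 2^20 bits and that cache tags and ECC are not counted against the total; neither assumption comes from the report.

```python
# Back-of-the-envelope SRAM budget for the on-chip packet memory estimate.
# Assumptions (not from the report): 1 Mb = 2**20 bits, and tag/ECC
# overhead is ignored.

KB = 1024
NUM_PPES = 40
L1_DCACHE_BYTES = 4 * KB      # per PPE
L1_ICACHE_BYTES = 16 * KB     # per PPE
L2_ICACHE_BYTES = 256 * KB    # per L2 I-cache
NUM_L2_ICACHES = 2
TOTAL_SRAM_BITS = 20 * 1024 * 1024


def packet_memory_bytes():
    """Return the SRAM left over after subtracting all L1 and L2 caches."""
    total_bytes = TOTAL_SRAM_BITS // 8
    l1_bytes = NUM_PPES * (L1_DCACHE_BYTES + L1_ICACHE_BYTES)
    l2_bytes = NUM_L2_ICACHES * L2_ICACHE_BYTES
    return total_bytes - l1_bytes - l2_bytes


if __name__ == "__main__":
    remaining = packet_memory_bytes()
    print(f"{remaining} bytes = {remaining / (KB * KB):.2f} MB")
```

Under these assumptions the caches take 1312KB of the 2560KB total, which leaves 1248KB, or about 1.22MB.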

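3. TLB & cache coherency
Besides read/write access control and address translation, the TLB also controls memory-ordering attributes: relaxed order and strong order.
Software cache-coherency operations are supported, such as flush, flush and invalidate, and so on. My guess is that hardware-enforced cache coherency is not supported. It does appear to support cache warming, also called stashing: response data for a message arriving from the crossbar can be placed directly into the cache and the cache tag marked valid.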

4. Memory model
A weakly ordered model, but with support for barriers, serialization, and atomic operations.

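5. Programming model
A flat memory programming model, which makes C programming much easier. One of Cisco's design goals was to write forwarding code in C rather than in microcode. External RLDRAM is mapped directly into the processor thread's address space through the TLB; on-chip packet memory can also be mapped directly into the thread address space; external memory can likewise be mapped through the TLB to serve as the stack for C code; and the registers used by the internal hardware accelerators (control status registers) can be mapped into the thread address space as well.

6. IPC
My guess is IPC = 1: the text says 1200 MIPS, and the PPE's maximum frequency is 1.2GHz.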

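7. Packet-processing architecture
A pool model built around central shared memory, not a pipeline model. It comprises the DISTRIBUTOR, on-chip packet memory, the PPE pool (40 processor cores), lookup engine, TCAM, lock manager and sequencer, GATHER/DMA, BQS, and so on, and of course it also depends on off-chip memory.
The entire layer-2 packet is visible to the PPEs, not just the header. These hardware resources communicate over the resource interconnect and the memory interconnect.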

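8. Interconnect architecture
The core interconnect consists of a crossbar-switch-based resource interconnect plus a memory interconnect.
Here the Chief's description is inaccurate. Some memory operations have no need to cross the central resource interconnect; they use a separate memory interconnect path instead, which reduces memory-access latency. The memory accesses of the lookup engine and the hash engine are examples. Likewise, the L2 I-cache reaches memory through a separate memory-access path. The memory controller is presumably a multi-port, multi-bank design. Sustaining high memory bandwidth at low latency is one of the core problems in high-speed network processor design. So it is wrong for the Chief to place the L2 cache on this crossbar.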

The crossbar resource interconnect connects at least the DISTRIBUTOR, on-chip packet memory, PPEs, lock manager, GATHER, memory controller, TCAM controller, and lookup engine.

The resource interconnect is based on a message-passing mechanism. A message carries source and destination addresses, a command, and data. Communication is carried out with message requests and message responses.
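
To make the request/response pattern concrete, here is a toy model. It is only a sketch: the `Message` fields follow the description above (source, destination, command, data), but the class names, port addresses, the `LOOKUP` command, and the table contents are all hypothetical, and the real interconnect is asynchronous hardware rather than a synchronous function call.

```python
from dataclasses import dataclass

# Hypothetical port addresses on the crossbar, for illustration only.
PPE_ADDR = 0
LOOKUP_ADDR = 9


@dataclass(frozen=True)
class Message:
    """A message on the resource interconnect: addresses, command, data."""
    src: int
    dst: int
    command: str
    data: bytes = b""


def make_response(request, data):
    """Build the response message by swapping source and destination."""
    return Message(src=request.dst, dst=request.src,
                   command=request.command + "_RESP", data=data)


def lookup_engine(request):
    """Toy lookup resource: answers a LOOKUP request from a fixed table."""
    table = {b"10.0.0.1": b"port3"}  # made-up contents
    return make_response(request, table.get(request.data, b"miss"))


class Crossbar:
    """Routes each request to the handler attached at its destination."""

    def __init__(self):
        self.ports = {}

    def attach(self, address, handler):
        self.ports[address] = handler

    def send(self, request):
        # Synchronous for simplicity; hardware would queue and pipeline this.
        return self.ports[request.dst](request)


if __name__ == "__main__":
    xbar = Crossbar()
    xbar.attach(LOOKUP_ADDR, lookup_engine)
    req = Message(src=PPE_ADDR, dst=LOOKUP_ADDR,
                  command="LOOKUP", data=b"10.0.0.1")
    print(xbar.send(req))
```

The point of the sketch is that the requester never touches the resource directly: it names a destination and a command, and the response is routed back using the source address carried in the request.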

Will's report already tells us how a PPE attaches to the crossbar. Each PPE connects to the crossbar through a message coprocessor and a buffer, with 5 channels per thread.


26 replies to “Analysis of the Cisco QuantumFlow Processor Architecture”

楼上楼下，电灯电话 on 2010-03-02 8:46 pm
老刘 on 2010-03-02 9:16 pm
Multithreaded on 2010-03-02 9:27 pm

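>>2. The 40 PPEs share two 256KB L2 I-caches. There is little reason to use the L2 as a D-cache: how much locality does packet data have?
Will's report states this clearly; the Chief simply missed it.
How are the two 256KB L2 I-caches organized? Could it be that 20 PPEs use one L2 I-cache and the other 20 use the second?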

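Using a 256KB L2 cache as an I-cache seems unnecessary. Generally speaking, a 64KB (16K-instruction) I-cache should be enough for the fast path.

With an SPMD programming model, all 40 PPEs should run the same code, so there is no need to carve out two instruction spaces.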

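>>5. A flat memory programming model, which makes C programming much easier. One of Cisco's design goals was to write forwarding code in C rather than in microcode. External RLDRAM is mapped directly into the processor thread's address space through the TLB; on-chip packet memory can also be mapped directly into the thread address space; external memory can likewise be mapped through the TLB to serve as the stack for C code; and the registers used by the internal hardware accelerators (control status registers) can be mapped into the thread address space as well.

That is just a property of memory mapping and has nothing to do with stack support. If every thread has its own stack, the TLB alone cannot solve the problem. Here is my guess; Cisco folks, please correct me.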

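With an SPMD programming model, every thread is allocated a stack of the same size inside one contiguous region. Each stack instruction should carry only a relative address. At addressing time, (Processor_ID, Thread_ID) is used to compute the high-order bits, i.e. the starting address of that thread's stack, and the relative address is then added to produce the final address.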
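
The address computation guessed at here can be sketched as follows. This is an illustration of the idea, not Cisco's actual scheme: `STACK_REGION_BASE` is a made-up address, and 4 threads per PPE is assumed from the 40×4=160 thread count used elsewhere in this thread.

```python
# Per-thread stack addressing in an SPMD model: every thread runs the same
# code with the same relative stack offsets, and the hardware derives the
# stack base from (processor ID, thread ID).

NUM_PPES = 40
THREADS_PER_PPE = 4             # assumption: 40 x 4 = 160 threads
STACK_REGION_BASE = 0x10000000  # hypothetical base of the stack region


def stack_base(ppe_id, thread_id, stack_size=1024):
    """Starting address of one thread's stack in the contiguous region."""
    if not (0 <= ppe_id < NUM_PPES and 0 <= thread_id < THREADS_PER_PPE):
        raise ValueError("ppe_id or thread_id out of range")
    if stack_size <= 0:
        raise ValueError("stack_size must be positive")
    index = ppe_id * THREADS_PER_PPE + thread_id
    return STACK_REGION_BASE + index * stack_size


def stack_address(ppe_id, thread_id, offset, stack_size=1024):
    """Turn a relative stack offset into that thread's absolute address."""
    if not 0 <= offset < stack_size:
        raise ValueError("offset outside the thread's stack")
    return stack_base(ppe_id, thread_id, stack_size) + offset


if __name__ == "__main__":
    # The same relative offset lands at a different address for each thread.
    for ppe, thread in [(0, 0), (0, 1), (39, 3)]:
        print(ppe, thread, hex(stack_address(ppe, thread, 0x100)))
```

Because the offset is the only thing encoded in the instruction, one binary serves every thread, and changing `stack_size` (1K, 2K, 4K, ...) only changes the multiplier used for the base.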

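Don't underestimate this. From a hardware-design standpoint it is easy to do; the difficulty is in the software. Back then it took us real effort to get the architecture, programming model, and code generation right. In short, if you want programmers to enjoy writing C, you must understand compiler principles when you design the system.
老刘 on 2010-03-02 9:47 pm
Multithreaded on 2010-03-02 9:48 pm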

>>0. QFP is an architecture, not a single chip. Building a routing system (ASR1000) takes four core Cisco ASICs:
multi-core packet processor chip
Traffic Manager (BQS)
Crypto
SPA aggregation ASIC

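To be honest, I am quite disappointed with QFP. Eight years after the IXP-2800 appeared, it still takes four chips to handle 10G/20G.

Ten years ago, our bold ambition was to replace three of those chips (all but Crypto) with a single chip running at 10G line rate.

The revolution has not yet succeeded; Cisco still has work to do.
老刘 on 2010-03-02 9:52 pm
老刘 on 2010-03-02 9:55 pm
老刘 on 2010-03-02 10:01 pm
Multithreaded on 2010-03-02 10:03 pm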

The stack can also be implemented in the 1.2MB SRAM. It would be a shallow stack, of course. Let's say 1K per thread, and in total it only occupies 40KB.

In general, each PPE only needs the 40-byte IP/TCP header, and the payload should remain in DDR memory without being loaded into the PPE L1 cache; otherwise there is no way to guarantee the 10G/20G line rate.
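
As a rough check on the shallow-stack sizing, the sketch below computes the total under two readings: 40 stacks (one per PPE), which gives the 40KB quoted above, and 160 stacks (one per hardware thread, using the 40×4=160 figure from later in this thread). The 1.2MB capacity is the estimate from the article, taken here as an assumption.

```python
# Shallow-stack budget against the estimated on-chip packet memory.

KB = 1024
ON_CHIP_BYTES = int(1.2 * 1024 * KB)  # assumption: ~1.2MB estimate


def stack_budget(num_stacks, stack_size=1 * KB):
    """Total bytes needed for num_stacks stacks of stack_size bytes each."""
    return num_stacks * stack_size


if __name__ == "__main__":
    for label, count in [("one per PPE", 40), ("one per thread", 40 * 4)]:
        total = stack_budget(count)
        share = 100 * total / ON_CHIP_BYTES
        print(f"{label}: {total // KB}KB ({share:.1f}% of on-chip SRAM)")
```

Either way, 1K stacks take a small fraction of the on-chip memory; at 4K per thread the total for 160 threads would grow to 640KB.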
杰克 on 2010-03-02 10:37 pm
Multithreaded on 2010-03-02 10:53 pm

How would you hardwire it? For example,

push 0x100 // push 0x100 onto the stack

is the same for all 40×4=160 threads.

When thread_i executes this instruction, what is its stack address?

Please also note that the size of each stack should be configurable, i.e., it can be 1K, 2K, 4K, etc.
数通人 on 2010-03-03 2:11 am
陈怀临 on 2010-03-03 2:56 am
老刘 on 2010-03-03 3:02 am
帅云霓 on 2010-03-03 4:26 am
陈怀临 on 2010-03-03 4:50 am
ALL IP on 2010-03-03 4:59 am
老刘 on 2010-03-03 5:02 am
陈怀临 on 2010-03-03 5:06 am
老刘 on 2010-03-03 5:07 am
老刘 on 2010-03-03 5:19 am
数通人 on 2010-03-03 5:22 am
老刘 on 2010-03-03 5:24 am
(no name) on 2010-03-03 5:37 am
Multithreaded on 2010-03-03 9:05 am

It seems everyone is still stuck in the single-threaded programming model.

There are 160 threads in total and 160 runtime stacks, one per thread.

Question 1: how do you determine the starting address of each thread's stack? Question 2: how do you generate the same code for all 160 threads, i.e., the binary is the same but the runtime behaviour is different?
Multithreaded on 2010-03-03 9:12 am

>>Hardwired stack addresses save the time spent switching the TLB on a thread switch; this is a space-for-efficiency tradeoff.

NPU thread switching doesn't need a TLB update, and it must finish within one or a few cycles.

I guess you are thinking of OS-level process switching, which is very different from the OS-less fast thread switching of an NPU.
