The matrix-multiplication process is partitioned between two separate dies, a memory die and a compute die, to achieve a low-latency, high-bandwidth artificial intelligence (AI) processor. A blocked matrix-multiplication scheme maps the computation across multiple processing elements (PEs) or matrix-multiplication units. The AI architecture for inference and training includes one or more PEs, where each PE includes memory (e.g., ferroelectric (FE) memory, FE-RAM, SRAM, DRAM, MRAM, etc.) to store weights and input/output (I/O) data. Each PE also includes a ring or mesh interconnect network that couples the PEs for fast access to information.
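
To illustrate the blocked scheme, the following Python sketch shows how output tiles of C = A x B can be assigned round-robin to a set of PEs, with each PE accumulating its tile over the shared inner dimension. The tile size, PE count, and all names (blocked_matmul, TILE, NUM_PES) are illustrative assumptions and not part of the described architecture.

```python
# Minimal sketch of blocked matrix multiplication mapped onto processing
# elements (PEs). Tile size, PE count, and names are assumptions for
# illustration, not taken from the source architecture.
import numpy as np

TILE = 4      # assumed tile (block) edge length
NUM_PES = 4   # assumed number of processing elements

def blocked_matmul(A, B, tile=TILE, num_pes=NUM_PES):
    """Compute C = A @ B by tiles; each output tile is assigned to a PE."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % tile == 0 and m % tile == 0 and k % tile == 0
    C = np.zeros((n, m), dtype=A.dtype)

    # Build the work list: each output tile (ti, tj) is mapped to a PE
    # round-robin, modeling the distribution of work across PEs.
    work = []  # entries of (pe_id, row offset, column offset)
    for ti, i in enumerate(range(0, n, tile)):
        for tj, j in enumerate(range(0, m, tile)):
            pe_id = (ti * (m // tile) + tj) % num_pes
            work.append((pe_id, i, j))

    for pe_id, i, j in work:
        # Each PE accumulates its C tile over the shared K dimension,
        # standing in for local accumulation in the PE's memory.
        acc = np.zeros((tile, tile), dtype=A.dtype)
        for p in range(0, k, tile):
            acc += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
        C[i:i+tile, j:j+tile] = acc
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))
    B = rng.standard_normal((8, 8))
    # The blocked result matches the reference dense product.
    assert np.allclose(blocked_matmul(A, B), A @ B)
```

In hardware, the per-tile accumulation above would occur in each PE's local memory on the compute die, with operand tiles streamed from the memory die and intermediate results exchanged over the ring or mesh interconnect.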