DOI: 10.1145/1816038.1815993. A dynamically configurable coprocessor for convolutional neural networks

Crossref journal-article

Association for Computing Machinery (ACM)

ACM SIGARCH Computer Architecture News (320)

Abstract

Convolutional neural network (CNN) applications range from recognition and reasoning (such as handwriting recognition, facial expression recognition and video surveillance) to intelligent text applications such as semantic text analysis and natural language processing. Two key observations drive the design of a new architecture for CNNs. First, CNN workloads exhibit a widely varying mix of three types of parallelism: parallelism within a convolution operation, intra-output parallelism where multiple input sources (features) are combined to create a single output, and inter-output parallelism where multiple, independent outputs (features) are computed simultaneously. Workloads differ significantly across different CNN applications, and across different layers of a CNN. Second, the number of processing elements in an architecture continues to scale (as per Moore's law) much faster than the off-chip memory bandwidth (or pin count) of chips. Based on these two observations, we show that for a given number of processing elements and off-chip memory bandwidth, a new CNN hardware architecture that dynamically configures the hardware on-the-fly to match the specific mix of parallelism in a given workload gives the best throughput performance. Our CNN compiler automatically translates a high-abstraction network specification into a parallel microprogram (a sequence of low-level VLIW instructions) that is mapped, scheduled and executed by the coprocessor. Compared to a 2.3 GHz quad-core, dual socket Intel Xeon, 1.35 GHz C870 GPU, and a 200 MHz FPGA implementation, our 120 MHz dynamically configurable architecture is 4x to 8x faster. This is the first CNN architecture to achieve real-time video stream processing (25 to 30 frames per second) on a wide range of object detection and recognition tasks.
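
The three types of parallelism named in the abstract can be located in the loop nest of a single convolutional layer. The following is a minimal sketch for illustration only, not the authors' implementation or their microprogram format: the function name `conv_layer`, the nested-list data layout, and the choice of a "valid" sliding-window sum without bias or non-linearity are all assumptions.

```python
# Illustrative sketch only: a naive convolutional layer written as explicit
# loops, annotated with the three kinds of parallelism named in the abstract.
# The name conv_layer and the nested-list layout are assumptions.

def conv_layer(inputs, kernels):
    """inputs: list of n_in 2D feature maps (lists of lists of floats).
    kernels: kernels[o][i] is the K x K kernel linking input i to output o.
    Returns n_out feature maps ('valid' sliding-window sum, kernel not
    flipped, no bias or non-linearity)."""
    n_out = len(kernels)
    n_in = len(inputs)
    height, width = len(inputs[0]), len(inputs[0][0])
    k = len(kernels[0][0])
    out_h, out_w = height - k + 1, width - k + 1

    outputs = []
    # Inter-output parallelism: each output map o is independent of the others.
    for o in range(n_out):
        out = [[0.0] * out_w for _ in range(out_h)]
        # Intra-output parallelism: the n_in partial results for one output
        # map could be computed concurrently and then summed.
        for i in range(n_in):
            for r in range(out_h):
                for c in range(out_w):
                    # Parallelism within a convolution: the K*K products for
                    # one output pixel are independent of each other.
                    acc = 0.0
                    for dr in range(k):
                        for dc in range(k):
                            acc += inputs[i][r + dr][c + dc] * kernels[o][i][dr][dc]
                    out[r][c] += acc
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    ramp = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    ones = [[1.0] * 3 for _ in range(3)]
    box = [[1.0, 1.0], [1.0, 1.0]]
    # Two input maps combined into one output map.
    print(conv_layer([ramp, ones], [[box, box]]))
```

A layer with many small input maps and few outputs leans on the middle loop, while a layer with many outputs leans on the outer loop, which is one way to read the abstract's remark that the mix differs across layers.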

Bibliography

Chakradhar, S., Sankaradas, M., Jakkula, V., & Cadambi, S. (2010). A dynamically configurable coprocessor for convolutional neural networks. ACM SIGARCH Computer Architecture News, 38(3), 247–257.

Authors 4

Srimat Chakradhar (first)
Murugan Sankaradas (additional)
Venkata Jakkula (additional)
Srihari Cadambi (additional)

References 30 Referenced 132

10.1109/5.726791
10.1145/1390156.1390177
Benkrid, K.; Belkacemi, S., "Design and implementation of a 2D convolution core for video applications on FPGAs," Digital and Computational Video, 2002. DCV 2002. Proceedings. Third International Workshop on, pp. 85--92, 14--15 Nov. 2002.
10.1109/TCSII.2005.857091
10.1109/TCSII.2006.886898
10.1109/TNN.2006.883002
10.1007/s11265-005-4961-3
10.1145/1390156.1390170
10.1109/FPL.2009.5272559
Dixon, J. D. (1981). Asymptotically fast factorization of integers. Math. Comput., 36, 255--260.
10.5555/1507435.1507438
Haykin, S. (2008). Neural networks and learning machines. Prentice Hall.
10.1109/VLSIC.2005.1469371
Lisboa, P., Ifeachor, E., & Szczepaniak, P. (2009). Artificial neural networks in Biomedicine. Springer.
McNelis, P. D. (2005). Neural Networks in Finance: Gaining Predictive Edge in the Market. Academic Press.
10.1109/MLSP.2008.4685487
10.1109/CVPR.2006.200
10.1109/ISSCC.2006.1696216
Nichols, K., Moussa, M., & Areibi, S. (2002). Feasibility of floating-point arithmetic in FPGA based artificial neural networks. Proceedings of the 15th International Conference on Computer Applications in Industry and Engineering. San Diego, California.
Nomura, O., & Morie, T. (2007). Projection-Field-Type VLSI Convolutional Neural Networks Using Merged/Mixed Analog-Digital approach. International Conference on Neural Information Processing (pp. 1081--1090). Springer-Verlag.
Omondi, A., & Rajapakse, J. (2006). FPGA Implementations of Neural Networks. Springer. (10.1007/0-387-28487-7)
Prasad, B., & Prasanna, S. (2008). Speech, Audio, Image and Biomedical Signal Processing using Neural Networks. Springer. (10.1007/978-3-540-75398-8)
10.5555/1484785.1484787
Wolf, D. F., Romero, R. A., & Marques, E. (2001). Using embedded processors in hardware models of artificial neural networks. Proceedings of SBAI - Simpósio Brasileiro de Automação Inteligente (pp. 78--83).
10.1109/72.554195
10.1007/978-3-642-03767-2_10
10.1109/CVPR.2005.254
10.1109/CVPR.2005.177
10.1145/1553374.1553486
10.1145/1553374.1553453

Dates

Type	When
Created	Oct. 11, 2012, 10:55 a.m.
Deposited	June 18, 2025, 7:22 a.m.
Indexed	July 5, 2025, 12:25 a.m.
Issued	June 19, 2010
Published	June 19, 2010
Published Online	June 19, 2010
Published Print	June 19, 2010

Funders 0

None

BibTeX

@article{Chakradhar_2010,
  title={A dynamically configurable coprocessor for convolutional neural networks},
  volume={38},
  ISSN={0163-5964},
  url={http://dx.doi.org/10.1145/1816038.1815993},
  DOI={10.1145/1816038.1815993},
  number={3},
  journal={ACM SIGARCH Computer Architecture News},
  publisher={Association for Computing Machinery (ACM)},
  author={Chakradhar, Srimat and Sankaradas, Murugan and Jakkula, Venkata and Cadambi, Srihari},
  year={2010},
  month=jun,
  pages={247--257}
}

JSON

{
  "indexed": {
    "date-parts": [
      [
        2025,
        7,
        5
      ]
    ],
    "date-time": "2025-07-05T04:25:10Z",
    "timestamp": 1751689510062,
    "version": "3.41.0"
  },
  "reference-count": 30,
  "publisher": "Association for Computing Machinery (ACM)",
  "issue": "3",
  "license": [
    {
      "start": {
        "date-parts": [
          [
            2010,
            6,
            19
          ]
        ],
        "date-time": "2010-06-19T00:00:00Z",
        "timestamp": 1276905600000
      },
      "content-version": "vor",
      "delay-in-days": 0,
      "URL": "https://www.acm.org/publications/policies/copyright_policy#Background"
    }
  ],
  "content-domain": {
    "domain": [
      "dl.acm.org"
    ],
    "crossmark-restriction": true
  },
  "published-print": {
    "date-parts": [
      [
        2010,
        6,
        19
      ]
    ]
  },
  "abstract": "<jats:p>\n            Convolutional neural networks (CNN) applications range from recognition and reasoning (such as handwriting recognition, facial expression recognition and video surveillance) to intelligent text applications such as semantic text analysis and natural language processing applications. Two key observations drive the design of a new architecture for CNN. First, CNN workloads exhibit a\n            <jats:italic>widely varying mix of three types of parallelism</jats:italic>\n            : parallelism within a convolution operation, intra-output parallelism where multiple input sources (features) are combined to create a single output, and inter-output parallelism where multiple, independent outputs (features) are computed simultaneously. Workloads differ significantly across different CNN applications, and across different layers of a CNN. Second, the number of processing elements in an architecture continues to scale (as per Moore's law) much faster than off-chip memory bandwidth (or pin-count) of chips. Based on these two observations, we show that for a given number of processing elements and off-chip memory bandwidth, a new CNN hardware architecture that dynamically configures the hardware on-the-fly to match the specific mix of parallelism in a given workload gives the best throughput performance. Our CNN compiler automatically translates high abstraction network specification into a parallel microprogram (a sequence of low-level VLIW instructions) that is mapped, scheduled and executed by the coprocessor. Compared to a 2.3 GHz quad-core, dual socket Intel Xeon, 1.35 GHz C870 GPU, and a 200 MHz FPGA implementation, our 120 MHz dynamically configurable architecture is 4x to 8x faster. This is the\n            <jats:italic>first CNN architecture to achieve real-time video stream processing</jats:italic>\n            (25 to 30 frames per second) on a wide range of object detection and recognition tasks.\n          </jats:p>",
  "DOI": "10.1145/1816038.1815993",
  "type": "journal-article",
  "created": {
    "date-parts": [
      [
        2012,
        10,
        11
      ]
    ],
    "date-time": "2012-10-11T14:55:16Z",
    "timestamp": 1349967316000
  },
  "page": "247-257",
  "update-policy": "https://doi.org/10.1145/crossmark-policy",
  "source": "Crossref",
  "is-referenced-by-count": 132,
  "title": "A dynamically configurable coprocessor for convolutional neural networks",
  "prefix": "10.1145",
  "volume": "38",
  "author": [
    {
      "given": "Srimat",
      "family": "Chakradhar",
      "sequence": "first",
      "affiliation": [
        {
          "name": "NEC Laboratories America, Inc., Princeton, NJ, USA"
        }
      ]
    },
    {
      "given": "Murugan",
      "family": "Sankaradas",
      "sequence": "additional",
      "affiliation": [
        {
          "name": "NEC Laboratories America, Inc., Princeton, NJ, USA"
        }
      ]
    },
    {
      "given": "Venkata",
      "family": "Jakkula",
      "sequence": "additional",
      "affiliation": [
        {
          "name": "NEC Laboratories America, Inc., Princeton, NJ, USA"
        }
      ]
    },
    {
      "given": "Srihari",
      "family": "Cadambi",
      "sequence": "additional",
      "affiliation": [
        {
          "name": "NEC Laboratories America, Inc., Princeton, NJ, USA"
        }
      ]
    }
  ],
  "member": "320",
  "published-online": {
    "date-parts": [
      [
        2010,
        6,
        19
      ]
    ]
  },
  "reference": [
    {
      "key": "e_1_2_1_1_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/5.726791"
    },
    {
      "key": "e_1_2_1_2_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1145/1390156.1390177"
    },
    {
      "key": "e_1_2_1_3_1",
      "first-page": "85",
      "volume-title": "DCV 2002. Proceedings. Third International Workshop on",
      "author": "Benkrid K.",
      "year": "2002",
      "unstructured": "Benkrid , K. ; Belkacemi , S. , \" Design and implementation of a 2D convolution core for video applications on FPGAs,\" Digital and Computational Video, 2002 . DCV 2002. Proceedings. Third International Workshop on , pp. 85 -- 92 , 14--15 Nov. 2002 . Benkrid, K.; Belkacemi, S., \"Design and implementation of a 2D convolution core for video applications on FPGAs,\" Digital and Computational Video, 2002. DCV 2002. Proceedings. Third International Workshop on, pp. 85--92, 14--15 Nov. 2002."
    },
    {
      "key": "e_1_2_1_4_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/TCSII.2005.857091"
    },
    {
      "key": "e_1_2_1_5_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/TCSII.2006.886898"
    },
    {
      "key": "e_1_2_1_6_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/TNN.2006.883002"
    },
    {
      "key": "e_1_2_1_7_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1007/s11265-005-4961-3"
    },
    {
      "key": "e_1_2_1_8_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1145/1390156.1390170"
    },
    {
      "key": "e_1_2_1_9_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/FPL.2009.5272559"
    },
    {
      "key": "e_1_2_1_10_1",
      "volume-title": "Asymptotically fast factorization of integers. Math. Comput., 36, 255--260",
      "author": "Dixon J. D.",
      "year": "1981",
      "unstructured": "Dixon , J. D. ( 1981 ). Asymptotically fast factorization of integers. Math. Comput., 36, 255--260 . Dixon, J. D. (1981). Asymptotically fast factorization of integers. Math. Comput., 36, 255--260."
    },
    {
      "key": "e_1_2_1_11_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.5555/1507435.1507438"
    },
    {
      "key": "e_1_2_1_12_1",
      "volume-title": "Neural networks and learning machines",
      "author": "Haykin S.",
      "year": "2008",
      "unstructured": "Haykin , S. ( 2008 ). Neural networks and learning machines . Prentice Hall . Haykin, S. (2008). Neural networks and learning machines. Prentice Hall."
    },
    {
      "key": "e_1_2_1_13_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/VLSIC.2005.1469371"
    },
    {
      "key": "e_1_2_1_14_1",
      "volume-title": "Artificial neural networks in Biomedicine",
      "author": "Lisboa P.",
      "year": "2009",
      "unstructured": "Lisboa , P. , Ifeachor , E. , & Szczepaniak , P. ( 2009 ). Artificial neural networks in Biomedicine . Springer Lisboa, P., Ifeachor, E., & Szczepaniak, P. (2009). Artificial neural networks in Biomedicine. Springer"
    },
    {
      "key": "e_1_2_1_15_1",
      "volume-title": "Neural Networks in Finance: Gaining Predictive Edge in the Market",
      "author": "McNelis P. D.",
      "year": "2005",
      "unstructured": "McNelis , P. D. ( 2005 ). Neural Networks in Finance: Gaining Predictive Edge in the Market . Academic Press . McNelis, P. D. (2005). Neural Networks in Finance: Gaining Predictive Edge in the Market. Academic Press."
    },
    {
      "key": "e_1_2_1_16_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/MLSP.2008.4685487"
    },
    {
      "key": "e_1_2_1_17_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/CVPR.2006.200"
    },
    {
      "key": "e_1_2_1_18_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/ISSCC.2006.1696216"
    },
    {
      "key": "e_1_2_1_19_1",
      "volume-title": "Proceedings of the 15th International Conference on Computer Applications in Industry and Engineering",
      "author": "Nichols K.",
      "year": "2002",
      "unstructured": "Nichols , K. , Moussa , M. , & Areibi , S. ( 2002 ). Feasibility of floating-point arithmetic in FPGA based artificial neural networks . Proceedings of the 15th International Conference on Computer Applications in Industry and Engineering . San Diego, California Nichols, K., Moussa, M., & Areibi, S. (2002). Feasibility of floating-point arithmetic in FPGA based artificial neural networks. Proceedings of the 15th International Conference on Computer Applications in Industry and Engineering. San Diego, California"
    },
    {
      "key": "e_1_2_1_20_1",
      "volume-title": "International Conference on Neural Information Processing (pp. 1081--1090)",
      "author": "Nomura O.",
      "year": "2007",
      "unstructured": "Nomura , O. , & Morie , T. ( 2007 ). Projection-Field-Type VLSI Convolutional Neural Networks Using Merged/Mixed Analog-Digital approach . International Conference on Neural Information Processing (pp. 1081--1090) . Springer-Verlag. Nomura, O., & Morie, T. (2007). Projection-Field-Type VLSI Convolutional Neural Networks Using Merged/Mixed Analog-Digital approach. International Conference on Neural Information Processing (pp. 1081--1090). Springer-Verlag."
    },
    {
      "key": "e_1_2_1_21_1",
      "doi-asserted-by": "crossref",
      "DOI": "10.1007/0-387-28487-7",
      "volume-title": "FPGA Implementations of Neural Networks",
      "author": "Omondi A.",
      "year": "2006",
      "unstructured": "Omondi , A. , & Rajapakse , J. ( 2006 ). FPGA Implementations of Neural Networks . Springer . Omondi, A., & Rajapakse, J. (2006). FPGA Implementations of Neural Networks. Springer."
    },
    {
      "key": "e_1_2_1_22_1",
      "doi-asserted-by": "crossref",
      "DOI": "10.1007/978-3-540-75398-8",
      "volume-title": "Audio, Image and Biomedical Signal Processing using Neural Networks",
      "author": "Prasad B.",
      "year": "2008",
      "unstructured": "Prasad , B. , & Prasanna , S. ( 2008 ). Speech , Audio, Image and Biomedical Signal Processing using Neural Networks . Springer . Prasad, B., & Prasanna, S. (2008). Speech, Audio, Image and Biomedical Signal Processing using Neural Networks. Springer."
    },
    {
      "key": "e_1_2_1_23_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.5555/1484785.1484787"
    },
    {
      "key": "e_1_2_1_24_1",
      "volume-title": "Proceedings of SBAI - Simposio Brasileiro de Automao Inteligente, (pp. 78--83)",
      "author": "Wolf D. F.",
      "year": "2001",
      "unstructured": "Wolf , D. F. , Romero , R. A. , & Marques , E. ( 2001 ). Using embedded processors in hardware models of artificial neural networks . Proceedings of SBAI - Simposio Brasileiro de Automao Inteligente, (pp. 78--83) . Wolf, D. F., Romero, R. A., & Marques, E. (2001). Using embedded processors in hardware models of artificial neural networks. Proceedings of SBAI - Simposio Brasileiro de Automao Inteligente, (pp. 78--83)."
    },
    {
      "key": "e_1_2_1_25_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/72.554195"
    },
    {
      "key": "e_1_2_1_26_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1007/978-3-642-03767-2_10"
    },
    {
      "key": "e_1_2_1_27_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/CVPR.2005.254"
    },
    {
      "key": "e_1_2_1_28_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1109/CVPR.2005.177"
    },
    {
      "key": "e_1_2_1_29_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1145/1553374.1553486"
    },
    {
      "key": "e_1_2_1_30_1",
      "doi-asserted-by": "publisher",
      "DOI": "10.1145/1553374.1553453"
    }
  ],
  "container-title": "ACM SIGARCH Computer Architecture News",
  "original-title": [],
  "language": "en",
  "link": [
    {
      "URL": "https://dl.acm.org/doi/10.1145/1816038.1815993",
      "content-type": "unspecified",
      "content-version": "vor",
      "intended-application": "text-mining"
    },
    {
      "URL": "https://dl.acm.org/doi/pdf/10.1145/1816038.1815993",
      "content-type": "unspecified",
      "content-version": "vor",
      "intended-application": "similarity-checking"
    }
  ],
  "deposited": {
    "date-parts": [
      [
        2025,
        6,
        18
      ]
    ],
    "date-time": "2025-06-18T11:22:44Z",
    "timestamp": 1750245764000
  },
  "score": 1,
  "resource": {
    "primary": {
      "URL": "https://dl.acm.org/doi/10.1145/1816038.1815993"
    }
  },
  "subtitle": [],
  "short-title": [],
  "issued": {
    "date-parts": [
      [
        2010,
        6,
        19
      ]
    ]
  },
  "references-count": 30,
  "journal-issue": {
    "issue": "3",
    "published-print": {
      "date-parts": [
        [
          2010,
          6,
          19
        ]
      ]
    }
  },
  "alternative-id": [
    "10.1145/1816038.1815993"
  ],
  "URL": "http://dx.doi.org/10.1145/1816038.1815993",
  "relation": {
    "is-identical-to": [
      {
        "id-type": "doi",
        "id": "10.1145/1815961.1815993",
        "asserted-by": "subject"
      }
    ]
  },
  "ISSN": [
    "0163-5964"
  ],
  "subject": [],
  "container-title-short": "SIGARCH Comput. Archit. News",
  "published": {
    "date-parts": [
      [
        2010,
        6,
        19
      ]
    ]
  },
  "assertion": [
    {
      "value": "2010-06-19",
      "order": 2,
      "name": "published",
      "label": "Published",
      "group": {
        "name": "publication_history",
        "label": "Publication History"
      }
    }
  ]
}
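
A record shaped like the JSON above can be summarized with a few lines of Python. This is a minimal sketch, assuming the record has already been loaded into a dictionary; the helper names `date_from_parts` and `summarize`, the defaulting of a missing month or day to 1, and the shortened `sample` record are illustrative choices, not part of the record itself.

```python
from datetime import date


def date_from_parts(parts):
    """Turn a date-parts entry such as [2010, 6, 19] into a date.
    A missing month or day defaults to 1 (an assumption of this sketch)."""
    year, month, day = (list(parts) + [1, 1])[:3]
    return date(year, month, day)


def summarize(record):
    """Pull a few fields out of a record shaped like the JSON above."""
    authors = [f'{a["given"]} {a["family"]}' for a in record.get("author", [])]
    refs = record.get("reference", [])
    return {
        "doi": record["DOI"],
        "issued": date_from_parts(record["issued"]["date-parts"][0]).isoformat(),
        "authors": authors,
        # Some reference entries carry a DOI; others only unstructured text.
        "references_with_doi": sum(1 for r in refs if "DOI" in r),
        "references_without_doi": sum(1 for r in refs if "DOI" not in r),
    }


if __name__ == "__main__":
    # Shortened stand-in for the full record; values are copied from it.
    sample = {
        "DOI": "10.1145/1816038.1815993",
        "issued": {"date-parts": [[2010, 6, 19]]},
        "author": [{"given": "Srimat", "family": "Chakradhar"}],
        "reference": [
            {"key": "e_1_2_1_1_1", "DOI": "10.1109/5.726791"},
            {"key": "e_1_2_1_12_1", "unstructured": "Haykin, S. (2008)."},
        ],
    }
    print(summarize(sample))
```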