Ask Slashdot: How Many (Electronics) Gates Is That Software Algorithm?
dryriver writes "We have developed a graphics algorithm that got an electronics manufacturer interested in turning it into hardware. Here comes the problematic bit... The electronics manufacturer asked us to describe how complex the algorithm is. More specifically, we were asked 'How many (logic) gates would be needed to turn your software algorithm into hardware?' This threw us a bit, since none of us have done electronics design before. So here is the question: Is there a piece of software or another tool that can analyze an algorithm written in C/C++ and estimate how many gates would be needed to turn it into hardware? Or, perhaps, there is a more manual method of converting code lines to gates? Maybe an operation like 'Add' would require 3 gates while an operation like 'Divide' would need 6 gates? Something along those lines, anyway. To state the question one more time: How do we get from a software algorithm that is N lines long and executes X number of total operations overall, to a rough estimate of how many gates this algorithm would use when translated into electronic hardware?"
Either implement it as shaders for a GPU (or a DSP) or hire somebody who actually knows about hardware design if you are hell-bent on implementing an ASIC.
Slashdot: Where *not* to go to get specific advice about specific technical issues.
AntiFA: An abbreviation for Anti First Amendment.
beowulf clusters does your algorithm desire?
If you learn to program in Verilog, you could try synthesizing for some FPGA and see how much space it takes up on the FPGA. But then programming for an FPGA differs from programming for a serial computer in that each line of code runs essentially as a separate thread, usually triggered on another signal (such as a clock) having a positive or negative edge.
You'd think the "electronics manufacturer" would have some idea how to estimate this.
Make up a number, then when they complain that it was way off, blame it on their management changing scope a hundred times throughout the life of the project!
This is the dumbest ask slashdot ever. Give them the source code and tell them to figure it out. Remind them they are the EEs. I don't know which of the two companies here is stupider.
As a first-order approximation, you can translate your C/C++ code to a hardware description language (HDL) such as VHDL or Verilog. Tools exist for this process. The result won't be optimized for the peculiarities of HDL, but it will provide a good start. From there, you can port the HDL to a Xilinx or Altera FPGA netlist using vendor-specific tool chains. The porting effort will summarize the logic and memory resources of your implementation. Any digital hardware engineer worth their salt should be able to translate FPGA utilization metrics into whatever platform they are interested in.
I think you may have a better chance of getting an answer if you ask this question on Stackoverflow (or one of its related sites).
Unfortunately, I think asking on Slashdot is only likely to get you some tired and outdated memes / jokes...
If they plan on implementing this in hardware, then they should have people who are capable of answering that question. If instead, they are just a manufacturer and aren't capable of doing the actual hardware design, then you have bigger problems than answering this question. That is something you should find out about ASAP.
Develop out the algorithm in Minecraft using ProjectRed (Integration module, specifically) and then you can easily count the gates! :-)
"We don't know."
The most common languages for chip or FPGA design would be VHDL or Verilog. Now there is also High Level Synthesis (http://en.wikipedia.org/wiki/High-level_synthesis), in which you can use C/C++ directly. So if your using a tool like Xilinx's Vivado (http://www.xilinx.com/products/design-tools/vivado/integration/esl-design/) then you can go directly from C/C++ to gate count. However, even in C/C++ it probably needs lots of work from where it is.
The question "How many gates does it take to implement this algorithm?" is stupid. It's like asking "How long is a piece of string?"
There will always be a time/space tradeoff, even with translating an algorithm to hardware. You can save time by throwing more gates at the problem to increase parallelism, or you can save space by reusing gates in sequential operations.
...just rewrite your software in machine code.
You need a C to VHDL translator. Here's a tutorial for one.
Only the parts of the algorithm that have to go really fast need to be fully translated into hardware. Control, startup, debugging, and rarely used functions can be done in some minimal CPU on or off the chip. So, for sizing purposes, extract the core part of the code that uses most of the time and work only on that.
it would only make sense to reuse the same adder circuit for each addition, instead of making a separate adder circuit for each operation.
then you'd add control logic to move the data to adder circuits, multiplier circuits, etc.
then essentially what you have is a microprocessor.
then you just turn that microprocessor into the simplest one possible. which is basically a queue and a stack, and a few elementary logic operations. you can do operations a bit at a time.
so the number of logic gates your program needs it the number to make a queue and a stack, and a few elementary logic operations, and that's probably on the order of about 40 gates.
Implement the algorithm in VHDL and test it on an FPGA. I would imagine you'll need to pay someone $$$$$ to do that for you...
Maybe they can do the translation, but they need a number for how many gates so that they can give a number for how many dollars.
Hello,
It is probable that you can break down your algorithm -(I do not mean code) into a pipeline of elementary processing and find implementations (IP) for each of them.
to give out an estimate:
- subdivise your algorithm into simpler pieces
- find for each simple piece how it can or could be implemented in hardware and the complexity of each piece.
- do the sum.
Indeed an hardware designer or consultant would be of a great help here.
I would suggest finding someone who has a better idea of what a logic gate is. They'll know what to do.
What you want to do is called high-level synthesis (going from C to hardware description language (HDL) to generating gate-lists from that HDL) and there's plenty of software to do that with. A neat open-source package for HLS is LegUp (http://legup.eecg.utoronto.ca/), check it out to get an idea of what the process consists of.
You could write in Bluespec then compile to SystemVerilog from there to something synthesisable...FPGA, ASIC and count transistors.
But many of those transistors will be providing necessary infrastructure in addition to the "raw" algorithm
Only idiots write "C/C++." As soon as I see that, I stop reading. It is clear indicator that the writer doesn't know what he's talking about.
It's about more than gates. It is about registers, ALUs, gates, and how they are all connected. There are many different possible architectures, so it depends on the design: some designs are faster but take more real estate. There are algorithm-to-silicon compilers (I know: I wrote one for a product company during the '80s and it is apparently still in use today) but each compiler will assume a certain architecture. I would recommend one but I have been out of that field for decades.
There are a number of tools on the (commercial) market that can compile (a subset of) C to hardware into a hardware description language (Verilog/VHDL), e.g. from Cadence and Synopsys. See http://en.wikipedia.org/wiki/High-level_synthesis for an overview of the approach and links to tools. There are also some open source tools that can turn C code into Verilog or VHDL, but they are not very mature in my opinion.
However, you will not get a single number for gate complexity out of these tools. Depending on the requirements and tradeoffs (smaller chip area vs. higher speed, single cycle vs. pipelined implementation, target device as FPGA or ASIC), the number of gates (or logic blocks for FPGAs) required will differ significantly. From my experience, you definitely need a hardware design expert to obtain useful (i.e., more-or-less optimized) results from these tools - and you should expect having to invest significant effort for restructuring your C code so the high-level synthesis tools can grok it.
Here is a proven method for calculation.
If your code is:
a) C: divide the number of lines with 7
b) C++: divide the number of lines with 5
c) Ruby/Python/Java: divide the number of lines with 3
d) Perl: multiply the number of lines with 42
e) C#: resign.
...but he's pissed that the lawmakers are focused on re-election and partisan bullshit, so it'll never get done.
The answer, as you might imagine, is complicated and depends on how these gates are implemented. Think for instance you could design a chip to do this, you could write RTL to do this in an FPGA, or you could even write the algorithm into more software on an embedded processor of some kind. Is this electronics manufacturer one that makes chips or one that makes systems (boards, cases, etc). If it is the former they should have people who can work with your people to figure this out. If it is the latter then why do they care? Are they really asking you to provide a chip which implements your algorithm? Ask some more questions...
The question seems so ill-posed that one has to wonder if there's a product or service advert lurking... but assuming this is real.
Software doesn't automatically translate directly to hardware. As others have noted, break out the algorithmic core from the setup and finish. Presumably there is some part of the code which is the most critical in steady state. Describe that to their hardware engineers in whatever depth is required. Depending on the algorithm, the ASIC library elements available (or FPGA units, etc.) you may want to make some substantial adjustments to the "code" to make it fit within the design parameters of the available device. This should be an iterative process, not a single estimate based on a pure software perspective.
If there isn't a clearly identifiable set of "hot blocks" the chances of there being a good hw implementation fit is poor. If there is, it may still be necessary to change the algorithm details to fit but it should be "doable". Whether it is worthwhile depends on the volumes and the performance gains.
It's typically not that hard to convert C/C++ To SystemC and then use one of the many SystemC to gates flow. Various tools have come and gone in the past decade as C/C++ to gates flow isn't exactly optimized and is difficult to do functional equivalency checking.
How many Gates will it take to implement your software project?
One. His name is Bill, and here is yours.
Science advances one funeral at a time- Max Planck
Mod this offtopic if you want but now I can't see my comments, I can't see if anyone has responded to them, and it has become almost impossible to participate in discussions as a result. WTF, /.?
Drill baby drill - on Mars
...this should get you started:
http://en.wikipedia.org/wiki/C_to_HDL
Find a suitable converter, then grab a free (or evaluation) version of an FPGA design tool, for example one of these (I only suggest these over the many other, probably equally as good alternatives, as I've used them myself):
http://www.xilinx.com/products/design-tools/ise-design-suite/index.htm
And with a bit of work you should be able to produce output that will essentially be your code implemented in programmable logic, and the tools will tell you the number of gates/cells required.
What I would say, is that you'll have a much easier ride if your algorithm is in C rather than C++.
Despite saying that you have no experience with this sort of thing, defining logic in something like VHDL is basically programming. Sure, you'll need to develop a fair understanding of the hardware, but with the libraries of pre-built components available from the numerous companies who produce programmable hardware like FPGAs and CPLDs, you may find you could do a lot more than you think yourself.
Haven't tried it, but Cadence's C to Silicon might be up for the job. Also keep in mind that in hardware you have very different requirements than in software, and parallellisation has interesting effects on the number of gates. The best option is to get an EE, preferably with experience in digital design, to take a look at it. Other options are SystemC compilers, but they're not really up to production use yet as far as I know. And it is also very technology dependant, sometimes complicated logical functions that are common are implemented directly. This isn't something you can just wing!
Just give the number of gates in whatever processor you're running this thing on as an upper bound.
There really isn't a great way to answer your question without a detailed analysis of your code.
There are more factors to the number of gates required for a given task than just the complexity of code. Clock speed can be a major factor in determining the number of gates required for a given algorithm. Another major factor is the part you are targeting. The number of design elements in FPGAs used can change just by targeting a different device family.
Even if your algorithm was small enough to fit into a part, there are other issues that could arise (such as not enough bandwidth or pins for your memory device(s)).
It sounds like the electronics manufacturer doesn't have the resources to determine the number of gates for you. It looks like your only avenue is to ask a third party to review your code (under NDA) to help you determine the approximate gate requirements. This won't be cheap.
google: https://www.google.com/search?q=convert+c+code+to+fpga
Clearly it's not possible to render a software program as hardware. If everyone who explained the process (use Verilog) above is correct, that would mean that the exact same algorithm exists as both hardware and software.
We can't have the same algorithm exist as both hardware and software, because that would mean algorithms are hardware just as much as they are software.
that would mean all the people whining about "software patents" may as well be whining about unicorns. I hereby declare Verilog, ASICs, and FPGAs to be non-existent so we can continue to pretend that there is such a thing as a "software patent".
This is a horrible question to ask. Software is a tool to lower hardware requirements.
Compile your algorithm to the simplest RISC architecture reasonable. For most, something among the lines of ARM or MIPS works. Then, take note of all variables and add up how much RAM they'll take. Consider every bit (yes, bit, not byte) as a D-flipflop and convert every instruction (post-compile, in assembly) into a respective set of logic gates. A bit of googling should get you those values.
If your algorithm is reasonably complicated, chances are, you'll get a number that seems absurdly high compared to what state-of-art hardware is available.
In practice, it's probably best to just pick an off-the-shelf CPU and run the software on it. There might be some parts that are better done in hardware than in software, but you should get someone who knows what they're doing for that.
You already have your algorithm running in electronic hardware, right?
Your current gate count is the sum of
* the gate count of your CPU
* the gate count of your RAM
* the gate count of your program ROM
So that's an upper bound on the gate count.
If that number is too big for your manufacturing partner,
then you have an optimization problem.
Optimization is a hard problem...
Write out the truth table for each output as a Karnaugh map incorporating every input. Count the number of gates needed to solve the map, and that's your answer for that output bit. Repeat for every other output bit. Add all those numbers together, and that's a fair estimate of how many gates you'll need.
Of course, this method requires that your number of input bits must be fairly small. Don't forget that memory counts as both input (when read) and output (when written). For nontrivial applications, you'll find that the number of gates quickly approaches "a lot".
You do not have a moral or legal right to do absolutely anything you want.
It doesn't take many gates for a Turing machine that will run your algorithm but it's likely to be slow. A proper hardware implementation will optimize everything and be as parallel as possible.
The problem as stated isn't adequately constrained.
You guys are probably trying to get a manufacturer to make Scrypt-mining ASICs.
Get free satoshi (Bitcoin) and Dogecoins
You guys really need the help of a Comp E or EE. Not to be super critical but from reading your post/question it's pretty easy to see that you don't understand hardware design at all. Nothing wrong with that but you're clearly purely software. Any tools to convert code to HDL is not going to help you at this point because you're not going to understand the hows and whys of how things work on a gate level. There's no magic program that can take a program of some complexity and turn it into logic gates.
Maybe you could get by with an embedded processor and some peripheral ip blocks in your asic... maybe the whole thing is one giant 32 stage image processing pipeline. It really depends on your algorithm.
FYI reversing a bit (1 bit) takes 2 gates or 16 gates per byte. 'Add' takes several hundred if you do anything more (look ahead) then the most basic versions. 'Divide' takes thousands of gates. Any memory you use as buffers will be on chip if you want to get the performance of an asic and that costs gates (4 gates per bit) and there's no free/malloc (no memory reuse basically). Any external memory access will require a dram controller so that adds gates.
1. Solicit buyers
2. Land one
3. Post on Slashdot
4. ???
5. Profit!
Speed of result and gate count are directly related to each other, except for a single bit input and output.
Create some reasonable metric, for example 5 gates per instruction. Compile your program and somehow get the number of instructions in assembler, then multiply by 5 or more. In good case you could expect 1-2 gate(s) per instruction, in bad case you could expect 2-3 gates per instruction, and those two more gates are just "to make sure" you won't underestimate it.
If you don't have anyone on your team that has done ASICs or FPGAs, the only correct response is "We don't have the expertise here to estimate that."
People have mentioned C to VHDL translators. While they exist, they aren't magic and will not give you a useful answer. They do not "know" what parts of the design are important to do in logic and they do not "know" what tradeoffs you want in order to hit the right balance between speed, gate count, and so forth.
Do not ask a computer scientist to be an electrical engineer.
Theoretical answer:
Recode your algorithm in SystemC (a c++ library that can be used to implement a register transfer language representation of your algorithm) and synthesize it with one of the available tools (e.g., Accelera, Synopsys, Calypto, etc) targeting a typical library (e.g, 28nm TSMC), at a particular clock frequency.
Practical answer:
Ask someone with hw design experience to estimate it for you...
FWIW, nobody wants an "exact" size in logic gates, all they want an idea in complexity. The big ticket items people care about are the size in bits of RAMs (and how many simultaneous read/write ports it might need) and complicated math that is likely to take more than 1 clock cycle to complete (e.g., like a floating point math operation) and the data-width of the main data path at the throughput that you want to have. Simply multiplying the data path width by the estimated number of pipeline cycles is generally proportional to the eventual area minus the RAMS and special math ops (which is why you need to identify those parts separately).
Generally, I've found that naïve "software" algorithms have not been very amenable to HW implementation without some amount of rework and the fact that you do not have an answer to the posed question would likely lead me (and probably your potential customer), that your algorithm is half-baked from a HW implementation point of view... Just food for thought...
Write it in C. Not C++, not Perl or PHP or, gods forbid, Java. Actual C.
Then run it through some debugging programs to get a ballpark idea of how many instructions are actually being implemented, and how many library functions are being called if you can't isolate the actual process to a small enough operation.
As far I Know, if you want to run an algorithm , you need a Von Neuman computer, with rom, ram and I/O , because your algorithm need to read and write data values and read the execution code itself. This arragement is much more eficient than singles logical gates. What You will do when find a bug in your code ? Bake another bunch of chips ?
Short answer: you need to contract an electronics engineer. Possible: You could dump the non-optimized assemby language (-S on most compilers) for a popular processor family e.g., 80686, PA-RISC, etc. The manufacturer probably has resources to convert "this pile of 80686 instructions" to "an ASIC that does the same thing really well".
For what it is worth, here is a circuit I developed to see what the gate configuration (nor only) would look like for the implementation of a condition that the input switches be:
0
1
01
11
http://www.neuroproductions.be/logic-lab/index.php?id=3699
you know, the counting integers 0,1,2,3 - the same code that I have on my luggage. I thought there might be a fun implementation related to security or something - a hybrid mechanical/electronic locking system.
It turned out to be super hard for me to figure out this final result. Nonetheless, the result was most interesting and I encourage you to find a more efficient configuration.
I did this using basic logic and a crapton of that time-honored tradition of guessing and trial and error.
I can only begin to imagine the complexity of trying to implement and design circuits based on algorithms written in anything above assembler level.
"It takes one gate that accepts our input and outputs a desirable answer. We would like you to design that gate."
*oopsy* that should have been
;)
0
1
10
11
but you knew that
This Bitcoin mining thing will never go anywhere :-)
Then, stick your pinky into the corner of your mouth and do your best evil laugh!
I swear to God...I swear to God! That is NOT how you treat your human!
There's been several people who suggested using a high-level synthesis tool to convert your software (c/c++) directly to HDL (verilog/VHDL) of some kind. This can work and I've been on this task and seen it's output before. The catch is; unless that software was expressly and purpose written to describe hardware (by someone who understands that hardware and it's limitations and how that particular converter works), it almost always makes awful and extraordinarily inefficient hardware.
Case in point - we had one algorithm developed in Simulink/Matlab that needed to end up in an FPGA. After 'pushing the button' and letting the tool generate the HDL, it consumed not just 1 but about 4 FPGAs worth of logic gates, RAMs, and registers. Needless to say the hardware platform only had one FPGA and a good portion of it was already dedicated to platform tasks so only about 20% was available for the algorithm. We got it working after basically re-implementing the algorithm with the goal of hardware in mind. The generation tool's output was 20 times worse than what was even feasible. If you're doing an ASIC you can just throw a crap-load of extra silicon at it, but that gets expensive very quickly. Plus closing timing on that will be a nightmare.
My job recently has been to go through and take algorithms written by very smart people (but oriented to software) and re-implement them so they can fit on reasonably sized FPGAs. It can be a long task sometimes and there's no push-button solution for getting something good, fast, cheap. Techies usually say you can pick two during the design process, but when converting from software to hardware you usually only get one.
Granted this all varies a lot and depends heavily on the specifics of the algorithm in question. But the most likely way to get a reasonable estimate is going to be to explain the algorithm in detail to an ASIC/FPGA engineer and let them work up a prelim architecture and estimate. The high-level synthesis push-button tools will give you a number but it probably won't be something people actually want to build/sell or buy.
I just tried this and all my money was transferred to a different account.
http://davidmoorephoto.com/storage/cache/images/000/975/1I2B9665,large.2x.jpg?1372317810
Everything traditional for graphics has been done for ages in software, or in traditional hardware. Algorithms are known, and have been implemented. No-one would need to ask any questions. Now there's a new kind of algorithm, and questions are being asked. Means, something unusual, used for some unusual purpose, in a context where all-purpose processing capability is too limited / too expensive / to energy-consuming. Phones have been carrying around sufficient power for quite some time, anything wired is a non-issue. Something mobile, energy inaccessible, high expected runtime. Number of gates seems to matter, a small device. Ball-pen. A dedicated silicon for this purpose, price does not seem to matter that much.Wrist-watch. The algorithm itself is unusual, else there would be no need to ask questions. Nothing really new in the 2d or 3d area. Either a different kind of 3d than what usual hardware does, voxels or similar, or rendering something holographic. Wrist watch with holographic hands?
Neat. I want one.
Please make the FPGA as big as it can fit in to keep the frequency and thus energy consumption down. Don't want to recharge my watch as often as my cellphone.
Take all permutations of inputs, generate all possible outputs, put them in a table. How much space do you need to store the table?
Well, that would give you one boundary condition, granted, a pretty extreme measure.
On the plus side, if you implement this way, it takes constant time to generate an output.
The first step is to compile it into an architecture-neutral assembly program. Then, appropriate tools can turn it into a machine-specific set of instructions, and the answer will become apparent, or at least a reasonable approximation!
While an interesting question (I didn't even know hardware manufacturers were in the habit of converting software into hardware), why don't they figure it out themselves? They must have the tools/people to do it. Are you afraid they'll "steal your algorithm" if you give them the source? (that's much less interesting)
--
Stay tuned for some shock and awe coming right up after this messages!
Making electronics out of an algorithm depends mainly on the architecture that is going to run the algorithm. The only scenario in which estimating a gates count makes sense is if the electronic manufacturer wants to put it on an FPGA or make an ASIC out of it. If that is the case, a C-to-Gate compiler may be the route to go.
There are also plenty of off-the-self embedded processors with HW acceleration, DSP and GPU that may very well also tackle the problem efficiently on embedded software. And porting to embedded software is less expensive than converting it to gates. It depends strongly in your algorithm but the bottom line is that you may not need an electronic manufacturer as a partner.
How many gates could a gate chuck chuck if a gate chuck could chuck gates?
and go from there. Do NOT listen to /.. HLSs have come a long way but I'd never really trust them(especially for an ASIC design). Also, FPGA LUT/BRAM/etc usage isn't really a great estimate of how many ASIC gates will be used. If you want a decent answer you're going to have to pay for it and you're NOT going to get it from /.. Good luck though, I think you'll need it.
At*
Nice troll. Algorithms can be implemented in many different ways in either software or hardware, much in the same way ideas can, because algorithms are ideas. Patents (used to) cover only specific implementations of ideas, not the ideas themselves.
Look at Mathworks. They have a solution for this.
Compile the project and take a look at the generated machine code. Take the unique set of those instructions and look at the circuits needed by your x86 architecture to execute those instructions. This could be a reasonable estimate.
Go to a hobby site about EE learn about tools they use. Slashdot use to be is SE site...
Seems like you should head on over to the MAKE forums http://makezine.com/forums/
Looking at the responses, I'd say slashdot is a pretty good place to get technical advice.
Cadence's C to Silicon is just part of Cadence's Digital Implementation suite (whatever name they gave to it). As with any new tool to you, it will take some time to get something out from it. It is called C to Silicon, but it supports also SystemC to Silicon. SystemC is an extension to C++, although only a subset is supported for synthesis. You are free to download SystemC from Accellera.
A quick estimation of software complexity is hard to do. One thing the OP can do is count up how much memory the algorithm needs, as memory is usually expensive and might be slow to access. If the algorithm is a typical data in / data out process, he/she should also get a list of operations that needs to be done per input data unit. Let me get an example, video deinterlacer methods ( see http://www.100fps.com ). Discarding even or odd lines is the simplest solution. Blending needs some video lines in order to do the vertical rescaling. Weave needs you to store a video frame, or half image, in order to display both even and odd lines at the same time. And at the end you might consider RAM intensive algorithms like area, temporal or motion based algorithms, with added extra motion blur.
So, no, it is not easy to guess what algorithms will take. However, it should be straightforward to know RAM usage.
I see some people mentioning high level synthesis (HLS). This is the right way to go. There are academic tools, tools given practically for free by FPGA vendors, and then tools that sell for $$$ (Disclaimer: my company sells one such commercial HLS tool). HLS has come a long way, after more than 25 years of R&D going into it. Graphics, image processing, signal processing, etc. are the sweet spots of HLS, where the quality of the generated RTL is comparable to or better than human written RTL. Looks like you'll fit into the sweet spot.
So for quick and dirty estimates, you can taken an academic tool. For more serious work, I suggest you get an evaluation license from one of the HLS vendors, or FPGA companies, and try to get your synthesis flow down to the gates going. Note that there are two types of synthesis tools required in the tool chain to get gates. One is the HLS tool, and one is RTL synthesis tool. However, for estimation purposes, you can put some faith into the estimates that the HLS tool itself will provide.
Amen to Asmodae teachings. I work with people that translate from Matlab, written by bright mathematicians, into Verilog, and let me tell you it is not as straightforward as many people think. Some way to do things do not fit into existing hardware.
x264 is the BEST video encoder in the world, and rests muchly on the shoulders of one young programmer. This guy was associated with a video rebroadcast company that wanted the best possible solution for scalable H264 encoder systems. Now, x264 is open source, and usually runs on various types of standard x86 PC equipment. The young x264 guy convinced himself that OBVIOUSLY x264 algorithms could be moved to dedicated hardware logic, and sought financial backing from the broadcast company to finance this.
Long story short- many years later, x264 is STILL a solution almost entirely used on general PC hardware. The attempt to move the decent quality modes of x264 encoding to dedicated hardware was a total failure. The dreadful quality of Intel's (very slow) H264 hardware encoder built into most current Intel CPU parts shows the problem. Hard wired logic is very good at very dumb and simple algorithms, but lousy when the algorithm becomes more sophisticated. This fact is why we have CPUs (and the modern computer architecture) in the first place.
What company would be so cretinous to seek a hard-wired solution, when an ARM SoC, capable of driving a 4K display, can be had for less than 5 dollars?
I would remind you that a few years back, when ill-informed home cinema enthusiasts were paying 1000 dollars plus just for a video box that scaled the video image, people with brains were getting VASTLY better scaling functions from their PC video cards that cost less than 200 dollars. The dedicated hardware solutions were putrid (and obviously completely non-reprogrammable).
However, most 'professional' (and I use the word very lightly) video personnel in the broadcast industry still use insanely over-priced hardware junk from dinosaur companies that date back to the birth of TV, and get business on their (long irrelevant) reputations. So, for instance, the quality of the vast amount of broadcast TV is GARBAGE, because real-time H264 (and MPEG2) encoder hardware boxes are used, that produce the worst possible compression possible for the unthinkably large file sizes each hour of TV represents.
Again, what 'video' algorithm cannot be implemented as a general shader operation on the GPU of a modern, super cheap ARM SoC part?
Computers are so cheap and low power today that turning an algorithm into gates would be a silly way to proceed. So the question is not really relevant except academically.
We (ConcurrentEDA.com) have developed a tool call Concurrent Analtyics that analyzes a program's x86 code and estimates the gate count. This tool works for Xilinx and Altera FPGA chips and provides an upper bound since logic optimization reduces the gate count. Essentially, we have an extensive library of all software assembly instructions and their gate count in an FPGA. Synthesizing software into a chip requires more work but we have an internal tool for that as well. We translate x86 into a hardware description language (HDL) that the vendor's tools synthesize into FPGA gates. Over 1 million lines of high-performance HDL have been generated using these tools since 2006. Both tools are internal tools that we use to offer accelerated FPGA design services. (feel free to contact me directly RayHoare _at_ concurrenteda _dot_ com)
Knowing what algorithm you want to run in hardware in not even close to enough to estimate gates. You need to know the algorithm, and the required performance, and have a sketched out HW design that meets those goals. THEN you can estimate gate count.
For a simple example of why this is, consider processors. A 386 and a Sandy Bridge i7 implement very similar "algorithms" - it's just fetch->decode->execute->writeback all day long. If you implemented them in software emulation, it would be very similar software with some additional bits for the newer ISA features on the i7. But a 386 is about 280 THOUSAND gates, and the i7 is about 350 MILLION gates/core - three orders of magnitude different. Of course, there's at least a 2 order of magnitude performance difference too - it's not like those gates are going to waste.
Point is, knowing the algorithm isn't enough to get even a finger in the wind guess at gate count. If you need an answer to this question, you need to get competent HW design people looking at it.
because if the hardware company is thinking "gates" instead of "cycles," they want to implement it in a FPGA. hell, if they were going to put it on a dedicated microprocessor, they'd just recast it with libraries for that processor and recompile.
if this is supposed to be a new economy, how come they still want my old fashioned money?
The manufacturer is probably asking how many gates you need to implement the algorithm exactly as it is coded, with exactly as much parallel or sequential logic as it already has, and that will have a fairly specific answer.
While that number could be determined, it would not be very useful. Hardware implementation, especially when targeting FPGA's, get most of their performance advantage by exploiting more parallelism than is achievable by running on a processor.
No, the manufacturer isn't make any assumptions about how the algorithm is translated. The deal in gates. Gates are the most direct measure of how much the hardware will cost to manufacture.
Without a direct number for gates you will to come about in in a more indirect fashion. How much memory does the algorithm use? What data structures are used how big are they? (*all* data structures. An integer is a data structure for this purpose) What operations (adds, subtracts, etc) are needed and how many are required to go from input to result? With those you can usually come up with a ball park guess of how many gates will be required. There are always optimizations and non-obvious operations that get overlooked but it is a good start.
The answer is "too many"
"You cannot directly interpret a software algorithm to hardware." Why? Here are the follow ups: What type of hardware, FPGA, GPU, custom ASIC? What part of the algorithm NEEDS to be in hardware to gain performance over basic system resources (CPU, GPU)? Who is going to pay for this little experiment?
As others more qualified have already stated, you rarely if ever get a direct translation nor do you always need to interpret the entire algorithm to hardware. For a hardware manufacturer to even ask the question is suspect, unless it was a sales or marketing rep, then it might make sense. The hardware people will know best how to do this, for them to ask you ... RUN!
My suggestion would be to say thank you and stick with software. You will probably spend enough time working this out that someone else will implement it before you, better. If you're not talking to Nvidia, AMD or Intel you're probably wasting your time.
I seem to recall that during my hardware classes for Computer Science, there was a method for designing our own ALU and arithmetic operators by using CMOS logic. It's been a while, but the computational part of your algorithm can be designed by means of CMOS logic. It might not be the most efficient design (Heck, you can make everything with NAND's, but it might not be optimal) but it'll give you at least some indication of the required hardware for the execution of the algorithm.
Eventual memory use and storage of variables do fall out of the scope of this, however. And it tends to be quite some work.
The best path for you would be ForteDS Cynthesizer, Mentor Catapult C and C-toSilicon from Cadence. Those are behavioral synthesis tools. I have used some of those and they are very strong for datapath oriented designs, if you follow their design guidelines they are quite good to convert C code to RTL. There's a free thing you can try call xPilot, never touched it though...
linux user #271173
You can't patent math. Does that mean that the world doesn't exist?
We (ConcurrentEDA.com) have developed a tool call Concurrent Analtyics that analyzes a program's x86 code and estimates the gate count. This tool works for Xilinx and Altera FPGA chips and provides an upper bound since logic optimization reduces the gate count. Essentially, we have an extensive library of all software assembly instructions and their gate count in an FPGA. Synthesizing software into a chip requires more work but we have an internal tool for that as well. We translate x86 into a hardware description language (HDL) that the vendor's tools synthesize into FPGA gates. Over 1 million lines of high-performance HDL have been generated using these tools since 2006. Both tools are internal tools that we use to offer accelerated FPGA design services. (feel free to contact me directly RayHoare _at_ concurrenteda _dot_ com)
I don't know how easy it would be to port your specific algorithm, but I did my masters thesis around a language called Handel-C. It's a super-set of C that provides a high level FPGA programming interface. That might get you some distance in determining the number of gates. Disclaimer: I was working with it a few years back and the documentation/support was appalling, I don't know if it's become any better.
It seems like, if you could describe the algorithm in a sufficiently low-level language like C, they shouldn't be asking you how many gates it would take. If they're the hardware manufacturer, they should know. Besides, there are too many factors that could influence the gate count depending on how the manufacturer decided to implement the adders, etc. None of these things seem like questions that programmers should be responsible for answering.
Also consider that there are plenty of companies around that will happily do the translation from code to gates for you and have experience with these design flows. Talk to them, you'll see that translating algorithms to gates is just one of the pieces of the puzzle.
Truly this is a bizarre question to start with,
Software algorithms do not exactly translate to hardware, only some functions of the code can be translated...
For the Hardware Manufacture to ask this question is even stranger if the code. In the end none of this adds up.
It smells more like a some trolling for advice on converting some sweet code that fails to scale cost effectively into something that can be stamped as an asic....
Then again... what do I know
Your electronics manufacturer's question shows that they don't know what they're doing. Hardware and software engineers aren't supposed to even think at the gate level anymore, that's the foundry's job. If neither you or they don't have people who can translate between your algorithm and the foundry's toolchain they should at least know a capable partner who they've worked with successfully in the past.
If they can't even offer you a capable partner then they're probably a lame middleman -- find someone else.
Now that the practical answer is out of the way, let's answer your actual question.
Generally, any algorithm you can describe in a programming language like C is turned into a custom processor when implementing it in an ASIC. The overall steps are:
1) Product requirement analysis (need vs want feature breakdown, cost constraints, timeline, human and capital resource analysis)
2) Based on 1), determine whether a simpler approach such as a FPGA or synthesizable core will work instead of a full custom design. If so, then just have your partner do that.
2) This bears repeating: you really want something like an FPGA or a synthesizable core. There hasn't been a successful business case for hardware acceleration ASICs except in extremely high end boutique markets in recent memory.
3) If you really, really have to do a full custom design and based on 1) you can afford it, break down the algorithm into steps and draw any data dependencies. The longest chain of data dependencies determines your pipeline length. This sketch is the "datapath diagram".
4) Look at all the math operations and turn them into basic logic blocks using your first semester logic design textbook and your foundry's cell library.
5) Figure out the worst case scenario for each logic operation which might be executed simultaneously and build that many of those logic blocks.
6) Now that the datapath and logic units are sketched out, design a control system which uses a hardcoded ROM to attach the various logic units to the buses and each other in the proper sequence to implement your algorithm.
7) Attach a simulated clock to all of this and do a lot of debugging.
8) Synthesize all of this junk using the foundry's toolset and do even more debugging.
9) Eventually the foundry will accept the design (DRC pass) and you can tape it out to them. They will bill you a very large amount of money at this point.
10) You will get a shipment of chips which won't work and there will be even more debugging. At some point you have to send a mask or two back to the mask house to get them to fix something. The mask house will bill you another very large amount of money at this point.
11) Eventually after repeating 9 and 10 several times you'll either run out of money or you'll have a successful chip. The amount of resources poured into the project and smart people you'll be working with will make you laugh hilariously at ever considering posting such a question to slashdot.
Note that at no point were gates ever involved.
Please buy a dictionnary and learn what a patent is, and what is patentable. then you'll understand what people mean by software patent, and why it is evil.
and btw it is only one of the many terrible trends that corrupts the patent system and transform it from a innovation incentive the the exact opposite
You'll end up with a hardware emulation of a software algorithm, which will necessarily be slower and less efficient than the correct answer, which is to design a hardware solution to the original problem.
http://www.plexus.com/solutions/design
That is a common misconception, spread by people who like a certain type of FUD. In fact, what is not patent patentable is "the laws of nature, including those of science and mathematics".
The LAWS of nature. You can't patent gravity, you can patent an elevator. You can't patent refraction, you can patent an acoustic lens. You can't patent the associative property of addition, you can patent a scoring system for detecting bogus reviews.
If you take out "the laws of" and replace it with "anything using", THEN you would end up with "you can't patent anything using nature, including science and math", but that's not the law. The law is that you can't patent the laws of nature, including mathematical LAWS. You can patent things that are scientific, and you can patent things that are mathematical.
If y
you think about it, it makes sense. You can't invent "x + 1 = 1 + x". That's always been true. However, you CAN invent a way of detecting suspicious stock trades. Since that could be a new invention, it could be patented.
> Patents (used to) cover only specific implementations of ideas, not the ideas themselves.
That's a legitimate criticism of many patents. Of course, an implementation IS itself an idea, so we'd need to be a little more specific with our vocabulary in order to really talk about policy. Saying "you shouldn't be able to patent ideas" won't quite get us there. Certainly we can say "goals, objectives, shouldn't be patentable; only METHODS for achieving an objective should be."
Certainly there exist bad patents that are too broad, that cover an objective rather than a method or mechanism. On the other hand, there simply is no such thing as a "software patent" per se. The problem with the patents is that they are over broad. Whether they cover something made of wood, plastic, or magnetized iron particles is irrelevant.
http://en.wikipedia.org/wiki/Turing_machine
nough said
How many (logic) gates would be needed to turn your software algorithm into hardware
Non-nonsensical question... they're very different beasts, processing in very different ways (generalised vs specialised hardware). The answer really means *designing* the whole thing again in hardware. Or just replying with the number of gates in the entire PC.
Haven't done this myself, but you can evidently run Labview programs ("virtual instruments") on some FPGA chips. You'd have a good estimate (plus an actual digital circuit) if you translated your code to labview (I believe the actual language is called "G") and found a copy of the add-on which turns this into verilog. -- Dustin
As a first pass you can estimate adders as 10 gates pr. bit, state as 20 gates pr. bit and multipliers as 10x bits squared (Unless it is by a power of two in which case it is free) If you need to to division in your algorithm you should redesign it. If you use floating point, everything gets huge (Try not to use floating point, remember in hardware you do not need to deal with arbitrary word size restrictions, just scale word sizes to suit the requirements)
Now, figuring out exactly what resources you need, this is where you will get into trouble. Normally you will reuse some (lots) of your arithmetic, but exactly how much depend on what performance/power/gate count target you need to hit. More reuse means less gates but faster clocks (Which can drive you to more gates if you get into trouble on timing closure). The extreme case is software which just reuse a very limited set of ALUs, the other extreme is an unrolled design where algorithmic operation have dedicated hardware, so one iteration takes one clock.
Depending on performance targets the same algorithm can have a factor 1000 difference in gate count.
http://www.c-to-verilog.com/
Most programmers make terrible ASIC designers.
Getting the SIZE or NUMBER of gates is one part of the equation. Computing or evaluating the performance is another.
How many CLOCK cycles does it take to do that operation. Not machine cycles, but CLOCK cycles. That is:
-FETCH, DECODE, EXECUTE, RETIRE, each on of these operations may be more than one CLOCK cycle, depending on data locality, cache hit/miss, and overall architecture.
what is a simple java line, may require hundred, if not thousand of gates, and many many clock cycles.
When designing hardware, in real ASICs, you have essentially NOTHING. No memories [besides bit-wide/deep], no multipliers, no dividers, no floating point, no IEEE compliance, unless you either licence it, or design it yourself.
Then we get into, how rich is the technology library, what kind of gates do you have, what transistor sizes, what speeds...
That depends a whole lot on what kind of hardware you want to use. One way is to implement a universal Turing machine, and give it the code as input. Those can be quite small, and you don't even need access to the algorithm to find the answer.
You're probably looking for a more efficient implementation.
There is a tool that automatically converts executables targeted for specific processors (the ATMEL AT90s8515 and AT90s8535) into custom circuits in Verilog. If your "algorithm" were compiled using AVR Studio to create a .bin, you could use that tool to generate a Verilog circuit, then synthesize it using a synthesis tool (such as the Xilinx synthesis tool) and get an estimate in a matter of minutes. See www.fromorrow.com for details.
I believe the OP is asking the question with an underlying motive that most users aren't grasping - The manufacturer definitely has a way of estimating the gate "cost" from C++, as some experts on the matter have pointed out here, but for that he probably demanded source code, which the OP probably has no safe way of handing over without compromising his Intelectual Property. He doesn't want to lose the business contract or spend money blindly on a consultancy he doesn't even know which's name is, so the question makes FULL SENSE regardless of its child-like semantics.
You can probably bet the manufacturer is based and/or has legal safe-haven in a dodgy country, along the lines of having properties like:
(hint: China) ...This makes the OP think twice about passing source around.
Now, my personal opinion regarding a possible answer is more business-focused - if such a kind of manufacturer is even remotely interested on your "product" as to ask that, then you have a very marketable piece of code on your hands and you need to do the following...
Charles H. Moore wrote it and extended it to be able to compile directly into silicon.
Forth is actually a TIL [Threaded Interpretive Language] but it is so easily extensible that it is possible to implement all the way to the gates.
Moore was working with Forth to do exactly that last I heard.
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
If you application is complex, then you're SOL. You will probably need a state machine, or a complex of functional units coordinated by some kind of programmable unit.
However, for a simple loop, rough rules of thumb:
* count the upper bound of the loop (say N), and the number of arrays used in it. For each array, allocate a RAM of N rows x b bits where b = 64 bits if you need double precision otherwise 32 bits. That gives the RAM size
* count the number of operations that do not involve the loop count. Count addition+subtraction, multiply & divide operations, and integer/float/double operations separately. Ignore any operations where one of the arguments is a constant.That should be enough for a h/w guy to estimate the logic gate count
* Figure out how many iterations of the loop you need to do per second. That will give you the cycle time.
* Figure out the critical path of operations through the loop. Assume integer add/subtracts have wt 1, double precision add subtracts, integer multiplies have wt 3, integer divides have wt 6
and double precision divides have wt 10. Find the longest cascade of operations in the loop. That will tell the h/w guys an idea of whether they can meet the cycle requirement without some radical engineering (roughly speaking if the wt of all operations in the critical path * number of operations per second > 1e9, you'll have to rethink things), and how to size the various compute units (e.g. whether they can use a ripple carry adder, or have to do a look-ahead).
Again, these are VERY rough rules of thumb.
It is more than 30 years ago I learned digital logic from Blakeslee's Digital Design with MSI and LSI. These days I program an Arduino. I have my hands full just reinventing the spokes of a 20 year old wheel.
I think you have a fascinating problem. Suppose you treat your computer program as a black box where you feed pages of data into one end and you get pages of output data out the other end. Suppose you say each page of data is an x,y grid of image values. You could say, your problem is for a central pixel in the image, you want to write a truth table. An initial truth table is the values and locations of the pixels from the immediate preceding image that when always present always result in the specific value of that pixel.
Your image processing process probably uses data from several preceding images to come up with the result. If it takes five preceding images, then the truth table for a single pixel picks up five more blocks of data about the state of the surrounding pixels. No matter how wonderful the computer program may seem to be, it is still a finite state engine. The present state of the image or output (we hypothesize) should be dependent on the some number of previous states of the image.
The process is a classic series dance steps for extracting the essential predecessor logic states. The Blakeslee book models this better than I can remember after all these years. The steps are normalize, simplify, flag all the dont't care states and gracefully conceal or wrap the data to handle the physical edges . When you have one of these cubic things, you go through a simplification process. first, you normalize the input data which means remove the numerical clutter and have a single number. Another feature of the extraction process is you sort the truth table output column and input columns and you try to mark as many of the input columns with does-not-matter as possible. A third thing is, you set a limit on the depth of the input data and that means the possible values for an output point are limited because the permuted possibilities are capped, and that cap is usually an exponent like 2^N.
The resulting gadget will be a truth table that grinds out something like x=1 for a=1, b=0, c=1, d=0 and on and on.
Unlike rewriting the software and putting it on a programmable gate array, This is an approach at writing a state table that produces an approximate pixel based on looking at a chunk of images containing that pixel.
One, two, three - crunch! Three licks to get to the center of a tootsie pop! Wait. Wrong question, never mind.
At one level, you say "implement a Power PC or SPARC core" (see, e.g., the LEON cores from http://wwww.gaisler.com/) and just store the code in ROM (for which there are well defined "bit->gate" equations).
If you're looking at implementing the algorithms in a more "hardwarey" or "data flow" method, then it's not so easy.
For instance, implementing a Costas loop in a sampled data system is VERY different when using discrete multipliers and adders than if you are doing it in C.
Example.. if you're doing a digital filter, pretty much every "sample" there's 4 multiplies and 4 adds (for a second order section IIR filter), but you can optimize things like the coefficient precision, etc. There's also decisions about precision.. how many bits do you carry after a multiply?
So, the filter might be:
y(i) = b(0) * x(i) + b(1) *x(i-1) + b(2) *x(i-2) + a(1) * y(i-1) + a(2) *y(i-2).
Practically speaking, this would usually be done something like
y = b0*x + b1*x1 + b2*x2 + a1*y1 + a2*y2
x2 = x1
x1 = x
y2 = y1
y1 = y
Even if you were implementing it in assembler, there's lots of options on how you address the coefficients and inputs.
But you can imagine schemes with registers and a single multiplier and adder, or with shift registers, or with multiple multipliers and adders, and then throw in multiple precision arithmetic, and whether you do ripple carry in a systolic array or synchronous operations with a Cray multiplier and fast adders. Heck, if you're clever you can probably do it in a half dozen gates with a lot of shift registers, and do the adds and multiplies one bit at a time serially. (See, e.g., the Millman and Taub textbook for examples)
Even though VHDL and Verilog resemble C (particulalry VHDL), don't be deluded into thinking you can take your C code and just synthesize it as gates.
give him the code in x86 assembly :p, most probably if he is into hardware, he should get the information he seeks from this.
If he complains that he doesn't understand because he doesn't code, then tell him that you do not design hardware equally, that is the best thing you can come up with.
If all else fails and that you are willing to pimp your arrogance and ego for money, then write a simple perl parser that will parse your program and replace specifically where there are logical decision to their respective gates,
http://www.chem.uoa.gr/applets/appletgates/Images/Image1.gif
:p am no electronic's person, but to solve any problem, a good background study is needed :p.
:p it might be of some use to someone else [the perl parser to convert a language to logic gates]!
I would also if i was to pimp out my arrogance and ego to gain money and succumb to doing something out of what i like to do to design an algorithm using gate, it would be no different than learning a new language, map out the if, add, sub, mul, (division gets a lil bit tricky), from that you can build anything, once you make this map, use a perl parser to parse your program as mentionned and make it generate it according to the map you made.
This might help also http://www.i-programmer.info/programming/hardware/4626-getting-started-with-digital-logic-logic-gates.html
who knows, if you write it out and open source it out
Here is an example of how you can look at this problem. A disk reader for very early computers was all build in TTL (transistor-transistor-logic) hardware. It took over a dozen chips. The early Apple Computer guys (I think Steve Wozniak was one) did it with 3 chips and a bunch of software. Certainly you can store a program in a ROM, or a fusable link, or a FPGA (field programmable gate array), but if you are talking about implementing an algorithm in hardware, you are talking about computation, specifically calculation. That usually means you will need something like (at least part of) an arithmetic logic unit (ALU). If you need multiple divides, you can build one divide circuit, and re-use it, or if you need multiple divides at once, you will have to build more. Ultimately if you are calculating something, you may need intermediate steps (one result acting as input to the next step) will mean a multi-step, multi-time clocked circuit. There are many ways of calculating anything (tautologies), and many ways of implementing an algorithm in code, and likewise many ways of turning that code into circuits. Its an NxN problem.
Remove your C source from your documentation and comments
( which were written first )
now design your hardware.
Go well
As many pointed out, it's a quite complicated question doesn't have a straightforward answer. I used to worked on RTL implementation of graphic algorithms for years and I can say there can be night and days between different implementations of the same algorithm. Also unspecified is their performance requirement. What kind of input your algorithm is expecting? YUV? RGB? CMYK? What's the expecting throughput? How much memory and I/O bandwidth your algorithm is going to take? How many temporary registers are needed? Do they allow deep pipelining and longer latency? What's the fab process they are going to use? There are far too many variables need to be taken into consideration. Also some seemly minor tweaks can bring major differences on both area and speed. I once helped a friend to optimize a supposedly simple error diffusion pipeline. Took us 3 weeks to shrink original design into one third size of original implementation while improved its performance by 15% at the same time. I would say simply tell them you don't know because you are software people unless they are only interested in synthesiz-able codes. Finding someone to write it in RTL for you can be a much heavier burden then you might expected because it's hard to manage something you don't understand. Chances are you may also need to change your algorithms a bit because some operations aren't feasible in hardware with reasonable cost, and some operations can't be removed or simplified.
Compare this to the following situation: A graphical designer draws a fantastic new GUI for an application (on a piece of paper, even). Then you ask him how many lines or kilobytes of code this will be. And then the designer asks: "Can't I just scan the pictures I've drawn and have a software figure this out?". Sounds riddiculus? Yes, but: This is what you wanted.
To answer the original question: The only realistic estimate is to add all the gates you've got in you computer, and take that as an upper bound. Which is still just an estimate, because implementing an algorithm for real-time in hardware can still increase the gate count by leaps and bounds.
To be able to answer such a question you have to re-implement the algorithm in a Hardware Abstraction Language (HAL) like Verilog or VHDL.
I did this in one of our current systems, where I had to process a stream of data.
First I designed an algorithm in C which took an infile, generated an outfile and measured the "quallity" of the output.
Then I re-implemented the algorithm in VHDL, which looks and "thinks" totally different than the original C source (but still DOES the same).
Only after that one can give a realistic estimate (based on the target system/platform and timing constraints) on gate or cell counts.
There is no way you can deliver a sane estimate. You wrote: "Maybe an operation like 'Add' would require 3 gates while an operation like 'Divide' would need 6 gates? Something along those lines, anyway." An operation like "add" uses about 8 gates (give or take a few) for adding a single bit with carry. Adding 32 bit takes significantly more than 32 times that since you want "carry lookahead". Division (64/32 bit, 32bit result) takes a few dozen gates in addition in order to deliver one bit per cycle. Square roots similarly. If you want division to be faster than that, you have to throw in a lot more gates, several orders of magnitude.
And so forth and so on. You are totally out of your depth here. The only sane answer is "We'd have to talk to hardware people. Do you have suggestions?". There is no way you can learn enough about this to deliver an estimate that can be relied upon.
The quality of some of the responses on Stack Overflow would leave me to believe he'd probably get responses telling to count the conditionals and allow one gate per conditional ;) :p
Somewhere in the history of this question is a conversation in which someone there has asked "Can it be done in hardware?" and someone in your business has said "Yes". You are here because that latter person said "Yes" and not "Perhaps, with work" because "Yes" means you already know the answer to this question.
So find that person who said "Yes" and ask them where the hardware design they said you had is. Then count the gates.
I shall give it a go.
First up, most algorithms can't be directly translated to hardware without either changing them or taking a serious performance hit.
Nearly all widespread algorithms (eg. H264 video) are designed specifically with a hardware implementation in mind, and in fact must usually have elements removed that would produce good results simply because it wouldn't be sensible to implement in hardware.
In particular, in hardware, loops that iterate an unknown number of times are generally not allowed.
Steps to make this estimate would probably be to take your code and 'flatten' it (IE. Rewrite it to avoid all use of pointers, except arrays).
For every variable, figure out how many bits wide it needs to be(IE. What is the smallest and largest possible value). You probably want to convert floating point to fixed point.
Next, to make a lower bound of how many gates would be used if you were to design for minimal gate use, take every add and subtract operation and call them 15 gates per bit. For every multiply call it 5 gates per input bit squared. Don't do division (division can be done as a multiplication by the inverse of a number).
For the upper bound, do the same, but multiply by the number of times each loop goes round. That gives you a design with lots more gates but much higher performance.
For the upper bound finally add on 5 gates for every bit of every variable times the number of lines of your input code. This approximates the d type flip flops for storage in a pipeline. Note that if two lines of code operate on entirely different variables, you can call them the same line as far as this metric goes.
For the lower bound, if you got a value greater than 10000 plus 16 times the number of bytes that your program is compiled plus the ram it allocates to run, it would be more gate efficient to put in a tiny processor and keep your algorithm in a ROM. (Lots of complex algorithms are implemented this way when space is at a premium).
Note that these by the way assume you have the engineering time to 'do it properly'. There are lots of ways of making a considerably bigger design, but with much less design effort.
Check out 'Handel c' for example. Its a one click tool that takes C code and produces horribly inefficient hardware, but it works.
If you algorithm is standard just google "FPGA/ASIC IP provider" (of which there are many, eg H264 etc) and pay the price, your results will be optimal and cheaper than doing it yourself - assuming you never have before.
If your alogirthm is custom then you are either going to make a horrible job of it as you learn HW as you go, it takes years to get optimal and reach the clock speeds, area, and QoR/Test coverage numbers needed for production Si. Alternatively you could hire a HW team who will cost you a pretty penny to get it done, or outsource - also not cheap but it allows for risk reduction.
How many gates isnt just a question of counting "*" and "/" and scaling (although back of napkin will give you a general feel). Practically any C/C++ you write is going to be very sequentially orientated, this results in algorithms that probably have better parallel implementations. Whilst you might now be thinking threads, or separate processes we are talking HW parallelism, which is far more fine grained than threads, than SW parallelism. Further any integers you have may be optimizable to less than 32 (eg 1 bit or 3 bits) thus saving a large amount of HW area. Finally you didnt really say what sort of performance you want - if its 1MHz and 1 word / year of I/O then I suspect you could build some very clever hardware to do it all a few gates but if its 1GHz and 1 Giga Words/sec then area might expand as you will need to duplicate the circuit in parallel. Finally the speed at which you process data will affect latency (time from input to the first correct output) which is often a killer in real-time or other systems (eg. If you put LCD TVs side by side you might notice some are running several seconds behind whats being broadcast - this is because as it received the image some are doing more video processing to clean up the image - latency )
So at this point I would recommend looking into SystemC (based around C++), SystemVerilog (so so ) and then a raft of tools to help you do the job. These tools are called "High Level Synthesis" (HLS) tools and they arent cheap but they do cut down on man hours manually converting algorithms, but you will still need to be able to think extremely low level as bad code results in bad gate count - no matter the language.
I dont want to come over as a shill so I am going to present the 4 main competitors for HLS tools;
1) Calypto Systems, formerly Mentor Graphics' tools before spin-out ( http://calypto.com/en/products/catapult/overview )
2) Forte Desgin systems ( http://www.forteds.com/products/cynthesizer.asp )
3) Cadence C2S ( http://www.cadence.com/products/sd/silicon_compiler/pages/default.aspx )
4) Impulse C ( http://www.impulseaccelerated.com/ ) - this is very reasonably priced but has its limitations.
5) There are some open source things out there, I wouldnt recommend them as they are quite in their infancy.
Disclaimer: I used to work at one of the companies that provides synthesis tools, for >10 years converting C/C++/SystemC to HW quite often for a service fee. I can tell you we never had a design that cost under 30K and most were in the 100s to millions of USD.
4004 is a Turing complete machine and has around 2300 transistors. If you code your algorithm for 4004, put it in a ROM and add a 4004 and some RAM to it you'll have your algorithm in hardware (probably working painfully slow). So the question you were asked makes no sense - you'll need more information to give a meaningful answer. In general the same functionality can be coded in a small number of gates and take more time to run or in a bigger number of gates and take less time.
It seems to me that a company asking software developers what it would take in hardware might possibly not know what they want.
Its highly possible that a small CPU and program on flash ROM solution might be all they really want. Do they really NEED it burned into the hardware?
Q: How many computer programmers does it take to screw in a light bulb?
A: Can't be done. It's a hardware problem.
cancel enter clear 123
111111
There's no simple way to answer your question but these guys ( http://www.spacecodesign.com/ ) are developing tools to do automated hardware/software co-design
A simple, direct answer is impossible, but a quick estimate that can often tell you if it's feasible is how much memory the algorithm uses at any given time. If you must iterate over large sets of data you already have an engineering trade on your hands between external vs internal RAM.
Question you should ask yourself:
What is the dynamic range of the data and how accurate of a solution do you need to produce? Are two decimals places enough?
Is all the math floating point or can it be implemented as fixed point or integer math?
Do you have to perform any complex math operations like sqrts or trig? Can these be implemented by look-up tables?
Do you need external RAM chips?
What is the throughput?
Are there latency requirements?
What are the inputs and outputs? Those eat up gates too and if the I/O is a standard you can get your hands on some IP for an estimated gate count.
Will the ASIC be able to talk to a processor and leverage it for complex math operations? Then you don't have to spec, design, and test some crazy cordic math function to sqrts.
I work on implement Matlab algorithms in HDL and it's VERY hard for us to estimate a gate count from pure Matlab. We try to make the problem at little easier by mimicking the fixed point math of our design in Matlab but that only really gives us an idea on dynamic range, noise performance, etc.
You can't call yourself a software developer with a solid understanding of hardware. To develop software, you should always be thinking in terms of how the hardware is going to handle that software. That being said if you need a gate count then use VHDL or Verilog or another Hardware Descriptor Language. You can't actually convert an abstracted software language like C and up to gates because every single compiler and linker will turn out different end code.
"All of them."
"Uh, we really rely upon doing the best software development we can. We don't do hardware implementations of our products so we're not equipped to answer your question. Was there a critical need for this information, such that we need to open a channel to a third-party provider to answer the question? We would only pass along their raw cost to provide an answer at no profit to us."
Translation: We're damn good at what we do. You're asking a question which could be a route to backdooring us and we won't allow that without your paying for the privilege. If you really want the answer, we'll hire someone to get that answer and bill you for that person's time. (Even though you're asking a question you're not entitled to ask unless you're buying the source code as well, which we'd rape you in cost for.)
Make sure that you can analyze the software properly and break it down to well defined functions/modules/whatever-your-abstraction-is
Implement hardware modules for corresponding inputs/outputs but make sure that you do not use an automatic tool. For hardware, you can often
do things very differently since you have different ways to implement things and may use state machines, memories and logic in different ways.
For each module, check hardware and software implementation against each other using CBMC or some other software which can actually verify that your implementation is, if not correct, at least equally bad as the implementation in the other domain.
(Since I'm posting as a Depressive Cyborg, you might be able to figure out what (SW/HW) made me a Cyborg and what made me Depressive....)
If your algorithm is purely arithmetic, then translate it to primitves (+,-,*,/) and estimate complexity based on simple full adders and flip-flps (bit level). Note that this is a very rough estimation and does not apply easilty to long, data-oriented code, since in that case your interest is with the data storage, not the operations on them (imagine adding +1 to a billion-billion-cell vector of counters).
If your algorithm is mixed-form, then you must know your hardware capabilities and, preferrably, its firmware. if you can transform your flowchart (low level) design to assembly code, then you can lookup the necessary opcodes in some standard IC and estimate (again, roughly) the order of the required IC in your case. For example, if you want to sort some 100-element vector of 16-bit integers, then a few generic x86 opcodes are enough - therefore, even 8086 (or a fraction of it) will do the job.
Generally speaking, the algorithm-to-gates estimation works only on very primitive or streamlined procedures, mostly arithmetic. That is, only when we are speaking about DSP (not CPU) implementations, like for GPUs. In almost any other case, most IC circuitry comes from the corresponding memory/heap modules, I/O, registers, etc, as the "algorithm" will require much more than a simple processing unit.
"Abashed the Devil stood, and felt how awful goodness is..."
http://en.wikipedia.org/wiki/SystemC
Cheers,
dan@3-e.net
I would advise them to first try on an FPGA (Field programmable gate array), and just write the program in Verilog, see how many gates it needs, and then simply select an FPGA from Altera or Xilinx that fits your needs. No need for a full blown ASIC.
From what I understood, they are willing to implement your algorithm on the hardware. So, they just need an estimate on the complexity of the hardware. You can cross-compile your code for a RISC-like ISA as a stand-alone executable (mind the ISA, x86 has too much in it). The number of instructions should correlate with the complexity of the implementation of the hardware. In the end, every instruction is a piece of hardware executed sequentially through a control mechanism.
E.g.: (Static MIPS compilation of your application - A) / (static MIPS compilation of X - B) =~ (hardware of your application - C) / (hardware of X - D)
You can use a few (5 min) applications (like fft, jpeg decoder, e.g.) in place of X to find A, B, C, D and see how it matches. You can use simple regression, too, if you have many data points. Getting many data points should not be very hard as there are many applications implemented in both worlds.
If they are serious get some funding to start coding this in a hardware description language.
Note Well: this is a lot like asking how many x86 instructions a "C" program will take
without writing a "C" program. At best this gets you a starting answer.
If you tell the compiler to kill loop unrolling code shrinks and might run slower.
If loops unroll code grows but might run faster. SIMD instructions the code
can shrink. Now ask if the x86 answer is the same answer you get on a ARM
and a MIPS processor. The other thing to know is data path widths have large
impact -- wider is faster but used more gates -- too wide is slow -- too narrow is
slow.
Invest a couple grand of their money on some large FPGA development kits and
go to work. For the most part graphics hardware is tightly coupled stripped down
common processors and state machines setup to solve specific display problems.
One positive place to work is in the world of CUDA on graphics cards.
CAUTION.... the field is full of patents and going fast on CUDA is dancing with
a hungry bear... Any hardware you build to the same end will likely trip on patents
that others have.
Well written hardware descriptions read a lot like any programming language.
With a second beer in hand you can read down from an X-windows program
all the way town to gates and other hardware library stuff and hardly see a
speed bump.
Going fast in hardware requires clever minds....
And if you cannot build Open-GL and WindowZ graphics on top
of your "C" proof of concept you have a lot of work to do.
Truth is stranger than fiction, but it is because Fiction is obliged to stick to possibilities; Truth isn't. Mark Twain.
I think that C/C++ is much too high a level to look at and say "This if statement requires n logic gates" because that if statement will be implemented in assembly differently on different systems. However, one can look at assembly code and say "Ahah, that is 2 logic gates, and there's three more and that's another 3.". So I think you need to compile your code on a whole bunch of systems, compare the assembly, and use some kind of average of the results to get a rough estimate of how many logic gates.
Gates are quite basic (think and/or/xor/not with just a few inputs and one output)
Consider that arithmetic functions (binary add/multiply/increment/latch for a numeric word sized data) can add up to hundreds and thousands of 'gates'
The fact they are refering to 'gates' instead of offering already predesigned functional blocks is a bit worrying (google Programmable Array Logic )
It might be a lot easier to use a single chip microcontroller. I used to use 8048 and 8051 chips, 40 pin cases with built-in memory, rom and I/O. You can even get ones with EPROM to flash your program in.
It all depends on the speed you need, and if you can learn the language to program it. But it's a lot better than designing gate logic...
Wierdy's last suggestion is my personal favorite. It's really a sliding slope between software and hardware anyway. Does putting a Linux ROM with your algorithm set to autoload as a startup daemon in an x86 machine count as hardware or software? Embedded applications often have something resembling an OS, if not a full blown OS, managing resources. Unless your algorithm is super simple, or this electronics manufacturer is a glutton for punishment, I'd put your algorithm on a ROM alongside some DSP or other processing core and call it a day. Another option to explore that's between the two (all gates and ROM/processor combo) is a PAL/GAL... but it will certainly take some mental gymnastics to get your genetic algorithm into a form appropriate for burning the PAL. Good Luck!