A perceptron is the ancestral form of modern machine learning systems. The artificial neural networks that we use today throughout the world of artificial intelligence — from Deep Learning to image generation to Large Language Models (LLMs) — all derive directly from the perceptron. In fact, modern neural networks are often called "multilayer perceptrons". The perceptron is a very simple algorithm, and understanding it will help you understand how today's extraordinary AIs, like ChatGPT or Midjourney, work on a fundamental level.

Perceptrons were first developed "on paper" by neuroscientist Warren S. McCulloch and mathematician Walter Pitts in 1943. (This was two years before the first true computer, the ENIAC, even came online!) The concept of the perceptron came from observations about the wiring patterns of neurons in human and animal brains — specifically, the Hebbian learning rule, which tells us that neurons that fire simultaneously tend to develop stronger connections to one another. An early AI pioneer, the psychologist Frank Rosenblatt, built the first electronic perceptron in 1958, drawing on an engineering paradigm of corrective feedback signals that came to be called "cybernetics". His invention was an analog machine composed of self-adjusting "neuronal units" — each of which consisted of a motor that turned its own resistance dial to set its connection weight.
As the world's first machine capable of "learning", the perceptron caused a storm of inspiration in both scientific and cultural circles. The ubiquitous depiction of whirring, clacking sentient robots in movies and TV from the 1960s, with electronic brains made of relays and vacuum tubes, was due in no small part to the incredible achievements that AI engineers were demonstrating in the real world. The perceptron could perform tasks that many had considered impossible for a machine, such as recognizing images, reading handwriting, and eventually even diagnosing cancer. Rosenblatt himself famously declared, "[The perceptron is] the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
Unfortunately, this whirlwind of excitement proved premature. By the late 1960s, it was becoming apparent that the perceptron could only learn a very specific class of problems: ones that exhibited a mathematical property called linear separability. This constraint meant that a perceptron could only identify a pattern if that pattern could always be strictly described by the sum of its parts — that is, if no combination of the pattern's individual components constituted an exception to it. For instance, if a perceptron were being trained on lists of food ingredients to recognize the property of "deliciousness", it could learn that chocolate and tuna are each individually delicious, but it could never learn that the combination of chocolate and tuna is anything other than the sum of chocolate's and tuna's individual deliciousness values. (This example showcases the concept of an "exclusive OR", or "XOR", operation, which became the quintessential demonstration of something that a perceptron cannot do.) If a pattern recognition task couldn't be expressed in linearly separable terms, then a perceptron was simply mathematically incapable of ever learning the pattern — no matter how many neurons it had, no matter how densely connected it was, and no matter how many training examples it was shown.
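To make the limitation concrete, here is a minimal sketch in Python (my own illustrative example; the variable names and the 25-epoch cutoff are arbitrary choices) of the classic perceptron learning rule, described later in this article, applied to the four input/output pairs of XOR. The error count oscillates forever instead of reaching zero, because no straight line can separate XOR's positive cases from its negative ones.

```python
# A single-layer perceptron trying (and failing) to learn XOR.
# Illustrative sketch only; names and epoch count are arbitrary.

# Training data: (input pair) -> XOR label
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w1, w2, bias = 0.0, 0.0, 0.0   # connection weights and bias, all starting at zero
lr = 0.1                        # learning rate

for epoch in range(25):
    errors = 0
    for (x1, x2), target in data:
        output = 1 if (w1 * x1 + w2 * x2 + bias) > 0 else 0
        error = target - output          # +1, 0, or -1
        if error != 0:
            errors += 1
            # Perceptron rule: nudge weights toward the correct answer
            w1 += lr * error * x1
            w2 += lr * error * x2
            bias += lr * error
    print(f"epoch {epoch:2d}: {errors} misclassified")
    if errors == 0:
        break   # never happens for XOR: its classes are not linearly separable
```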
In 1969, legendary AI scientist Marvin Minsky, a lifelong rival of Rosenblatt since boyhood, co-authored a highly critical book with mathematician and educator (and eventual philanthropist) Seymour Papert titled Perceptrons: An Introduction to Computational Geometry. This book presented a rigorous and elegant proof of the perceptron's limitations — a proof so straightforward, in fact, that many researchers were shocked and frankly embarrassed that the perceptron's shortcomings weren't obvious to them the whole time.

Frank Rosenblatt working on automated image recognition, attaching a camera system to his Mark 1 Perceptron at Cornell University. (Photo courtesy of Wikimedia Commons.)
Minsky and Papert's book had a profound — arguably even catastrophic — influence on the history of artificial intelligence.
On the one hand, this revelation devastated the AI community. Research into "connectionist" approaches to AI — those based on or inspired by the concepts of neural connectivity from biological brains — practically disappeared overnight. Through the 1970s and early 1980s, partly due to Minsky's advocacy, research instead focused on the development of symbolic artificial intelligence, which included expert systems and automated mathematical theorem-provers. These systems could explicitly mimic human decision-making and possibly even exceed human performance. However, they had little ability to adapt to novel situations, and they could do nothing that their human programmers hadn't explicitly told them to do. During this time, connectionism was generally regarded as naive and perhaps childish — but at the same time, practically nobody believed that expert systems would ever lead to "self-aware" machinery. This period became known as the AI winter, and during it many people, both in industry and in the broader culture, came to believe that the development of a "true" machine intelligence was ultimately impossible.

Marvin Minsky, co-founder of the MIT AI Lab, advocated a radically different approach to AI than Rosenblatt. He and Seymour Papert published Perceptrons in 1969. This book crushed all faith in neural networks for an entire generation of researchers, but it also described the innovations that would need to take place — and ultimately did take place — to bring connectionist AI to fruition.
On the other hand, Minsky and Papert left a glimmer of hope amid this flood of despair. In the book, they posited that, even though individual perceptrons had this limitation of linear separability, combined stacks or layers of perceptrons would not. In a layered configuration, each layer of the machine could be trained to recognize groupings of a pattern's individual components, and then assign connection weights to each grouping independently of the weights of the original components. Per the above example of training a perceptron to learn the property of "deliciousness" in chocolate XOR tuna (that is, in chocolate or tuna but not both), the simultaneous activation of the two inputs representing "chocolate" and "tuna" could trigger the activation of a third implicit or "hidden" unit that represents "chocolate-AND-tuna". This "chocolate-AND-tuna" node could then have a strongly negative connection to the "deliciousness" output — so strong that it completely suppresses whatever positive activation might otherwise come from the sum of the two "chocolate" and "tuna" inputs individually.
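Here is a minimal sketch of that arrangement in Python, with weights I picked by hand purely for illustration (nothing here comes from Minsky and Papert's book): a hidden unit fires only when both ingredients are present, and its strongly negative output weight vetoes the otherwise-positive contributions.

```python
# A hand-wired two-layer network computing "delicious = chocolate XOR tuna".
# The weights below are illustrative values chosen by hand, not learned.

def step(x):
    """Threshold activation: fire (1) if the weighted sum exceeds 0."""
    return 1 if x > 0 else 0

def delicious(chocolate, tuna):
    # Hidden unit: fires only when BOTH inputs are active ("chocolate-AND-tuna").
    both = step(1.0 * chocolate + 1.0 * tuna - 1.5)
    # Output unit: each ingredient alone adds deliciousness (+1.0 each),
    # but the "both" unit suppresses it with a strongly negative weight (-2.0).
    return step(1.0 * chocolate + 1.0 * tuna - 2.0 * both - 0.5)

for c in (0, 1):
    for t in (0, 1):
        print(f"chocolate={c}, tuna={t} -> delicious={delicious(c, t)}")
```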
Unfortunately, the question remained of how exactly to train a multilayered perceptron. Training a single-layer perceptron is easy — it simply involves increasing the weights of the connections from active inputs to the desired output, and decreasing the weights of the connections from active inputs to erroneous outputs. In the multilayered case, the machine first needs to learn that "special-case" or "exception" groupings exist; then it needs to learn which group activations contribute to or detract from which outputs; and, lastly, it needs to learn which inputs should merge into which groupings. If these groupings were hard-wired by human programmers, then the training problem reduced to that of the single-layer case — but it also defeated the purpose of having a machine that could learn patterns on its own. The challenge of how to get a machine to discover and assemble these groupings by itself remained unresolved.
The answer finally came in 1986, when three researchers — psychologists David Rumelhart and Geoffrey Hinton and computer scientist Ronald J. Williams — revealed an algorithm they called "backpropagation". As its name suggests, the backpropagation algorithm involves assigning portions of an error signal, or "blame" for an incorrect result, incrementally across multiple chains of processing units backwards from output to input. This concept was not original to this trio; the operating principles of backpropagation stretch back at least to the work of aeronautics engineer Henry J. Kelley in 1960, and its earliest software implementation to mathematician Seppo Linnainmaa in 1970, with many contributors making many improvements in subsequent years. However, this principle of backpropagation (which previously went by various different unwieldy names, such as "reverse accumulation in automatic differentiation") was intended for use in control theory, primarily in the domain of aerospace engineering — Kelley's original paper, for example, demonstrated how to use the technique to compute an orbital transfer trajectory for a spacecraft flying via solar-sail propulsion. It had little to do with computer science. A few people had discussed the possibility of applying these ideas to the task of training multilayered perceptrons — most notably sociologist Paul Werbos in his 1974 dissertation for his Ph.D. in statistics. But Rumelhart, Hinton, and Williams were the first to experimentally demonstrate it. Computer scientist Yann LeCun, in 1987, produced significant refinements both to the implementation and to the theoretical framework for using backpropagation in this manner. It is from this work that the technology we now call "neural networks" took shape.
If you'd like to see how backpropagation works on a conceptual level, I recommend you start with this excellent video series by 3Blue1Brown (no affiliation). If you have some programming expertise and want to understand how to turn the equations into code, Neptune.ai (no affiliation) has a detailed lesson.
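For a rough taste of what turning the equations into code involves, below is a toy sketch of backpropagation in NumPy (my own example, not the 1986 implementation): a tiny two-layer network learning the XOR pattern that a single layer could not. The hidden-layer size, learning rate, and epoch count are arbitrary choices.

```python
import numpy as np

# A toy backpropagation sketch: a 2-input, 4-hidden-unit, 1-output network learning XOR.

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialized connection weights (the network's "parameters")
W1 = rng.normal(size=(2, 4))   # input  -> hidden
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1))   # hidden -> output
b2 = np.zeros((1, 1))

learning_rate = 1.0
for epoch in range(5000):
    # Forward pass: compute every unit's activity level
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: assign "blame" for the error, from output back toward input
    output_delta = (output - y) * output * (1 - output)
    hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)

    # Gradient-descent updates to every connection weight
    W2 -= learning_rate * hidden.T @ output_delta
    b2 -= learning_rate * output_delta.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ hidden_delta
    b1 -= learning_rate * hidden_delta.sum(axis=0, keepdims=True)

print(np.round(output, 2))   # typically close to [[0], [1], [1], [0]]
```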
Though these developments in the mid to late 1980s breathed new life into the study of neural networks, they didn't end the AI winter. The equations of backpropagation are extremely demanding both for computers and for most humans, requiring the computation of partial derivatives in arbitrarily high-dimensional spaces. Rumelhart et al. had found a solution to the problem of training multilayered perceptrons, but the solution required tremendous computing power in order to do anything useful. Interest in neural networks rose again in the 1990s, only to dwindle back down again in the 2000s when little commercial application could be found due to extremely slow and error-prone real-world performance. Tasks such as real-time speech or image recognition were far beyond the processing capabilities of consumer-grade hardware — and offloading neural network operations to data centers was infeasible because, with broadband still uncommon and cellphones still primarily being voice communication devices with limited data capabilities, the network infrastructure for cloud computing didn't exist at the time. During this period, neural networks were generally regarded as laboratory tools, useful for exploratory analysis and investigation rather than end-goal solutions — for example, a hedge fund might run a large neural network on a supercomputer to identify a trading pattern for some set of stocks, but then build an expert system to actually perform live trades using the pattern. In the apocryphal words of researcher John S. Denker from around 1994, "A neural network is the second best way to solve any problem. The best way is to actually understand the problem."
It's long been known that computing power grows on an exponential trend, so it was arguably inevitable that backpropagation-trained neural networks would become more viable for bigger problems over time. However, two noteworthy developments accelerated the timetable considerably.

The logistic sigmoid function (a function commonly used for neuronal activation prior to 2011), compared to ReLU. (Image courtesy of ResearchGate)
One was the adoption of the ReLU activation function — a tiny technical detail with enormous implications. For multilayered perceptrons to be able to solve linearly inseparable problems, the activity level of each neuron can't simply be the sum of its inputs. The "activation function" is the function that converts the neuron's input summation into an activity level. Since the adoption of backpropagation, neural networks had tended to use activation functions that came from backprop's roots in control theory — smooth sigmoidal and exponential-based curves requiring floating-point computation to many digits of precision. Not only were these functions costly both in memory and in CPU cycles, they were also rather poor at doing their only job: propagating corrective signals iteratively back through a chain of neural layers. In 2011, researchers at the University of Montréal showed that a ridiculously simple formula called the Rectified Linear Unit, or ReLU, solved all of the propagation problems of more conventional activation functions, while also requiring much less precision (i.e. less memory) and being dramatically easier to compute. (The ReLU function is literally just this: "Set the neuron's activity level to the sum of its inputs; but if that sum is negative, floor it at 0.") The ReLU function isn't "mathematically pure" for the purposes of backpropagation (specifically, it has a point, x=0, at which it's nondifferentiable), and as such, even though it had been known about since 1960, it had often been overlooked for use in neural networks. However, the Montréal team's results were undeniable, and the use of ReLU in Deep Learning has been nearly ubiquitous ever since.
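The difference is easy to see in code. The sketch below (an illustration of my own, not taken from the Montréal paper) compares the sigmoid's derivative with ReLU's: the corrective signal a sigmoid passes backwards shrinks toward zero for strongly positive or negative inputs, while ReLU passes it through at full strength for any positive input.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: smooth and bounded, but comparatively costly."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Its derivative shrinks toward 0 for large |z|, so corrective signals fade."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    """ReLU: the sum of the inputs, floored at 0. That's the whole function."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative is 1 for positive input, 0 for negative (undefined at exactly 0)."""
    return 1.0 if z > 0 else 0.0

for z in (-4.0, -1.0, 0.5, 4.0):
    print(f"z={z:+.1f}  sigmoid'={sigmoid_grad(z):.3f}  relu'={relu_grad(z):.1f}")
```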
The other major advancement was a hardware innovation that had been growing for decades, and finally came to fruition in 2007. That was the year that NVIDIA, the company that produces a significant share of the world's graphics cards, released the Compute Unified Device Architecture, or CUDA, a framework for developing general-purpose software to run directly on graphics processors. Driven almost entirely by the demands of the video game industry, Graphics Processing Units, or GPUs, have evolved into astonishingly powerful workhorses for massively parallel mathematical operations. The equations that drive the rendering of visually realistic environments largely describe the rotation, scaling, and translation of points in space, and these equations can be expressed in terms of linear algebra — that is, vector and matrix (or "tensor") multiplication, addition, and element-wise summation ("reduction"). GPUs aren't especially good at general-purpose serial computation, but they excel at executing incredibly large tensor operations at blinding speeds. As it so happens, almost every equation involved in the running and training of neural networks can be expressed as a tensor operation. (In fact, the only part of neural networks that can't be expressed with linear algebra is, as described above, the activation function — which the Montréal team optimized!) Unsurprisingly, the migration of all neural network research to GPUs began almost immediately.
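To make that concrete, here is a small sketch (with arbitrary, illustrative layer sizes) of a single network layer written exactly the way a GPU wants to see it: one matrix multiplication, one addition, and an element-wise activation, applied to an entire batch of inputs at once.

```python
import numpy as np

# One neural-network layer as pure tensor operations.
# Shapes are arbitrary: a batch of 64 examples, 128 inputs, 256 neurons.
batch = np.random.rand(64, 128)        # 64 input vectors
weights = np.random.rand(128, 256)     # connection weights for 256 neurons
biases = np.random.rand(256)

# The entire layer, for the whole batch, in one matrix multiply plus one add:
pre_activation = batch @ weights + biases     # shape (64, 256)

# The only step that isn't linear algebra: the activation function (here, ReLU)
activations = np.maximum(pre_activation, 0.0)

print(activations.shape)   # (64, 256)
```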
Unfortunately, CUDA also allowed massive parallelization for another computationally expensive task: cryptography. And with Bitcoin launching in 2009 followed by Ethereum in 2015, it wasn't long before the world's supply of GPUs was being primarily directed towards crypto mining — not gaming, and certainly not AI research.

Counting from the day it launched, ChatGPT reached a million registered users in five days. The next-fastest mass adoption was Instagram's, which took 2.5 months to reach the one-million-user mark. (Image courtesy of Statista)
Though the crypto market has crashed, computing technology is still not at the point at which individual consumers can afford to run their own Large Language Model rigs. The unprecedented popularity of ChatGPT — jumping to a million users in a mere five days — is due primarily to the uncannily (some would say terrifyingly) high quality of its output. And that quality, in turn, is due primarily to the sheer size of its neural network components. Its famed 175 billion neural connection weights (called "parameters" for reasons of mathematical nomenclature) need over 700 GB of RAM just to load into memory. One analyst estimates that the servers running each instance of ChatGPT probably have between 5 and 8 GPUs apiece, which would cost a private consumer tens of thousands of dollars — achievable, but only for either the very dedicated or the very wealthy (these kinds of servers are sold on the kinds of websites that don't list prices). Even then, this would only permit real-time interaction with an already-trained model. The training process for GPT-3 involved storing and processing over 45 terabytes of text (a volume of data that would take over a month of continuous downloading to transfer over the average home broadband connection) and burning as much electricity as an average American home uses in the course of 18 years (190,000 kWh). The day when the average consumer can carry around their very own personal learning-enabled GPT instance in their pockets isn't around the corner.
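For the curious, the 700 GB figure falls straight out of simple arithmetic, assuming each parameter is stored as a standard 32-bit (4-byte) floating-point number:

```python
parameters = 175_000_000_000      # GPT-3's reported parameter count
bytes_per_parameter = 4           # assuming standard 32-bit floats
gigabytes = parameters * bytes_per_parameter / 1_000_000_000
print(f"{gigabytes:.0f} GB")      # 700 GB, before counting any working memory
```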
But it's not far off, either.
EDIT: In the time it took for me to publish this website, much of the above paragraph has been rendered obsolete. OpenAI released GPT-4, which allegedly is slightly bigger than GPT-3. And Meta, the company that owns Facebook, has produced LLaMa, a language model that is comparably performant to GPT but that can be made to run on a desktop workstation. We are living in very interesting times.
The feedforward multilayer perceptron design, with backpropagation as its training algorithm, remains the overwhelmingly dominant neural network architecture to this day, and its capabilities continue to improve at an exponential rate. As you read these words, somewhere at this moment there are hardware engineers using neural networks to design the next generation of computing technology, and software developers using ChatGPT to help them code the next breakthrough in machine learning algorithms. Software toolkits such as Google's TensorFlow library and rentable cloud computing environments such as Amazon's AWS SageMaker allow aspiring Machine Learning programmers to quickly and easily get started with building their own neural networks without needing a prerequisite doctorate, thus bringing millions of new creative minds to the field of AI research. Frank Rosenblatt's prophecy that his analog contraption would be the "embryo" of sentient machinery may yet come to fruition — and soon.
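As a closing illustration of how low that barrier to entry has become, here is a rough sketch of what "getting started" can look like today: a few lines of TensorFlow's Keras API building and training a small multilayer perceptron on the very XOR problem that once stymied the field. (The layer sizes and training settings are arbitrary illustrative choices.)

```python
import numpy as np
import tensorflow as tf

# The XOR problem that stumped single-layer perceptrons
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# A small multilayer perceptron: one ReLU hidden layer, one sigmoid output
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),
              loss="binary_crossentropy")

# Backpropagation, ReLU, and the tensor operations all happen under the hood
model.fit(X, y, epochs=500, verbose=0)
print(model.predict(X, verbose=0).round(2))   # approaches [[0], [1], [1], [0]]
```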