Occam's Razor — Amazon's Choice That Cut Costs by 90%
In the 14th century, English Franciscan friar William of Ockham said: “Entities should not be multiplied beyond necessity.” This single sentence dominated scientific, philosophical, and statistical thinking for centuries. When two hypotheses explain the same phenomenon, choose the one requiring fewer assumptions. This is Occam’s Razor.
But the people who violate this principle most often are software developers and AI engineers.
Software Design: Over-Investment for the Future
In software engineering, Occam’s Razor was reborn as YAGNI (You Aren’t Gonna Need It). This principle from Extreme Programming (XP) methodology was uncompromising: Don’t build features you don’t need now. Don’t build features you ‘might’ need later. [1]
However, real engineering organizations often moved in the opposite direction. They designed architectures handling a million requests per second for services getting 100 requests per second. They applied microservice architecture to internal tools with 10 users. “Considering scalability” became the most common excuse for over-engineering.
Microservices were a prime example of this phenomenon. When Netflix and Amazon succeeded with microservices, teams with completely different scales and contexts adopted this architecture. Service-to-service communication costs, distributed transaction management, and deployment pipeline complexity grew exponentially. They split problems that one monolith could handle into dozens of services, then spent more time coordinating those services than solving the original problem.
In 2023, ironically, a counterexample came from Amazon — the home of microservices. Amazon Prime Video’s team transitioned their audio/video quality monitoring service architecture from distributed microservices to a single integrated service, cutting infrastructure costs by 90% [2]. The distributed structure’s service-to-service data exchange had become the bottleneck. This was a textbook case where more complex design produced worse results.
Data integration platform Segment had likewise reverted from microservices to a monolith. Running hundreds of destination integrations as independent services had become an unsustainable operational burden [3]. Returning to a simpler structure improved both deployment speed and stability.
The Cost of Abstraction: As Indirection Layers Stack Up
Another trap in software design is abstraction abuse. There’s a saying that “all problems can be solved by adding another layer of indirection,” but the second half that David Wheeler added is less quoted: “except the problem of too many layers of indirection.” When interfaces, factories, adapters, and middleware pile up, readers get lost in a maze of abstraction before they ever reach the actual logic. Debugging takes longer, and onboarding new team members slows down. Abstraction piled on this way hides complexity rather than removing it.
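A toy illustration of the point, with entirely made-up class names: the direct function and the layered version compute the same thing, but the layered version forces a reader through three indirections first.

```python
# Direct version: the logic is visible at the call site.
def total_price(items):
    return sum(price for _, price in items)

# "Enterprise" version: each layer only forwards the call.
class PriceStrategy:
    def compute(self, items):
        return sum(price for _, price in items)

class PricingService:
    def __init__(self, strategy):
        self.strategy = strategy

    def total(self, items):
        return self.strategy.compute(items)

class PricingServiceFactory:
    @staticmethod
    def create():
        return PricingService(PriceStrategy())

cart = [("book", 12.0), ("pen", 3.0)]
# Same result, three extra hops for the reader and the debugger.
assert total_price(cart) == PricingServiceFactory.create().total(cart) == 15.0
```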
AI and Machine Learning: The Temptation of Complex Models
In machine learning, Occam’s Razor has a precise mathematical form: the bias-variance tradeoff. Simple models have high bias but low variance, so they behave stably on new data. Complex models have low bias but high variance, making them prone to overfitting, that is, to learning the noise in the training data.
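The tradeoff shows up even in a toy experiment using only the standard library: a least-squares line (high bias, low variance) against a model that simply memorizes the training set (zero training error, high variance). The data and constants here are illustrative.

```python
import random

random.seed(0)
SIGMA = 1.0  # noise level

def make_data(n_points):
    xs = [random.uniform(0, 10) for _ in range(n_points)]
    ys = [2 * x + random.gauss(0, SIGMA) for x in xs]
    return xs, ys

train_x, train_y = make_data(200)
test_x, test_y = make_data(200)

# Simple model: least-squares line fit to the training data.
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def linear(x):
    return slope * x + intercept

# Complex model: memorize the training set and answer with the nearest
# neighbor. Training error is zero, but it has learned the noise too.
def memorizer(x):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

print("linear    test MSE:", round(mse(linear, test_x, test_y), 2))
print("memorizer test MSE:", round(mse(memorizer, test_x, test_y), 2))
```

On held-out data the memorizer's error lands near twice the noise floor, while the line stays near it: low training error bought at the price of high test error.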
This principle was also formalized in information theory. Solomonoff induction translated Occam’s Razor into the language of algorithmic information theory: when multiple hypotheses explain the data, assign higher prior probability to the hypotheses with shorter algorithmic descriptions [4]. Kolmogorov complexity made ‘simplicity’ measurable. Simplicity became a mathematical principle, not just an intuition.
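A crude sketch of the idea: since Kolmogorov complexity is uncomputable, compressed length is sometimes used as a rough stand-in. The two hypothesis strings and the bits-per-byte prior below are illustrative assumptions, not anything from Solomonoff's paper.

```python
import zlib

def description_length(hypothesis: str) -> int:
    # Compressed size in bytes: a computable (and very rough) proxy
    # for the length of the shortest program producing the hypothesis.
    return len(zlib.compress(hypothesis.encode()))

# Two descriptions of the even numbers; the second adds a redundant clause.
h1 = "y = 2 * n"
h2 = "y = 2 * n, except for n >= 1000, where y = 2 * n + 0"

# Solomonoff-style prior: weight 2^-L for a description of L bits.
prior = {h: 2.0 ** -(8 * description_length(h)) for h in (h1, h2)}
print(description_length(h1), description_length(h2))
```

The shorter description gets the larger prior weight, which is the razor restated as probability.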
Yet the dominant trend in 2020s AI ran directly opposite to this principle. The Scaling Laws paper published in 2020 by OpenAI’s Jared Kaplan and others showed that performance improved predictably as model size, data volume, and compute increased [5]. They observed a power-law relationship: every 10x increase in parameters reduced the loss by a consistent ratio. This discovery justified the AI industry’s arms race of “bigger, more.”
However, Occam’s Razor worked within scaling laws too. The Chinchilla paper published in 2022 by DeepMind’s Jordan Hoffmann and others showed that existing large language models were severely undertrained relative to their parameter count [6]. The 70B-parameter Chinchilla, trained on more data, outperformed the 280B-parameter Gopher. On the MMLU benchmark, Chinchilla achieved 67.5% accuracy, beating Gopher by more than 7 percentage points. Optimally allocating a given compute budget mattered more than growing the model. A smaller model had beaten a larger one.
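The compute-optimal recipe can be sketched with two commonly cited rules of thumb derived from the Chinchilla results (approximations, not the paper's exact fitted constants): training compute C ≈ 6·N·D FLOPs, and an optimal token count of roughly 20 tokens per parameter.

```python
# Rules of thumb (approximate): C ≈ 6 * N * D, and D ≈ 20 * N at the
# compute-optimal point, where N = parameters and D = training tokens.
def compute_optimal(c_flops, tokens_per_param=20.0):
    # C = 6 * N * (ratio * N)  =>  N = sqrt(C / (6 * ratio))
    n_params = (c_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# Roughly Chinchilla's budget: 6 * 70e9 params * 1.4e12 tokens.
params, tokens = compute_optimal(5.88e23)
print(f"params ≈ {params:.2e}, tokens ≈ {tokens:.2e}")
```

Plugging in roughly Chinchilla's compute budget recovers roughly its shape: about 70B parameters and 1.4T tokens, rather than a larger model on less data.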
GPT-4 or Rule-Based Systems?
The most common Occam’s Razor violation in practice happens in tool selection. For customer-inquiry classification tasks where keyword matching and a few lines of regex achieve 90% accuracy, teams often build large language model (LLM) pipelines instead, accepting all the API costs, latency, hallucination risk, and prompt-management complexity.
This pattern is not confined to one problem. Comparative studies in industrial monitoring showed that rule-based systems achieved equal or better performance than data-driven models in environments with clear domain knowledge, at far lower integration and maintenance cost [7]. A few hardcoded rules execute directly on PLCs, while ML models carry the additional costs of data collection pipelines, model update mechanisms, and hardware requirements.
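The "simple first" baseline for ticket classification is a few lines of regex. The labels and patterns below are hypothetical, not any particular production ruleset.

```python
import re

# Hypothetical keyword/regex rules for routing support tickets.
RULES = [
    ("billing",  re.compile(r"\b(invoice|refund|charge[ds]?|payment)\b", re.I)),
    ("shipping", re.compile(r"\b(deliver(y|ed)?|tracking|shipp?ing)\b", re.I)),
    ("account",  re.compile(r"\b(password|log ?in|2fa|locked)\b", re.I)),
]

def classify(text: str) -> str:
    # First matching rule wins; everything else falls through to "other".
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "other"

print(classify("I was charged twice, please refund"))   # billing
print(classify("My tracking number shows no updates"))  # shipping
```

No API calls, no latency budget, no prompt to maintain; if accuracy on real traffic proves insufficient, that is the moment to consider a heavier model.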
Gerd Gigerenzer’s research provided an academic foundation for this intuition. His “Simple Heuristics That Make Us Smart” (1999) systematically documented cases where simple decision rules using only partial information predicted more accurately than complex statistical models [8]. This paradox, where less information systematically produces better predictions, is called the ‘less-is-more’ effect.
Decision Costs: The Paradox of Choice
Occam’s Razor also operates in technology choice. In 1952, British psychologist William Edmund Hick established the relationship between the number of options and reaction time through stimulus-response experiments: decision time increases logarithmically with the number of options [9]. This is Hick’s Law.
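The law is often written as T = a + b·log2(n + 1). The constants a and b below are illustrative placeholders, not fitted values from Hick's experiments.

```python
import math

def hick_reaction_time(n_choices, a=0.2, b=0.15):
    # Hick's Law: T = a + b * log2(n + 1).
    # a, b are illustrative constants; real values come from fitting data.
    return a + b * math.log2(n_choices + 1)

for n in (1, 3, 7, 15):
    print(f"{n:2d} options -> {hick_reaction_time(n):.2f} s")
```

Each doubling of the choice set adds only a constant increment, but the increments never stop accruing as the option list grows.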
Consider choosing a frontend framework to see this law’s power. React, Vue, Svelte, Angular, Solid, Qwik, Astro — the options keep growing, and so does the time spent comparing their pros and cons. Analysis paralysis in technology choice often consumes more time than actual development. In most projects, which framework you choose matters less than choosing quickly and starting implementation.
Database selection was similar. PostgreSQL, MySQL, MongoDB, DynamoDB, CockroachDB, PlanetScale — the list was endless. But PostgreSQL alone sufficed for most early projects. It handled relational data, JSON documents, full-text search, and geospatial queries. Worries about “what if scaling problems arise later?” rarely materialized. Projects successful enough to have scaling problems also had resources for migration when needed.
When Simplicity Isn’t the Answer
Reading this far, it’s easy to conclude “just always choose the simpler option.” But Occam’s Razor never said “the simplest explanation is always right.” It said “don’t increase complexity unnecessarily.” The key is that qualifier: ‘unnecessarily.’
As scaling laws showed, there actually exist problems in AI that require increasing complexity. For tasks like natural language understanding, code generation, and multilingual translation, large models with hundreds of billions of parameters showed performance levels unreachable by rule-based systems. No one claimed autonomous vehicle perception systems could be implemented with if-else statements.
Software architecture is similar. Netflix’s microservices weren’t over-engineering, because the organizational need was real: thousands of engineers deploying independently and simultaneously. According to Conway’s Law, system structure reflects organizational structure. If the organization is sufficiently large and complex, the architecture must be correspondingly complex.
The real problem wasn’t complexity itself but unjustified complexity. Putting Kubernetes on a service used by 10 people, building deep learning pipelines for problems needing 3 classification rules, introducing distributed cache for APIs getting traffic once a week. These decisions violated Occam’s Razor.
How to Wield the Razor
Occam’s Razor is a thinking tool, not a blind rule. Applying it practically is relatively clear.
First, start with the simplest solution meeting current requirements. Adding complexity when needed is always easier than simplifying something built complex from the start.
Second, explicitly calculate costs when adding complexity. Separating new services brings costs of network calls, serialization/deserialization, and failure propagation possibilities. Introducing new abstraction layers increases cognitive load. Check if these costs are smaller than resulting benefits.
Third, “might need it later” is the most dangerous justification. As Ron Jeffries, one of Extreme Programming’s creators, argued about YAGNI: not implementing now saves the cost of features that turn out never to be needed, and when something really is needed, you can implement it better with more context in hand.
Ultimately, Occam’s Razor says this: Complexity isn’t free. Every additional parameter, every additional service, every additional abstraction carries invisible costs. Maintenance costs, cognitive costs, decision costs. Accept complexity only when you have clear reasons to pay those costs.
Simplicity isn’t laziness. It’s the courage to reject unjustified complexity.
Footnotes
1. Beck, K. (2004). “Extreme Programming Explained: Embrace Change” (2nd ed.). Addison-Wesley.
2. Kolny, M. (2023). Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%. Amazon Prime Video Tech Blog. https://www.primevideotech.com/video-streaming/scaling-up-the-prime-video-audio-video-monitoring-service-and-reducing-costs-by-90
3. Segment Engineering Blog (2020). Goodbye Microservices: From 100s of Problem Children to 1 Superstar. InfoQ. https://www.infoq.com/news/2020/04/microservices-back-again/
4. Solomonoff, R. J. (1964). A Formal Theory of Inductive Inference. Information and Control, 7(1), 1–22. https://doi.org/10.1016/S0019-9958(64)90223-2
5. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361. https://arxiv.org/abs/2001.08361
6. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., … & Sifre, L. (2022). Training Compute-Optimal Large Language Models. Advances in Neural Information Processing Systems, 35, 30016–30030. https://arxiv.org/abs/2203.15556
7. Müller, F. et al. (2025). A Comparative Study of Rule-Based and Data-Driven Approaches in Industrial Monitoring. arXiv preprint arXiv:2509.15848. https://arxiv.org/abs/2509.15848
8. Gigerenzer, G., Todd, P. M., & ABC Research Group. (1999). “Simple Heuristics That Make Us Smart”. Oxford University Press.
9. Hick, W. E. (1952). On the Rate of Gain of Information. Quarterly Journal of Experimental Psychology, 4(1), 11–26. https://doi.org/10.1080/17470215208416600