What IT services does KGA provide?

KGA provides comprehensive IT support services including software installation and setup, SaaS system maintenance, application configuration, technical support, digital consulting (including website development), security services, and data management & backup solutions.

What areas do you cover?

Based in Kosai, Shizuoka, we provide remote support nationwide across Japan. On-site support is available primarily in the Tokai region.

Can I consult before signing a contract?

Yes, initial consultation and estimates are completely free. We will listen to your IT challenges and propose the optimal solution.

Is emergency support available?

Yes, the Premium plan includes 24-hour emergency support. The Standard plan also provides priority response during business hours.

Can you set up international TV apps?

Yes, we support the installation and configuration of international TV applications and media players. We help set up environments for legal access to international content.

Do you offer multilingual support?

We support 9 languages: Japanese, English, Portuguese, Korean, Chinese, Malay, Filipino, Vietnamese, and Spanish.

Are there any setup or hidden fees?

No. All prices displayed are final and tax-included. There are no setup fees, hidden charges, or surprise invoices. What you see is exactly what you pay.

Can I change plans later?

Yes. You can upgrade, downgrade, or cancel at any time. Upgrades take effect immediately and we will prorate the difference. Downgrades take effect at the next renewal cycle.

Which payment methods do you accept?

We accept all major credit cards (Visa, Mastercard, JCB, American Express) through Komoju, as well as bank transfers and convenience store payments in Japan. Invoicing is available for Business IT Plan customers.

Do you offer refunds?

Yes. We offer a 14-day money-back guarantee on all annual plans — no questions asked. Monthly Business IT Plan subscriptions can be cancelled at any time with prorated refunds for unused service.

What is the difference between the annual plans and the Business IT Plan?

Annual plans cover app configuration and support for individuals and small teams. The Business IT Plan is a comprehensive monthly subscription for companies that require website development, system management, automation, security, and a dedicated account manager.

Do you provide support in English?

Yes. Our team provides full multilingual support in Japanese, English, Portuguese, Korean, Chinese, Malay, Filipino, Vietnamese, and Spanish — by email, chat, and scheduled video calls.

Chaos Engineering in Production 2026: Gremlin, LitmusChaos, Chaos Mesh, and Japanese Enterprise Game Day Culture — KGA Tech Blog

Chaos Engineering Is "Verification," Not "Destruction"

When Netflix open-sourced Chaos Monkey in the 2010s, chaos engineering was received as a radical practice of randomly breaking production. Its cultural impact was real, but so was the misunderstanding it created: breaking production requires organizational maturity, and the bar was too high for most companies.

The 2026 understanding is clear. Chaos engineering is not "technology for breaking things"; it is a discipline for verifying resilience. Its essence is a scientific experiment that confirms whether a system designed to handle failures actually handles them as intended. Form a hypothesis, run a small test, observe, improve: the loop maps directly onto the PDCA cycle. Framing it as an extension of TPM (Total Productive Maintenance) or FMEA (Failure Mode and Effects Analysis), both widely practiced in Japanese organizations, tends to make adoption much easier to explain.
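The hypothesis-driven loop can be sketched in a few lines of Python. This is a minimal illustration, not any tool's API: the Experiment class, its inject/rollback/measure hooks, and the toy 0.99 threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment framed as a falsifiable hypothesis."""
    hypothesis: str
    inject: Callable[[], None]     # starts the fault (hypothetical hook)
    rollback: Callable[[], None]   # removes the fault (hypothetical hook)
    measure: Callable[[], float]   # returns the observed SLI, e.g. success ratio
    threshold: float               # minimum SLI for the hypothesis to hold


def run(experiment: Experiment) -> dict:
    """Plan -> Do -> Check; the returned record feeds the Act step."""
    baseline = experiment.measure()
    if baseline < experiment.threshold:
        # Never inject faults into a system that is already unhealthy.
        return {"hypothesis": experiment.hypothesis, "status": "skipped",
                "baseline": baseline, "observed": None}
    experiment.inject()
    try:
        observed = experiment.measure()
    finally:
        experiment.rollback()      # always clean up, even if measuring fails
    status = "confirmed" if observed >= experiment.threshold else "refuted"
    return {"hypothesis": experiment.hypothesis, "status": status,
            "baseline": baseline, "observed": observed}


# Toy stand-in for a real system: the fault lowers the success ratio slightly.
state = {"fault": False}
demo = Experiment(
    hypothesis="Losing one replica keeps the success ratio at or above 0.99",
    inject=lambda: state.update(fault=True),
    rollback=lambda: state.update(fault=False),
    measure=lambda: 0.995 if state["fault"] else 0.999,
    threshold=0.99,
)
result = run(demo)
print(result["status"])
```

A "refuted" result is the valuable outcome here: it is the input to the improvement step, not a failed test run.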

Tool Selection: Gremlin / LitmusChaos / Chaos Mesh

As of 2026, the major tools have settled into three camps. Gremlin is the commercial SaaS leader, with "Blast Radius control" and "Halt conditions" that allow safe adoption even without a dedicated SRE team. It ships dozens of standard attacks — network latency, CPU saturation, disk I/O exhaustion, availability zone failures — all triggerable from a web UI in a few clicks. It's strongest in organizations where chaos engineering is required for regulatory compliance, such as finance and telecom.

LitmusChaos is a CNCF Incubating project and Kubernetes-native OSS. Chaos Experiments are defined as CRDs, and community experiments can be imported from the ChaosHub. Its GitOps integration is outstanding: combine it with Argo CD to build a complete pipeline that declares experiments in Git, executes them with Argo Workflows, validates metrics with Prometheus, and rolls back automatically on failure. Best suited for organizations with mature SRE discipline.

Chaos Mesh is a CNCF Incubating project from PingCAP, specializing in fine-grained fault injection on Kubernetes. Its CRDs cover PodChaos, NetworkChaos, IOChaos, TimeChaos, and DNSChaos, making it particularly capable of reproducing low-level failures that other tools struggle with, such as clock skew (TimeChaos) and failed or incorrect DNS resolution (DNSChaos). The Chaos Dashboard UX is polished, and the learning curve is gentler than LitmusChaos.

Kubernetes Experiments: The Three-Tier Model

In Kubernetes environments, organizing experiments into three tiers has become the standard pattern. Tier 1 is the Pod level. Use PodChaos to kill one Pod in a specific Deployment and verify that the Pod is replaced and that HPA and service mesh retry/circuit breaker behavior works as expected. The blast radius is minimal, so this can run daily as an automated job.
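As a rough illustration of a Tier 1 experiment, the snippet below builds a Chaos Mesh PodChaos object as a Python dictionary and prints it as JSON, which kubectl apply accepts as well as YAML. The resource name, the staging namespace, and the app=checkout label are hypothetical placeholders; check the spec fields against the Chaos Mesh version you run.

```python
import json


def pod_kill_manifest(name: str, namespace: str, app_label: str) -> dict:
    """Build a Chaos Mesh PodChaos object that kills one matching Pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",   # kill the Pod; its controller should replace it
            "mode": "one",          # pick exactly one Pod from the matches
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


# Hypothetical target: the "checkout" Deployment in a staging namespace.
manifest = pod_kill_manifest("checkout-pod-kill", "staging", "checkout")
print(json.dumps(manifest, indent=2))
```

Keeping the selector narrow (one namespace, one label) is what keeps the blast radius at a single Pod.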

Tier 2 is the Node level. Drain one node in an AZ (for example with the LitmusChaos node-drain experiment or a plain kubectl drain) and verify that Pod Disruption Budgets and rescheduling function correctly. A weekly cadence is a reasonable target. Tier 3 is the Region/AZ level: use NetworkChaos to partition inter-zone traffic and validate multi-AZ failover. This is game-day scale, run quarterly.
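Before a Tier 2 drain, it helps to reason about what the Pod Disruption Budget will permit. The sketch below assumes a PDB expressed as an integer minAvailable (percentages and maxUnavailable are left out for brevity); the function names and the sample numbers are illustrative only.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions the PDB permits right now (never negative)."""
    return max(0, healthy_pods - min_available)


def evictions_that_must_wait(pods_on_node: int, healthy_pods: int,
                             min_available: int) -> int:
    """Pods on the drained node that cannot be evicted immediately.

    These evictions are retried by the drain until replacement Pods
    become Ready elsewhere and free up budget.
    """
    allowed = disruptions_allowed(healthy_pods, min_available)
    return max(0, pods_on_node - allowed)


# Hypothetical service: 5 healthy replicas, minAvailable=4, 2 replicas on the node.
allowed_now = disruptions_allowed(healthy_pods=5, min_available=4)
waiting = evictions_that_must_wait(pods_on_node=2, healthy_pods=5, min_available=4)
print(allowed_now, waiting)
```

If the waiting count stays above zero for the whole experiment, rescheduling is not keeping up, which is exactly the finding this tier is meant to surface.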

In 2026 production environments, the prevailing pattern is to fully automate tiers 1 and 2 (integrated into CI pipelines) and keep tier 3 as a human-involved game day. The automated portions call LitmusChaos from GitHub Actions or Tekton, with a halt condition attached that automatically stops the experiment when the monitoring system detects an SLO violation.
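A halt condition can be as simple as a guard that checks the SLI on every poll and aborts at the first violation. The following is a minimal sketch under that assumption; run_with_halt, the halt callback, and the sample values are hypothetical, and a real pipeline would read the SLI from its monitoring system instead of a list.

```python
from typing import Callable, Iterable


def run_with_halt(sli_samples: Iterable[float], slo_target: float,
                  halt: Callable[[], None]) -> str:
    """Watch the SLI while a fault is active; halt on the first SLO violation."""
    for checks_done, sli in enumerate(sli_samples, start=1):
        if sli < slo_target:
            halt()  # e.g. delete the chaos resource and notify the on-call
            return f"halted after {checks_done} checks"
    return "completed"


# Simulated availability readings taken while the experiment runs.
halt_calls = []
outcome = run_with_halt(
    sli_samples=[0.9995, 0.9991, 0.9972, 0.9999],
    slo_target=0.999,
    halt=lambda: halt_calls.append("stop"),
)
print(outcome)
```

The guard runs in the same job as the experiment, so a violation stops the fault without waiting for a human to notice.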

Validating Multi-Region Failover

For services running across multiple regions, an annual or biannual "region switchover game day" has become a de-facto required practice. The process has three steps. First, gradually shift DNS traffic from the primary region to the secondary (10% → 50% → 100%), checking for SLI degradation at each step and rolling back if any is found. Second, after the 100% cutover, intentionally network-partition the primary region with NetworkChaos. Third, observe whether the secondary region can sustain the load independently for several hours.
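The staged traffic shift with a rollback guard might look like the sketch below. The set_secondary_weight and sli_ok callbacks are hypothetical stand-ins for a DNS provider client and an SLI query, and rolling back all the way to 0% (rather than to the previous step) is an assumption of this example.

```python
from typing import Callable, Sequence


def staged_cutover(steps: Sequence[int],
                   set_secondary_weight: Callable[[int], None],
                   sli_ok: Callable[[], bool]) -> int:
    """Shift traffic to the secondary region step by step.

    Returns the final percentage served by the secondary region:
    the last step on success, 0 if any step degraded the SLI.
    """
    for percent in steps:
        set_secondary_weight(percent)
        if not sli_ok():
            set_secondary_weight(0)  # roll everything back to the primary
            return 0
    return steps[-1]


# Simulation 1: the SLI degrades once the secondary takes 50% of traffic.
history = []
final_bad = staged_cutover(
    steps=[10, 50, 100],
    set_secondary_weight=history.append,
    sli_ok=lambda: history[-1] < 50,
)

# Simulation 2: the SLI stays healthy at every step.
healthy_history = []
final_good = staged_cutover([10, 50, 100], healthy_history.append, lambda: True)
print(final_bad, history, final_good, healthy_history)
```

In a real game day each step would also hold for a soak period before the SLI check, since DNS caches take time to converge.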

What these game days consistently surface: (1) insufficient capacity in the secondary region (HPA limits set too low because it normally sees low traffic), (2) database replica lag spikes under write-heavy traffic, (3) third-party API regional restrictions (endpoints unreachable from certain regions, or tighter rate limits). These are exactly the kinds of problems you can't anticipate without testing — they would only appear for the first time in a real disaster.
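The first finding, an HPA ceiling sized for quiet-day traffic, can be caught on paper before the game day with simple arithmetic. The sketch below is illustrative only: the traffic figures, the per-Pod throughput, and the 30% headroom are made-up inputs.

```python
import math


def replicas_needed(peak_rps: int, rps_per_pod: int, headroom_pct: int = 30) -> int:
    """Replicas required to serve peak traffic plus a safety margin."""
    return math.ceil(peak_rps * (100 + headroom_pct) / (100 * rps_per_pod))


def hpa_gap(peak_rps: int, rps_per_pod: int, hpa_max_replicas: int,
            headroom_pct: int = 30) -> int:
    """How many replicas short the HPA ceiling is (0 means it is sufficient)."""
    needed = replicas_needed(peak_rps, rps_per_pod, headroom_pct)
    return max(0, needed - hpa_max_replicas)


# Hypothetical secondary region: sized for low traffic, maxReplicas=40.
needed = replicas_needed(peak_rps=12000, rps_per_pod=150)
gap = hpa_gap(peak_rps=12000, rps_per_pod=150, hpa_max_replicas=40)
print(needed, gap)
```

A non-zero gap means the secondary region would hit its autoscaling ceiling before absorbing the full primary load, regardless of how fast it scales.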

Cultural Challenges in Japanese Organizations: "Planned Downtime" vs. "Production Chaos"

The most serious barrier to adopting chaos engineering isn't technical — it's cultural. Japanese organizations have a tradition of "planned downtime" (brief service interruptions announced in advance), but there's strong resistance to intentionally injecting faults into a live running system. "Why would we break production?" is not an unusual reaction from executives.

Three practical tactics help clear this barrier. First, start with "chaos in non-production environments." Run PodChaos against staging daily and build a culture of validating SLOs before release. Being able to say "we're not doing this in production" while accumulating six to twelve months of results is valuable.

Second, when moving to production, use an existing planned maintenance window. Run small-scale chaos during the monthly maintenance window and frame it as part of the planned downtime. This avoids creating a new executive approval process from scratch.

Third, brand the game day as "training." Call it a "failure response drill" or "BCP exercise" and it tends to earn positive reception from quality assurance and internal audit teams rather than resistance. Game days can legitimately serve as evidence of meeting exercise requirements under ISO 22301 (Business Continuity Management) and FISC security standards.

Maturity and Next Steps

The Chaos Engineering Maturity Model (proposed by Casey Rosenthal, 2026 revision) defines five levels from Level 1 (ad hoc experiments) to Level 5 (full production automation, continuously running). Most Japanese companies are currently in the transition from Level 2 (regular staging experiments) to Level 3 (production game days). Reaching Level 4 (automated production experiments) requires a complete set of SLOs, Error Budgets, and observability — specifically, multi-window multi-burn-rate alerting must already be functioning.
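For readers unfamiliar with multi-window multi-burn-rate alerting, here is a minimal sketch. The window pairs and factors (14.4, 6, 1) are the starting values suggested in Google's SRE Workbook for a 30-day SLO; the BurnRule class, the firing function, and the sample error ratios are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    long_window: str
    short_window: str
    factor: float      # burn-rate multiple of the error budget
    severity: str


# Starting values from the SRE Workbook for a 30-day SLO window.
RULES = (
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
)


def firing(error_ratios: dict, slo_target: float) -> list:
    """Return rules whose long AND short windows both exceed the burn threshold."""
    budget = 1.0 - slo_target
    return [
        rule for rule in RULES
        if error_ratios[rule.long_window] > rule.factor * budget
        and error_ratios[rule.short_window] > rule.factor * budget
    ]


# Hypothetical error ratios per lookback window during an incident.
observed = {"5m": 0.02, "30m": 0.012, "1h": 0.016, "6h": 0.004, "3d": 0.0005}
alerts = firing(observed, slo_target=0.999)
print([(rule.long_window, rule.severity) for rule in alerts])
```

Requiring both windows to breach is what keeps the alert from firing on a brief blip and lets it reset quickly once the error rate recovers, which makes it usable as an automated halt signal.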

At KGA IT, we design "chaos engineering adoption roadmaps" in 6–18 month increments as part of our SRE maturity assessments for clients. The order in which you overcome cultural resistance matters more than tool selection, and whether to use Gremlin or LitmusChaos is honestly a later-stage question.
