

Setting Up Grafana + Slack Alerts

 

🙋 Introduction

This post shows how to use the monitoring stack built in the previous post (Prometheus + Loki + Grafana) to receive real-time Slack alerts when something goes wrong.

This post covers the following:

  • Slack Webhook integration (Contact Point setup)
  • Server-down alert (Prometheus Alert)
  • ERROR log alert (Loki Alert)
  • Other alerts worth adding

🔔 1. Slack Alert Setup

Set the conditions in Grafana's Alert Rules and register the Slack Webhook URL under Contact Points to start receiving alerts.

Getting a Webhook URL from Slack

Open the Slack Marketplace

Search for incoming webhook

Click Add to Slack

Select the channel that should receive the alerts

Copy the Webhook URL for later

Slack Contact Point Setup

In Grafana > Alerting > Contact Points > Add contact point, select Slack and enter the Webhook URL copied above.

Under Alert rules > New alert rule > 5. Configure notifications, select the contact point you added, and alerts are delivered to that channel.


Server-Down Alert (Prometheus Alert)

When Prometheus's up metric drops to 0 (the server is not responding), an alert is sent to Slack.

Configure it as follows in Grafana > Alerting > Alert Rules > New alert rule.

up{instance="<prod server IP>"}
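
The rule fires on the value of this query, presumably with a threshold condition such as "is below 1". As a rough illustration of that check, the sketch below applies the same test to a response shaped like the Prometheus HTTP API's instant-query result. The sample response, instance addresses, and function name are made up for the example.

```python
import json


def down_instances(query_response: dict) -> list:
    """Return the instance labels whose `up` sample equals 0."""
    results = query_response["data"]["result"]
    return [
        series["metric"].get("instance", "<unknown>")
        for series in results
        # Each instant-vector sample is [timestamp, "value as string"].
        if float(series["value"][1]) == 0
    ]


# Hypothetical response body for GET /api/v1/query?query=up
sample = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "instance": "10.0.0.1:8080", "job": "api"},
       "value": [1700000000, "1"]},
      {"metric": {"__name__": "up", "instance": "10.0.0.2:8080", "job": "api"},
       "value": [1700000000, "0"]}
    ]
  }
}
""")

print(down_instances(sample))  # ['10.0.0.2:8080']
```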

ERROR Log Alert (Loki Alert)

An alert is sent whenever the application emits an ERROR log.

With the previous error alert it was hard to identify the exact cause from the message alone, so I customized the alert message body to improve it.

 

LogQL query for the custom alert message

count_over_time(
  {service_name="goody-api"} 
  | json 
  | detected_level="ERROR" 
  | line_format "{{ .attributes_exception_stacktrace }}"
  | regexp "(?m)^\\s*at\\s+(?P<error_location>org\\.re\\.goody\\.[^\\r\\n]+)"
  | label_format 
      exception_message="{{ .attributes_exception_message }}",
      error_location="{{ .error_location }}",
      exception_type="{{ .attributes_exception_type }}",
      body="{{ .body }}"
  | __error__=""
[5m])

 

Most system errors in this project are wrapped in AppException or InfraException, so the exception.type field alone was not enough to pinpoint the actual cause. To compensate, I used a regular expression to parse exception.stacktrace for the location inside the project's own code (frames starting with org.re.goody) and extracted it as the error_location label.

{
	"body":"[Unexpected Error] Exception details",
	"traceid":"d2ef8eb328173570f6f016b0dd",
	"spanid":"7b65517dc9a",
	"severity":"ERROR",
	"flags":3,
	"attributes":{
		"exception.message":"No static resource auth/refresh.",
		"exception.stacktrace":"org.springframework.web.servlet.resource.NoResourceFoundException: No static resource auth/refresh.\n\tat org.springframework.web.servlet.resource.ResourceHttpRequestHandler.handleRequest(ResourceHttpRequestHandler.java:585)\n\tat org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter.handle(HttpRequestHandlerAdapter.java:52)\n\tat org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherSe...",
		"exception.type":"org.springframework.web.servlet.resource.NoResourceFoundException"
	},
	"instrumentation_scope":{"name":"org.re.goody.common.exception.ErrorLoggingService"}
}
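
The stack-trace regex can be checked outside Grafana before it goes into the query. The sketch below uses the same pattern in Python, whose `re` module accepts the same `(?m)` flag and `(?P<name>...)` group syntax. The sample stack trace, including its org.re.goody frames, is invented for illustration, and the function name is hypothetical.

```python
import re
from typing import Optional

# Same pattern as the LogQL regexp stage, written as a Python raw string.
ERROR_LOCATION = re.compile(
    r"(?m)^\s*at\s+(?P<error_location>org\.re\.goody\.[^\r\n]+)"
)


def extract_error_location(stacktrace: str) -> Optional[str]:
    """Return the first stack frame that belongs to org.re.goody, if any."""
    match = ERROR_LOCATION.search(stacktrace)
    return match.group("error_location") if match else None


# Hypothetical stack trace; class names and line numbers are made up.
sample = (
    "org.re.goody.common.exception.AppException: wrapped\n"
    "\tat org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1)\n"
    "\tat org.re.goody.order.OrderService.create(OrderService.java:42)\n"
    "\tat org.re.goody.order.OrderController.create(OrderController.java:17)\n"
)

print(extract_error_location(sample))
# org.re.goody.order.OrderService.create(OrderService.java:42)
```

Only the first matching frame is returned, so the label points at the innermost project frame. A stack trace with no org.re.goody frame yields no match.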

 

To customize the alert message further, check the attributes in the log's attributes field (for example exception.message, exception.stacktrace, and exception.type) and extract only the ones you need with label_format.


To use the extracted fields in the alert message, the message template itself has to be customized.

Customizing the Alert Message Title and Body

In Grafana > Alerting > Contact Points, choose edit on the contact point you want to customize.

Open Optional Slack settings to customize the alert message.

I left everything else at its defaults and set only Title and Text Body.

Title

🚨 [FIRING] {{ .CommonLabels.alertname }} ( {{ .CommonLabels.exception_message }} )

Text Body

*Error summary*
{{ .CommonLabels.body }}

*Error cause*
- Message: {{ .CommonLabels.exception_message }}
- Type: {{ .CommonLabels.exception_type }}

*Error location*
{{ .CommonLabels.error_location }}

*Trace*
- TraceId: `{{ .CommonLabels.traceid }}`

🔍 *Links*
- Logs: <{{ .CommonAnnotations.logs_url }}|Open>
- Trace: <{{ .CommonAnnotations.trace_url }}|Open>

body is the message string the application wrote when logging. Its value depends on the project's logging configuration.
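
One thing to keep in mind with this template: as far as I understand, .CommonLabels holds only the labels shared by every alert in the notification. If several different errors are grouped into a single Slack message, fields such as exception_message may come out empty. The sketch below illustrates that behaviour with plain dictionaries. It is not Grafana's implementation, and the label values are made up.

```python
def common_labels(alerts: list) -> dict:
    """Keep only the label pairs that every alert in the group shares."""
    if not alerts:
        return {}
    shared = dict(alerts[0])
    for labels in alerts[1:]:
        shared = {k: v for k, v in shared.items() if labels.get(k) == v}
    return shared


# Two hypothetical alerts from the same rule, grouped into one notification.
grouped = [
    {"alertname": "ErrorLog", "exception_message": "No static resource auth/refresh."},
    {"alertname": "ErrorLog", "exception_message": "Connection timed out"},
]

print(common_labels(grouped))      # {'alertname': 'ErrorLog'}
print(common_labels(grouped[:1]))  # both labels survive for a single alert
```

If that happens in practice, iterating over the individual alerts in the template (`{{ range .Alerts }}` with `.Labels`) instead of relying on .CommonLabels is one way around it.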

💡 Alerts to Add Next

Currently only two alerts are configured: server down and ERROR logs. I plan to add the alerts below one by one.

| Alert | Example query | Recommended threshold |
| --- | --- | --- |
| Excessive JVM heap usage | jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} | 80% or more, sustained for 5 minutes |
| HTTP 5xx error spike | rate(http_server_requests_seconds_count{status=~"5.."}[5m]) | 10 or more per minute |
| HTTP response time (P99) exceeded | histogram_quantile(0.99, rate(http_server_requests_seconds_bucket[5m])) | 3 seconds or more |
| Excessive GC frequency | rate(jvm_gc_pause_seconds_count[5m]) | 5 or more per minute |
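
A note on units when turning these into rules: PromQL's rate() returns a per-second average, while the thresholds for the 5xx and GC alerts are stated per minute. Either multiply the query by 60 or compare against the threshold divided by 60. The sketch below shows the second option; the function name is illustrative.

```python
def per_minute_to_rate_threshold(events_per_minute: float) -> float:
    """Convert a per-minute threshold to the per-second value rate() reports."""
    return events_per_minute / 60.0


# Thresholds from the table above, expressed in rate() units.
http_5xx_threshold = per_minute_to_rate_threshold(10)  # roughly 0.167 per second
gc_threshold = per_minute_to_rate_threshold(5)         # roughly 0.083 per second

print(round(http_5xx_threshold, 3), round(gc_threshold, 3))
```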