The InfluxDB Out of Memory issue can be tackled in multiple stages, applied one after another. This article presents the solutions that I’ve tried to resolve it.
How to confirm that InfluxDB is getting killed because of OOM? You can execute dmesg -T | grep influx to find that out:
$ dmesg -T | grep influx
[Mon Feb 14 00:19:48 2020] [***] ***** ****** ******** ******** ***** * influxd
[Mon Feb 14 00:19:48 2020] Out of memory: Kill process 15453 (influxd) score 720 or sacrifice child
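On systems with systemd, the kernel log can also be checked through journalctl; this is just an alternative way to confirm the same OOM kill (the exact message wording varies between kernel versions):
$ journalctl -k | grep -i influxd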
The very first occurrence of this issue was due to the default InfluxDB configuration for index-version.
# The type of shard index to use for new shards. The default is an in-memory index that is
# recreated at startup. A value of "tsi1" will use a disk based index that supports higher
# cardinality datasets.
index-version = "inmem"
The default value of index-version is inmem, and it makes InfluxDB keep the shard index in memory. If you’re getting the Out of Memory error and the value of this config is inmem, then the issue may be resolved by changing this configuration from inmem to tsi1.
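For reference, here’s a minimal sketch of the change and the shard conversion that usually has to follow it. The config path and the service name are the common package defaults (assumptions, adjust for your setup), and existing inmem shards are not rewritten automatically, which is why influx_inspect buildtsi is run while influxd is stopped:
# in the configuration file (commonly /etc/influxdb/influxdb.conf), under [data]
index-version = "tsi1"
$ sudo systemctl stop influxdb
# convert the existing shards to the disk-based tsi1 index
$ sudo -u influxdb influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal
$ sudo systemctl start influxdb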
If you’ve already changed the index-version to tsi1 and are still facing the issue, then it’s most probably due to the high cardinality of the series and tag values present inside the InfluxDB server. Below, I’ll present the methods that I’ve used to get information about the data that is causing the Out of Memory for the InfluxDB process.
The first step is to find the database where the problem lies, and then the measurement and the tags present inside that database. I used the influx_inspect utility to get a report for each of the databases that are using tsi1 as their index version. This utility is present at the same location as the influxd binary, and the location where InfluxDB stores the data is present in the configuration file:
[data]
# The directory where the TSM storage engine stores TSM files.
dir = "/var/lib/influxdb/data"
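If you’re not sure which configuration your influxd is running with, the effective configuration can be printed and filtered; the file path here is only the common package default, so treat it as an assumption:
$ influxd config | grep -A 3 "\[data\]"
# or look directly into the file, commonly /etc/influxdb/influxdb.conf
$ grep -A 3 "\[data\]" /etc/influxdb/influxdb.conf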
Navigate to the data directory:
$ cd /var/lib/influxdb/data
$ ls -ltr
total 16
drwxrwxrwx 1 x x 1738 Feb 14 18:19 location_data/
drwxrwxrwx 1 x x 4096 Feb 14 18:19 usage_metric/
drwxrwxrwx 1 x x 4096 Feb 14 18:19 pg_metric/
...
...
Then you can execute the command influx_inspect with the sub-command reporttsi to generate a report of the tsi index.
$ influx_inspect reporttsi -db-path location_data
Link to the Influx Inspect disk utility.
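Since reporttsi works on one database directory at a time, a small shell loop run from inside the data directory can produce a report per database; this is only a sketch, and the output paths are arbitrary:
$ for db in */; do influx_inspect reporttsi -db-path "$db" > "/tmp/reporttsi_${db%/}.txt"; done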
The report that it generates has two parts, one for all the measurements inside the database and another for all the shards in the database.
The numbers that you’re seeing here are dummy; I’ve just created them for this page.
This is the first part:
Summary
Database Path: /var/lib/influxdb/data/location_data/
Cardinality (exact): 437993
Measurement Cardinality (exact)
"india" 218996
"srilanka" 54749
"thailand" 27374
"nepal" 24333
"bhutan" 3041
So, we find here that the measurement india has the highest contribution, around 50%, to the total cardinality. That makes it the first suspect.
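If you prefer to cross-check these numbers without walking the data directory, InfluxQL can report the cardinality directly; the EXACT variants count precisely but can be expensive on large databases, so treat this as an optional sanity check:
$ influx -database=location_data -execute="SHOW SERIES EXACT CARDINALITY"
$ influx -database=location_data -execute="SHOW SERIES EXACT CARDINALITY FROM india"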
In the second part of the report, you can find the growth in the cardinality on a per-shard basis:
This is the first shard from the report:
===============
Shard ID: 1537
Path: /var/lib/influxdb/data/location_data/1537
Cardinality (exact): 1080
Measurement Cardinality (exact)
"india" 579
"srilanka" 233
"thailand" 135
"nepal" 124
"bhutan" 9
===============
And the next one:
===============
Shard ID: 1610
Path: /var/lib/influxdb/data/location_data/1610
Cardinality (exact): 2159
Measurement Cardinality (exact)
"india" 1181
"srilanka" 312
"thailand" 253
"nepal" 247
"bhutan" 166
===============
You’ll find here that the cardinality has increased for all the measurements, but for india it has roughly doubled. Now, let’s look into the actual tag values of the measurement india that are contributing to the high series cardinality. We’ll use the influx binary/client to get the numbers. The first query retrieves all the tag keys from the measurement.
$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag keys from india"
name: india
tagKey
------
city
state
zip
We have 3 tag keys here: city, state and zip.
Now, I’ll execute a query to find the unique values for each of the tags. For readability, I’ll save the output in a file.
$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag values from india with key = \"city\"" > cities.txt
$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag values from india with key = \"state\"" > states.txt
$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag values from india with key = \"zip\"" > zips.txt
Now, let’s look into the count of the unique tag values for each of the tags:
$ cat cities.txt | wc -l
614
$ cat states.txt | wc -l
29
$ cat zips.txt | wc -l
889
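Note that wc -l also counts the few header lines that influx prints, so the numbers above are slightly inflated. If you want exact counts, InfluxQL (1.4+) has a cardinality query for tag values as well; shown here for the city tag only:
$ influx -database=location_data -execute="SHOW TAG VALUES EXACT CARDINALITY FROM india WITH KEY = \"city\""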
The output that you get will be different from this, but you will definitely find at least one tag that contains more data than expected. For this sample data, it’s the city tag, because it contains not only the name of the city but also the name of the town, like:
$ tail -n 5 cities.txt
Delhi, New Delhi
Andheri, Mumbai
Goregaon, Mumbai
Jogeshwari, Mumbai
Juhu, Mumbai
Now that you’ve found the issue, the next item to pick up is the approach to solve it. In my case, the solution was very easy, as I just had to delete these series (a sketch of that follows below). In another scenario, I had to downsample the data older than 15 days: that way I kept the high-fidelity data only for a very recent period (the last 15 days), and for the older period every 1 hour of data was converted into a single InfluxDB point (see the sketch after the link below).
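For the first approach, deleting the offending series, InfluxQL’s DROP SERIES can be used. The tag value below is just one of the dummy entries from the sample above; in practice you’d script this over the list in cities.txt:
$ influx -database=location_data -execute="DROP SERIES FROM \"india\" WHERE \"city\" = 'Juhu, Mumbai'"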
Here’s a link to the InfluxDB downsampling guide: Downsample and retain data | InfluxDB
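For the downsampling approach, a retention policy plus a continuous query is the usual combination. The sketch below uses assumed names (raw_15d, downsampled, india_1h): raw points expire after 15 days, while the continuous query keeps writing 1-hour means into a longer-lived retention policy. Note that a continuous query only covers new data; older data has to be backfilled once with a manual SELECT ... INTO query.
$ influx -execute="CREATE RETENTION POLICY \"raw_15d\" ON \"location_data\" DURATION 15d REPLICATION 1 DEFAULT"
$ influx -execute="CREATE RETENTION POLICY \"downsampled\" ON \"location_data\" DURATION 52w REPLICATION 1"
$ influx -execute="CREATE CONTINUOUS QUERY \"cq_india_1h\" ON \"location_data\" BEGIN SELECT mean(*) INTO \"location_data\".\"downsampled\".\"india_1h\" FROM \"india\" GROUP BY time(1h), * END"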