Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[receiver/oracledb] discrepancy between the metric values in DBA_TABLESPACE_USAGE_METRICS and the ones calculated by the receiver #31451

Closed
dloucasfx opened this issue Feb 27, 2024 · 4 comments · Fixed by #32028
Labels
bug Something isn't working receiver/oracledb

Comments

@dloucasfx
Copy link
Contributor

dloucasfx commented Feb 27, 2024

Component(s)

receiver/oracledb

What happened?

Description

The oracledb receiver uses the below [queries] respectively (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/oracledbreceiver/scraper.go#L41-L42) to set values for oracledb.tablespace_size.limit and oracledb.tablespace_size.usage in bytes

	tablespaceUsageSQL      = "select TABLESPACE_NAME, BYTES from DBA_DATA_FILES"
	tablespaceMaxSpaceSQL   = "select TABLESPACE_NAME, (BLOCK_SIZE*MAX_EXTENTS) AS VALUE FROM DBA_TABLESPACES"

Our backend uses those metrics to calculate the % used tablespace (oracledb.tablespace_size.usage/oracledb.tablespace_size.limit)*100

However, when we look at the values in table DBA_TABLESPACE_USAGE_METRICS https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/DBA_TABLESPACE_USAGE_METRICS.html#GUID-FE479528-BB37-4B55-92CF-9EC19EDF4F46 , the percentage used and the table size/used space, although in data blocks, still do not match the receiver's datapoints when converted.

Steps to Reproduce

enable the oracledb receiver and compare (oracledb.tablespace_size.usage/oracledb.tablespace_size.limit)*100 to USED_PERCENT from DBA_TABLESPACE_USAGE_METRICS

Expected Result

Both values should match

Actual Result

Different values

Collector version

0.93.0

@dloucasfx dloucasfx added bug Something isn't working needs triage New item requiring triage labels Feb 27, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Copy link
Member

crobert-1 commented Feb 27, 2024

It looks like there's at least one problem here, and then I have a clarifying question to see if there's another.

  1. oracledb.tablespace_size.limit = BLOCK_SIZE * MAX_EXTENTS looks wrong. From documentation:
At the finest level of granularity, Oracle stores data in data blocks (also called logical blocks, Oracle blocks, or pages). One data block corresponds to a specific number of bytes of physical database space on disk.

The next level of logical database space is called an extent. An extent is a specific number of contiguous data blocks allocated for storing a specific type of information.

The existing calculation is missing logic to account for how many data blocks are in each extent. We could use INITIAL_EXTENT and NEXT_EXTENT with MAX_EXTENTS together instead of BLOCK_SIZE, but I'm not sure if this is a definitive way to calculate this value.
2. While using oracledb.tablespace_size.usage in a calculation, do you know if its datapoints are summed together within the metric for this calculation? If not, it would be the wrong usage value as it needs to be the sum of every data file's used bytes within the tablespace, in case there are multiple data files within the tablespace. The metric has a datapoint for each tablespace, with the tablespace_name attribute.

We could also expose USED_PERCENT for each tablespace name if that's preferable, so the user can just opt in instead of doing their own backend calculation. It should be a pretty simple enhancement to just add another metric. However, we still need to fix incorrect metrics, either way.

@crobert-1
Copy link
Member

Background investigation

Doing some local testing with DBA_TABLESPACE_USAGE_METRICS, here are some interesting results using the SYSTEM tablespace.

DBA_TABLESPACE_USAGE_METRICS: used size: 109880, tablespace size: 3882269, used percent: 2.83030361883733455873356534542042295369. Note: Values for size are in blocks

Block size is 8192 bytes for each tablespace in my instance.

Findings on oracledb.tablespace_size.limit

This is the metric data for oracldb.tablespace_size.limit for the SYSTEM tablespace.

Metric #19
Descriptor:
     -> Name: oracledb.tablespace_size.limit
     -> Description: Maximum size of tablespace in bytes, -1 if unlimited.
     -> Unit: By
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> tablespace_name: Str(SYSTEM)
StartTimestamp: 2024-03-27 21:21:11.040735 +0000 UTC
Timestamp: 2024-03-27 21:21:12.103002 +0000 UTC
Value: 17592186019840

This value is in bytes, but converted to blocks it's 2147483645 blocks.

This means the metric we're calculating and sending as the limit is 553 times larger than what's being reported by DBA_TABLESPACE_USAGE_METRICS.

Findings on oracledb.tablespace_size.usage:

Metric #20
Descriptor:
     -> Name: oracledb.tablespace_size.usage
     -> Description: Used tablespace in bytes.
     -> Unit: By
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> tablespace_name: Str(SYSTEM)
StartTimestamp: 2024-03-27 21:21:11.040735 +0000 UTC
Timestamp: 2024-03-27 21:21:12.09212 +0000 UTC
Value: 917504000

This value is in bytes, so converting it to blocks we get the following:

917504000 / 8192 = 112000 blocks

The DBA_TABLESPACE_USAGE_METRICS reported value is 109880, so this is relatively close, IMO.

Proposed solution

I think the best solution here would be to rely on DBA_TABLESPACE_USAGE_METRICS as the definitive source for usage variables, then use BLOCK_SIZE from DBA_TABLESPACES to convert block counts to bytes for these metrics. This will result in a slight loss of granularity (as bytes will always be multiples of 8192) but I think this is acceptable. I think this is preferential over our current solution as queries needing to account for Oracle DB's inner architecture that changes based on instance doesn't seem like a stable and correct long term approach (as shown by this issue).

@crobert-1
Copy link
Member

I realize that this may just be a discrepancy between "maximum possible size of tablespace" and "tablespace size" to change how we're calculating oracledb.tablespace_size.limit, but I think the proposed alternative is a more useable value.

dmitryax pushed a commit that referenced this issue Mar 28, 2024
)

**Description:** 
Values were being scraped incorrectly for the metrics
`oracledb.tablespace_size.limit` and `oracledb.tablespace_size.usage`.
The changes these metrics to be scraped from the
[`DBA_TABLESPACE_USAGE_METRICS`](https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/DBA_TABLESPACE_USAGE_METRICS.html#GUID-FE479528-BB37-4B55-92CF-9EC19EDF4F46)
table. This results in a slight loss of granularity in these metrics, as
values will always be in multiples of the respective tablespace's block
size, but I think the clarity and simplicity is worth the trade off.

Note: The value of the usage metric was generally close to the expected
value, but the limit was being calculated as potential theoretical
capacity, unbound by server capacity. For example, in testing in a
docker container on my local machine, limit was set to **17TB**. This
doesn't line up with user expectations.

**Link to tracking Issue:**
Fixes
#31451

**Testing:** 
Updated existing tests, added a couple new ones.

Also, the original issue filed was comparing
`DBA_TABLESPACE_USAGE_METRICS` output for percent used to what we got
from `usage/limit * 100`. Here's the local testing outputs compared to
show they now line up.
```
2024-03-27T16:31:57.938-0700    info    oracledbreceiver/scraper.go:285 DBA_TABLESPACE_USAGE_METRICS: Tablespace name: SYSTEM, used space: 111288, tablespace size: 3518587, percent used: 3.16286054600895188892586711654422641816    {"kind": "receiver", "name": "oracledb", "data_type": "metrics"}
```

```
Metric #20
Descriptor:
     -> Name: oracledb.tablespace_size.usage
     -> Description: Used tablespace in bytes.
     -> Unit: By
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> tablespace_name: Str(SYSTEM)
StartTimestamp: 2024-03-27 23:31:56.873576 +0000 UTC
Timestamp: 2024-03-27 23:32:12.523295 +0000 UTC
Value: 911671296
```

```
Metric #19
Descriptor:
     -> Name: oracledb.tablespace_size.limit
     -> Description: Maximum size of tablespace in bytes, -1 if unlimited.
     -> Unit: By
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> tablespace_name: Str(SYSTEM)
StartTimestamp: 2024-03-27 23:31:56.873576 +0000 UTC
Timestamp: 2024-03-27 23:32:12.523295 +0000 UTC
Value: 28824264704
```
Doing the same calculation, we get:
```
(911671296 / 28824264704) * 100 = ~3.16%
```
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024
…n-telemetry#32028)

**Description:** 
Values were being scraped incorrectly for the metrics
`oracledb.tablespace_size.limit` and `oracledb.tablespace_size.usage`.
The changes these metrics to be scraped from the
[`DBA_TABLESPACE_USAGE_METRICS`](https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/DBA_TABLESPACE_USAGE_METRICS.html#GUID-FE479528-BB37-4B55-92CF-9EC19EDF4F46)
table. This results in a slight loss of granularity in these metrics, as
values will always be in multiples of the respective tablespace's block
size, but I think the clarity and simplicity is worth the trade off.

Note: The value of the usage metric was generally close to the expected
value, but the limit was being calculated as potential theoretical
capacity, unbound by server capacity. For example, in testing in a
docker container on my local machine, limit was set to **17TB**. This
doesn't line up with user expectations.

**Link to tracking Issue:**
Fixes
open-telemetry#31451

**Testing:** 
Updated existing tests, added a couple new ones.

Also, the original issue filed was comparing
`DBA_TABLESPACE_USAGE_METRICS` output for percent used to what we got
from `usage/limit * 100`. Here's the local testing outputs compared to
show they now line up.
```
2024-03-27T16:31:57.938-0700    info    oracledbreceiver/scraper.go:285 DBA_TABLESPACE_USAGE_METRICS: Tablespace name: SYSTEM, used space: 111288, tablespace size: 3518587, percent used: 3.16286054600895188892586711654422641816    {"kind": "receiver", "name": "oracledb", "data_type": "metrics"}
```

```
Metric open-telemetry#20
Descriptor:
     -> Name: oracledb.tablespace_size.usage
     -> Description: Used tablespace in bytes.
     -> Unit: By
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> tablespace_name: Str(SYSTEM)
StartTimestamp: 2024-03-27 23:31:56.873576 +0000 UTC
Timestamp: 2024-03-27 23:32:12.523295 +0000 UTC
Value: 911671296
```

```
Metric open-telemetry#19
Descriptor:
     -> Name: oracledb.tablespace_size.limit
     -> Description: Maximum size of tablespace in bytes, -1 if unlimited.
     -> Unit: By
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> tablespace_name: Str(SYSTEM)
StartTimestamp: 2024-03-27 23:31:56.873576 +0000 UTC
Timestamp: 2024-03-27 23:32:12.523295 +0000 UTC
Value: 28824264704
```
Doing the same calculation, we get:
```
(911671296 / 28824264704) * 100 = ~3.16%
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working receiver/oracledb
Projects
None yet
2 participants