Finding something unexpected
After having visualised the “uncharted” parts of dsatnord.mp, our quest to find the index for the tiles continues.
Before we start, let me briefly explain what I mean by “index for the tiles” and what I expect to find: the satellite images of Germany consist of individual tiles that the viewer shows when zooming into a specific region. To find the correct tile for a geographic area, there must be an index that provides coordinates for each tile. My assumption is that for each (square) tile the index contains a coordinate for a fixed position of the tile, for instance, the lower left corner. Visualising those coordinates for all tiles would yield a lattice pattern with the shape of Germany – something like this but much denser:
Of course, other technical solutions are conceivable. Specifically, there is no need to use the lower left corner, or any fixed position for that matter. However, we need some hypothesis to start with, and that is mine.
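To illustrate the hypothesis, here is a minimal sketch of what plotting one fixed corner per tile would look like; a plain square grid stands in for the shape of Germany, and the tile size is invented:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

# One hypothetical lower-left corner per (square) tile: the corners of a
# regular tiling form a lattice.
tile_size = 1.0
corners = [(col * tile_size, row * tile_size)
           for row in range(20) for col in range(20)]

xs, ys = zip(*corners)
plt.scatter(xs, ys, s=1)
plt.gca().set_aspect("equal")
```

With real tile coordinates, the lattice would only cover the area of Germany instead of a full square.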
So let us have a look at the unknown parts of dsatnord.mp to search for the tile index. The third part (un3.dat) looks most promising, since it clearly consists of two parts, each revealing a periodic pattern:
(Spoiler alert: before I analysed that part I actually had a closer look at the first part un1.dat and found data that is related to the tiles, but I will report on that later, since I consider the find described here more remarkable.)
My first goal was to find the byte offset in un3.dat where the first part ends and the second begins. Using Gimp, I found that the second part begins roughly in row 957 (of 2610) and column 890 (of 1024). Since each pixel represents one byte, that translates into the byte offset 1024 * 957 + 890 = 980858. Since un3.dat has 2672062 bytes overall, this is roughly at 37%, which fits with the image. To get multiples of 16, I shortened the second part by 4 bytes, that is, I moved the split from offset 980858 to offset 980862 (we will later see why that makes sense):
dd if=un3.dat of=un3_1.dat bs=4M count=980862 iflag=count_bytes
dd if=un3.dat of=un3_2.dat bs=4M skip=980862 iflag=skip_bytes
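The same split can also be sketched in Python; the payload below is synthetic stand-in data, not the real un3.dat:

```python
# Sanity-check the offset computed from the image.
assert 1024 * 957 + 890 == 980858

# Split a byte string at the (adjusted) offset, like the dd commands above.
SPLIT = 980862
payload = bytes(range(256)) * 5000  # 1,280,000 stand-in bytes for un3.dat

part1, part2 = payload[:SPLIT], payload[SPLIT:]

# The two parts must reassemble to the original data exactly.
assert part1 + part2 == payload
assert len(part1) == SPLIT
```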
Again, we can visualise the results to check whether we split the file correctly:
./src/mp.py -c vis_bytes -o un3_1.png un3_1.dat
./src/mp.py -c vis_bytes -o un3_2.png un3_2.dat
The results (not shown here) look good.
Now my assumption was that the index contains a record (with the
coordinates and possibly other information) of fixed length for each
tile. I started with the second part (un3_2.dat
) since it showed
quite some regularity and performed different analyses to test that
hypothesis. Among those were:
- Creating successive n-byte ints/floats and visualising their correlation using seaborn.pairplot. (This could have led me to the result but it did not work with the whole part, so I used just the first 10 kB of the data, which was not enough to recognise a pattern.)
- Measuring distances between successive n-byte ints/floats and visualising their distribution using histograms (not really helpful) and scatterplots. (The motivation behind that analysis was that tiles of equal size should have approximately equally spaced coordinates, resulting in approximately the same distances between coordinates. The results were some weird patterns which indicated that there must be something regular.)
- Visualising the distribution of the byte values. (I saw some spikes but could draw no real conclusion.)
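The distance analysis from the list above can be sketched as follows; the data is synthetic, and interpreting the bytes as 4-byte little-endian unsigned ints is just one of the variants to try:

```python
import struct

# Synthetic "index": equally spaced coordinates 25 apart, packed as
# little-endian uint32 values, standing in for equally sized tiles.
data = b"".join(struct.pack("<I", 1000 + 25 * i) for i in range(100))

# Decode successive 4-byte ints and measure the distances between them.
ints = [v for (v,) in struct.iter_unpack("<I", data)]
diffs = [b - a for a, b in zip(ints, ints[1:])]

# Tiles of equal size yield (approximately) constant distances.
assert set(diffs) == {25}
```

On real data the diffs would cluster around a few values instead of being exactly constant; a histogram or scatterplot of them makes that visible.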
I did some more analyses along those lines but have not documented them well, so this post focuses on the successful path, something which I had in mind the whole time: computing the autocorrelation between byte values to find the size of each record. My assumption was that the index consists of records for the tiles and each record has a fixed structure, which results in the repeating pattern we saw initially. Autocorrelation can help us find the frequency of the pattern and thus the record length. Here’s what I did:
reading the bytes into a dataframe
import pandas as pd

# read the file byte by byte; each byte becomes one integer value
with open("../un3_2.dat", "rb") as f:
    vals = []
    while (data := f.read(1)):
        vals.append(int.from_bytes(data, byteorder="little", signed=False))
df = pd.DataFrame(vals, columns=["ints"])
computing and plotting the autocorrelation
import statsmodels.tsa.stattools as smtsa
import numpy as np
import matplotlib.pyplot as plt

acf = smtsa.acf(df.ints, nlags=100, adjusted=False, fft=False)
lags = np.arange(len(acf))
plt.rcParams['figure.figsize'] = (10, 5)
plt.vlines([6, 10, 16], -0.2, 0.8, color="lightgrey")
plt.plot(lags[1:], acf[1:])
plt.xlabel("bytes")
plt.xlim(xmin=0)
plt.ylabel("correlation")
plt.show()
The result looks as follows:
We can see a high correlation at 16, meaning there is a repeating pattern every 16 bytes. So that is likely our record size.
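As a toy sanity check of that interpretation (not part of the original analysis): a synthetic byte stream with a fixed byte every 16 positions shows exactly such an autocorrelation peak. NumPy is used here instead of statsmodels to keep the sketch self-contained:

```python
import numpy as np

# 1000 fake 16-byte records: a fixed marker byte at position 0,
# pseudo-random bytes everywhere else.
rng = np.random.default_rng(0)
records = rng.integers(0, 255, size=(1000, 16))
records[:, 0] = 0xFF  # the same byte at the same position in every record
x = records.ravel().astype(float)

def acf(x, lag):
    # plain sample autocorrelation at a single lag
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# The periodic record structure shows up as a peak at the record length.
assert acf(x, 16) > acf(x, 7)
```

At lag 16 the marker bytes line up with each other, at other lags they line up with random bytes, so only lag 16 correlates noticeably.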
Now I built a dataframe with 16 columns: one row per 16-byte record, with each byte represented as an integer value in its own column.
import pandas as pd

bytelen = 16
# read the file record by record; each record yields one row of 16 ints
with open("../un3_2.dat", "rb") as f:
    vals = []
    while (data := f.read(bytelen)):
        vals.append([data[i] for i in range(bytelen)])
df = pd.DataFrame(vals, columns=["i" + str(i) for i in range(bytelen)])
Next, I analysed the different columns, for example, checking their value_counts. I saw that some columns (that is, byte positions) contain only very few (e.g., 1, 2, or 3) different values while others contain all 256 possible byte values. I skip some details here but the next plot (plus the raw numbers) gave me a clue how to proceed:
fig, ax = plt.subplots(4)
for i in range(4):
    df["i" + str(i)].hist(bins=256, ax=ax[i])
    ax[i].set_xlim(0, 256)
plt.show()
Byte 0 looks random, byte 1 looks random, byte 2 looks much less random, and byte 3 contains only three different values (71, 72, 70 with frequencies 62076, 39222, 4402, respectively – unfortunately not visible in the plot). My guess was that these are the four bytes of a number in little-endian order, because the least significant bytes (i.e., the first two) would show high variation while the most significant bytes should be more limited, as the coordinates are restricted to Germany.
I saw a similar pattern with bytes 4 to 7, so I read the first 8 bytes into two 32 bit integers (little endian, unsigned) and visualised them in a scatter plot:
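The decoding step can be sketched like this; the record content below is invented, only the “two little-endian uint32” layout comes from the analysis above:

```python
import struct

# One made-up 16-byte record: two little-endian uint32 coordinates
# followed by 8 not-yet-decoded bytes.
record = bytes([0x10, 0x27, 0x00, 0x00,    # 0x2710 -> 10000
                0x20, 0x4E, 0x00, 0x00,    # 0x4E20 -> 20000
                0, 0, 0, 0, 0, 1, 0, 0])   # remaining 8 bytes, undecoded

x, y = struct.unpack_from("<II", record, 0)
assert (x, y) == (10000, 20000)
```

Doing this for every record and feeding all (x, y) pairs into plt.scatter produces the plot below.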
Surprised? Any idea, what this could be?
Well, that’s definitely not the tile index, I thought. But it’s also definitely not random. It looks like lines … polygons … maybe … wait a second. Let’s rotate this by 180° (and use floats instead of ints, although that came one step later):
The borders of the states of Germany and the main highways!
Since we have just decoded the first 8 bytes of the 16 byte record, the remaining bytes certainly encode more information. For example, byte 13 has just three distinct values with the following frequencies:
| value | frequency |
|---|---|
| 1 | 43150 |
| 0 | 38260 |
| 2 | 24290 |
So it is safe to assume that it encodes three different things. Assigning the colours red, green, and blue to 0, 1, and 2, respectively, we get the following map:
(I fixed the distortion with plt.gca().set_aspect('equal').)
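The colour-coded plot can be sketched as follows; the two 16-byte records here are made up, only the record layout and the colour assignment come from the analysis:

```python
import struct
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

# Colour each point by the value of byte 13 of its record.
colours = {0: "red", 1: "green", 2: "blue"}
records = [
    struct.pack("<II", 100, 200) + bytes(8),                       # byte 13 == 0
    struct.pack("<II", 300, 400) + bytes(5) + b"\x02" + bytes(2),  # byte 13 == 2
]

xs = [struct.unpack_from("<I", r, 0)[0] for r in records]
ys = [struct.unpack_from("<I", r, 4)[0] for r in records]
cs = [colours[r[13]] for r in records]

plt.scatter(xs, ys, c=cs, s=1)
plt.gca().set_aspect("equal")  # equal scaling avoids the distortion
```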
So 0 seems to encode highways, 1 state borders, and 2 the border of Germany (with some exceptions, mainly in the west).
Although for the part un3_2 we analysed here there are still 7 bytes left to decode per record, overall this is a big step forward towards fully understanding the structure of dsatnord.mp. So even though I have (again) not found the tile index (yet), I am very happy about this finding. It was also somewhat unexpected, since the D-Sat 1 CD-ROM contains a file dsat.vec with strings like “A100” and “A10/E30”, which are clearly names of highways. Thus I assumed that this vector data was (only) contained in that file, but that is apparently not the case.
Most of my analyses are contained in this Jupyter Notebook.