Why does my z/OS code look like gibberish with git?
TL;DR:
You migrated your code into a git repository in Unix using the wrong code page, and/or forgot to update your .gitattributes file to reflect the change between EBCDIC and UTF-8.
The long answer:
First, some background on character sets. A character set is an encoding system that tells computers how to recognize characters, including letters, numbers, punctuation marks, and whitespace. Historically, countries developed their own character sets to support the scripts used by their languages, such as Kanji, Hebrew, and Bengali.
In the Americas, and many countries in Western Europe, the most common character sets used on the mainframe are IBM code page 37 (IBM-037), IBM code page 1047 (IBM-1047), IBM code page 1140 (IBM-1140), and Unicode (UTF-8).
Before an organization modernizes, it might have all its source in partitioned datasets, which are most commonly encoded in IBM-037. More recently created datasets may be encoded in IBM-1047. The two code pages are very similar and differ in only six characters (more on that in a moment). Unix System Services typically holds a mixture of IBM-1047 files, with any modern tooling using UTF-8. The overlap between IBM-1047 and UTF-8 is sparse at best: most characters have different byte values in the two encodings. Let’s look at an example:
Character | Unicode HEX | IBM-037 HEX | IBM-1047 HEX
G (capital G) | 0047 | C7 | C7 |
^ (circumflex) | 005E | B0 | 5F |
] (close square bracket) | 005D | BB | BD |
[ (open square bracket) | 005B | BA | AD |
Ý (Latin Y with acute) | 00DD | AD | BA |
¨ (diaeresis) | 00A8 | BD | BB |
¬ (logical not) | 00AC | 5F | B0 |
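The Unicode and IBM-037 columns of the table can be cross-checked with a few lines of Python. This is only a sketch: it relies on Python's built-in cp037 codec, the helper name table_row is made up for illustration, and the IBM-1047 column is not checked because the sketch does not assume a codec for that code page is available.

```python
# Characters from the table above, with the IBM-037 hex value the table lists.
TABLE_037 = {"G": 0xC7, "^": 0xB0, "]": 0xBB, "[": 0xBA,
             "Ý": 0xAD, "¨": 0xBD, "¬": 0x5F}


def table_row(ch: str) -> tuple[str, str]:
    """Return (Unicode hex, IBM-037 hex) for a single character."""
    unicode_hex = f"{ord(ch):04X}"
    ibm037_hex = ch.encode("cp037").hex().upper()
    return unicode_hex, ibm037_hex


for ch, expected in TABLE_037.items():
    unicode_hex, ibm037_hex = table_row(ch)
    # Compare the codec's answer with the value listed in the table.
    status = "ok" if ibm037_hex == f"{expected:02X}" else "MISMATCH"
    print(f"{ch}  Unicode {unicode_hex}  IBM-037 {ibm037_hex}  {status}")
```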
Let’s say we migrated a C++ source member that is stored in IBM-037, but we assumed it was stored in IBM-1047, and it contains the following line of code:
int main(int intGain, char const* argv[])
Once migrated from the PDSE into a Unix directory using the DBB migration commands, this will actually become the following string:
int main(int intGain, char const* argvÝ¨)
Notice that the letter “G” looks correct, but the square brackets have been replaced by Ý¨. That is because the utility reads the character “[” (which is hex value BA in IBM-037), but since it is using the wrong character set, maps it to the Ý character, which is what BA means in IBM-1047. In the same way, “]” (hex value BB) is mapped to the ¨ character.
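The misread described above can be reproduced in Python as a rough sketch. The assumptions: Python's built-in cp037 codec represents IBM-037, and IBM-1047 is modelled as cp037 with only the six byte values from the table swapped (control characters are ignored). The function name decode_as_1047 is hypothetical and is not part of any migration tool.

```python
# The six byte values that mean something different in IBM-1047,
# taken from the IBM-1047 column of the table above.
IBM1047_OVERRIDES = {0x5F: "^", 0xBD: "]", 0xAD: "[",
                     0xBA: "Ý", 0xBB: "¨", 0xB0: "¬"}


def decode_as_1047(data: bytes) -> str:
    """Decode bytes as IBM-1047, modelled as cp037 plus six overrides."""
    chars = []
    for b in data:
        if b in IBM1047_OVERRIDES:
            chars.append(IBM1047_OVERRIDES[b])
        else:
            chars.append(bytes([b]).decode("cp037"))
    return "".join(chars)


source_line = "int main(int intGain, char const* argv[])"
stored = source_line.encode("cp037")   # bytes as they sit in the IBM-037 member
misread = decode_as_1047(stored)       # read with the wrong code page
print(misread)
```

Only the brackets change, because every other byte in the line has the same meaning in both code pages.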
Thus, part of the key to a successful migration is making sure that you read the dataset in the correct code page. Make sure you test for the six characters that differ in the table above. Identify source members that contain each of those characters, and then run some test migrations for those members. If they look correct in the Unix environment after migration, you know you have picked the correct code page to migrate from.
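One way to find candidate members for such a test is to scan already-decoded text for the six sensitive characters. The sketch below is an illustration only: find_sensitive_lines is a made-up helper, and it assumes the text has already been read into a Python string.

```python
# The six characters whose byte values differ between IBM-037 and IBM-1047.
SENSITIVE = "^[]Ý¨¬"


def find_sensitive_lines(text: str) -> dict[str, list[int]]:
    """Map each sensitive character to the 1-based line numbers containing it."""
    hits: dict[str, list[int]] = {ch: [] for ch in SENSITIVE}
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in SENSITIVE:
            if ch in line:
                hits[ch].append(lineno)
    return hits


sample = "int main(int argc, char const* argv[])\n{\n    return argc ^ 1;\n}\n"
report = find_sensitive_lines(sample)
for ch, lines in report.items():
    print(repr(ch), lines)
```

In C or C++ source, hits for Ý, ¨, or ¬ after a migration are a strong hint that the wrong code page was used, since those characters rarely appear in code while brackets and circumflexes are common.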
For more information on IBM EBCDIC character sets refer to this table: https://en.wikibooks.org/wiki/Character_Encodings/Code_Tables/EBCDIC/EBCDIC_1047
The next part of migration is pushing your new repository code up to the git server. Git on nearly every cloud provider is built for Unicode. Thus, the code must be converted from EBCDIC to Unicode in flight to the git provider. However, git must understand what it is translating to and from, for example from IBM-1047 to UTF-8. This conversion is defined in the .gitattributes file of every git repository (at least all the ones you have that contain z/OS source code). The format of the .gitattributes file is as follows:
# line endings
* text eol=lf
# file encodings
*.cpy zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.cbl zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.bms zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.pli zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.mfs zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.bnd zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.lnk zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.txt zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.groovy zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.sh zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.properties zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.asm zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.jcl zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.mac zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.json zos-working-tree-encoding=utf-8 git-encoding=utf-8
*.yaml zos-working-tree-encoding=utf-8 git-encoding=utf-8
*.config zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
Comment lines begin with a hash, also known as an octothorpe (#). The asterisk acts as a wildcard. In each rule, zos-working-tree-encoding defines the character set of the z/OS working tree (what the mainframe expects), and git-encoding specifies the character set in which the member is stored in the git repository. Most members, as you can see above, are converted between IBM-1047 and UTF-8, such as COBOL, PL/I, and JCL. Some members, such as JSON and YAML files, remain in UTF-8. Those members never go into a dataset; they are only read by humans and stored in result sets.
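To see which rule applies to a given file, the matching can be approximated in Python. This is a simplified sketch, not git's actual algorithm: it handles only simple filename globs, lets later rules override earlier ones, and ignores directory patterns, negation, and macros. The helper names parse_rules and attributes_for are hypothetical. On a real repository, git check-attr gives the authoritative answer.

```python
from fnmatch import fnmatchcase

# A shortened copy of the rules shown above.
SAMPLE = """\
# line endings
* text eol=lf
# file encodings
*.cbl zos-working-tree-encoding=ibm-1047 git-encoding=utf-8
*.json zos-working-tree-encoding=utf-8 git-encoding=utf-8
"""


def parse_rules(text):
    """Turn .gitattributes text into a list of (pattern, attributes) pairs."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        parsed = {}
        for attr in attrs:
            name, _, value = attr.partition("=")
            parsed[name] = value if value else "set"
        rules.append((pattern, parsed))
    return rules


def attributes_for(filename, rules):
    """Merge the attributes of every rule whose pattern matches the filename."""
    merged = {}
    for pattern, attrs in rules:
        if fnmatchcase(filename, pattern):
            merged.update(attrs)  # later rules override earlier ones
    return merged


rules = parse_rules(SAMPLE)
for name in ("hello.cbl", "build.json", "main.cpp"):
    attrs = attributes_for(name, rules)
    print(name, attrs.get("zos-working-tree-encoding", "NO ENCODING RULE"))
```

A file such as main.cpp matches only the catch-all line-ending rule, so no encoding conversion is declared for it.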
Let’s use another example. Say you have migrated a C++ member into a Unix git repo. It was stored in EBCDIC (IBM-1047) in the repo, as it was in the PDSE. If the file extension is .cpp (which is not listed above), the member might be stored as-is in the git repository and assumed to be UTF-8. Later, when you attempt to build the file, you get an error message during the build, and when you look at the member, it is a mess of gibberish. That is because the member was not converted to UTF-8 before being stored in the git repo, but when the repo was later cloned during a build cycle, a code conversion to IBM-1047 was applied to it. In other words, the letter G was never converted from hex C7 (EBCDIC) to hex 47 (UTF-8), so the later conversion ran on bytes that were not in the encoding it expected and mapped them to the wrong characters. The same kind of mismatch explains why a hex 47 byte, which is G in UTF-8, turns into the å character (Unicode 00E5) when it is read as EBCDIC. Adding a rule for the .cpp extension to .gitattributes, in the same form as the rules above, is what prevents this.
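The byte values in this example can be checked with a short Python sketch. It uses Python's built-in cp037 codec as a stand-in for IBM-1047, on the assumption (consistent with the table earlier) that the two code pages agree on the letter G and on byte 47. The variable names are illustrative only.

```python
# The same letter has different byte values in the two encodings.
utf8_g = "G".encode("utf-8")      # hex 47
ebcdic_g = "G".encode("cp037")    # hex C7

# A UTF-8 "G" byte that is read as EBCDIC becomes a different character.
misread = utf8_g.decode("cp037")

# The intended round trip: EBCDIC in the working tree, UTF-8 in the repo,
# and back to EBCDIC on checkout.
stored = ebcdic_g.decode("cp037").encode("utf-8")
restored = stored.decode("utf-8").encode("cp037")

print(utf8_g.hex(), ebcdic_g.hex(), misread, f"{ord(misread):04X}")
print(restored == ebcdic_g)
```

When both conversions happen, the bytes survive the round trip. When one of them is skipped, the remaining conversion operates on bytes in the wrong encoding, which is where the gibberish comes from.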