Discussion:
[mdb-dev] mdb-export encoding issue
אריאל קלגסבלד Ariel Klagsbald
2011-09-21 09:55:12 UTC
I hope this is the right place to post a problem like this. I also hope my
diagnosis is correct, that this really is an encoding problem (I'm not
sure).

Well, I have a large mdb file, in which one of the fields contains strings like

0007-20101223-214033-שמות-בגדר_שם.mp3

or

0007-20110714-213442-יום_טוב_שני_של_גלויות.mp3

That is, part English, part numbers and part Hebrew (yes, that's
Hebrew, in case you can't see it in your browser).

When I use mdb-export to extract data from this file, I get the
numbers correctly, but only them. The Hebrew and English parts are
simply missing (even the '3' in the 'mp3' suffix). That is, when I
extract the latter example I get only

0007-20110714-213442

I'll add that other fields contain only Hebrew (e.g.
יום טוב שני של גלויות, יב' תמוז, תשע'א
in the example above), and they seem to be extracted correctly. That
is, I get some gibberish which I guess is the correct data that my
terminal just can't display.

I thought it might be an encoding problem, so I played a bit with
MDB_ICONV, MDB_JET_CHARSET, MDB_JET3_CHARSET and MDB_JET4_CHARSET, but
it made no difference.
The file seems to be JET4 (so mdb-ver claims). I have no idea what
encoding it uses (I don't know how to find out. Any ideas?), but I
guess it's UTF-8 (only a guess).

I'll be grateful for any help!
Ariel.
Nirgal
2011-09-21 10:20:29 UTC
Jet4 always uses Unicode (UCS-2) internally.

Output should be UTF-8, unless you set the environment variable MDBICONV (there is no underscore).
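
One way to tell whether the "gibberish" is well-formed UTF-8, independent of what the terminal can display, is to look at the raw bytes. The following is only a minimal Python sketch: the helper name `looks_like_utf8` and the sample byte strings are made up for illustration and are not part of mdbtools.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample data: the same Hebrew word in two encodings.
word = "שם"
as_utf8 = word.encode("utf-8")             # what UTF-8 output would contain
as_iso_hebrew = word.encode("iso-8859-8")  # a single-byte Hebrew charset

print(looks_like_utf8(as_utf8))
print(looks_like_utf8(as_iso_hebrew))
```

To apply the same check to real output, one could redirect mdb-export to a file and pass the result of `open(path, "rb").read()` to the helper.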
_______________________________________________
mdbtools-dev mailing list
https://lists.sourceforge.net/lists/listinfo/mdbtools-dev
אריאל קלגסבלד Ariel Klagsbald
2011-09-23 06:03:09 UTC
Post by Nirgal
Jet4 always use unicode (UCS2) internally.
Output should be utf-8, unless you set env var MDBICONV (there is no underscore).
I tried again and again, and MDBICONV seems to have no effect (which is
strange in itself). Any more ideas, please? Maybe I'm wrong and it isn't an
encoding problem. Could something else cause mdb-export to ignore half of the
field?


[***@ch ~/]$ setenv MDBICONV UTF-8
[***@ch ~/]$ mdb-export -QHd^ WebStructure.mdb FilePaths | grep '^130419\^'
130419^9817^0113-20000101-010645-45_^Hebrew|HWomen|HinuchYeladimShlomBayit|HinuchYeladim|R0113-5|R0113-2^01/01/00 00:00:00^84^20223203^0113^1^0^45 ᅵᅵזᅵ ᅵטᅵᅵי ᅵᅵטᅵᅵ, ᅵᅵ' ᅵᅵך, ךי'ב^0^0^0^0
[***@ch ~/]$ setenv MDBICONV iso-8859-1
[***@ch ~/]$ mdb-export -QHd^ WebStructure.mdb FilePaths | grep '^130419\^'
130419^9817^0113-20000101-010645-45_^Hebrew|HWomen|HinuchYeladimShlomBayit|HinuchYeladim|R0113-5|R0113-2^01/01/00 00:00:00^84^20223203^0113^1^0^45 ᅵᅵזᅵ ᅵטᅵᅵי ᅵᅵטᅵᅵ, ᅵᅵ' ᅵᅵך, ךי'ב^0^0^0^0
[***@ch ~/]$ setenv MDBICONV nothingatall
[***@ch ~/]$ mdb-export -QHd^ WebStructure.mdb FilePaths | grep '^130419\^'
130419^9817^0113-20000101-010645-45_^Hebrew|HWomen|HinuchYeladimShlomBayit|HinuchYeladim|R0113-5|R0113-2^01/01/00 00:00:00^84^20223203^0113^1^0^45 ᅵᅵזᅵ ᅵטᅵᅵי ᅵᅵטᅵᅵ, ᅵᅵ' ᅵᅵך, ךי'ב^0^0^0^0
[***@ch ~/]$


See? MDBICONV has no effect. The 11th field (it's Hebrew) looks the same
(even if your terminal doesn't show Hebrew, you can see there's no
difference), and the 3rd field is still truncated. Only the numbers appear.



Any help please?!?
Jakob Egger
2011-09-23 06:53:51 UTC
In Jet4, all text is stored using the UCS-2 encoding. However, Access uses a special trick to reduce storage requirements: ASCII characters are stored as single-byte characters, and all others are stored as two-byte characters. A null byte is used to switch between the two encoding methods. In your field, such a null byte will appear between the ASCII text and the Hebrew characters. Apparently, mdb-export treats this null byte as the end of the string, which shouldn't happen.
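
The mode switching described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the mdbtools implementation: the function names and the sample bytes are hypothetical, and the sketch assumes that a field starts in single-byte mode, that two-byte characters are little-endian, and that any header bytes the real on-disk format may carry have already been stripped.

```python
def decode_switching(raw: bytes) -> str:
    """Decode text where a null byte toggles between 1-byte and 2-byte mode."""
    units = bytearray()   # accumulated UCS-2 (little-endian) code units
    single_byte = True    # assumption: fields start in single-byte mode
    i = 0
    while i < len(raw):
        if raw[i] == 0:
            # A null byte is a mode switch, not an end-of-string marker.
            single_byte = not single_byte
            i += 1
        elif single_byte:
            # Widen one stored byte to a full two-byte code unit.
            units += bytes((raw[i], 0))
            i += 1
        elif i + 1 < len(raw):
            # Copy one stored two-byte code unit unchanged.
            units += raw[i:i + 2]
            i += 2
        else:
            break  # dangling odd byte at the end; ignore it
    return units.decode("utf-16-le")


def decode_naive(raw: bytes) -> str:
    """Mimic the suspected bug: treat the first null byte as the end."""
    return raw.split(b"\x00", 1)[0].decode("ascii")


# Hypothetical stored form of a mixed ASCII/Hebrew file name.
sample = b"0007-" + b"\x00" + "שם".encode("utf-16-le") + b"\x00" + b".mp3"

print(decode_switching(sample))
print(decode_naive(sample))
```

With this sample, the naive reader returns only the leading ASCII part, which is the same kind of truncation as in the report, while the mode-aware decoder recovers the Hebrew text and the ".mp3" suffix.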

Which version of mdb-export are you using? There are a lot of old versions "in the wild". It is best to compile the current version from GitHub yourself, so you can be sure you are using a recent one.

I'd be glad to try reading your file with the newest version. If you are interested, just send a copy of the database to my private email address.

Best regards,
Jakob