We have had the Thali postcard demo ready for quite a long time, and while using it, it has been easy to see that Wi-Fi Direct discovery is really not up to par. It does work from time to time, but most often it fails to discover some peers, or occasionally it simply takes way too long to discover anything. On top of this, it very often also stops seeing peers that are still around and marks them unavailable, thus making it impossible to communicate with them.
One alternative approach for discovery would be to use BLE. As the data we need to exchange can’t fit in the scan record (it has a total limit of 31 bytes), we can’t simply use a beacon approach; instead we would need a real connectable peripheral, which would serve the data via a GATT service.
When I was going over this earlier in the summer, we did not really have a good view of what hardware requirements BLE would pose. In essence, the BLE central role (i.e. scanning) has been working since API level 18, though some might argue that it’s crappy with API level 18 and that you should actually use API level 19.
But with peer discovery we also need to be able to advertise our presence, and for this we need the BLE peripheral role to be supported. This is supported from API level 21 (Lollipop), and at the same time a new scanning API was released, bringing better battery usage as well as supposedly tons of improvements. So far, in my simple tests, I have not seen any big differences between the BLE scanner APIs, though with the KitKat API you do need to write your own parser for the scan record.
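To give an idea of what such a parser involves, here is a minimal sketch in plain Java. It assumes the raw bytes are what the KitKat-era onLeScan callback hands over as its scanRecord argument, and it relies only on the standard advertising layout of repeated [length][type][data] structures. The names ScanRecordParser, AdStructure and parse are made up for illustration and are not taken from the app.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative parser for the raw BLE advertising payload. */
final class ScanRecordParser {

    /** One AD structure: a type byte plus its payload bytes. */
    static final class AdStructure {
        final int type;
        final byte[] data;

        AdStructure(int type, byte[] data) {
            this.type = type;
            this.data = data;
        }
    }

    /** Splits a scan record into its [length][type][data...] structures. */
    static List<AdStructure> parse(byte[] scanRecord) {
        List<AdStructure> result = new ArrayList<>();
        int index = 0;
        while (index < scanRecord.length) {
            // The length byte counts the type byte plus the payload.
            int length = scanRecord[index] & 0xFF;
            if (length == 0) {
                break; // zero length means the rest is padding
            }
            int end = index + 1 + length;
            if (end > scanRecord.length) {
                break; // truncated structure, ignore the remainder
            }
            int type = scanRecord[index + 1] & 0xFF;
            byte[] data = Arrays.copyOfRange(scanRecord, index + 2, end);
            result.add(new AdStructure(type, data));
            index = end;
        }
        return result;
    }
}
```

With this, finding our service is a matter of looking for the structure with type 0x07 (complete list of 128-bit service UUIDs) and comparing its 16 bytes with our UUID; the flags live in the structure with type 0x01.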
So from the API point of view, the requirement is that the device must support Lollipop. We then soon discovered that not all Lollipop devices support being a BLE peripheral. For example, the LG Nexus 5 did have support for the API in early pre-launch versions of Lollipop, but at least by 5.0.1 it was removed. Supposedly the reason is that its hardware does not support being a BLE peripheral & central at the same time.
For Thali discovery, the requirement indeed is that the device must be able to simultaneously advertise its presence and look for other devices advertising theirs. A quick look into the Bluetooth specs made it clear: it appears that Bluetooth 4.1 lets any device be both a peripheral and a hub at the same time. A small change in terminology there, but that’s what was needed. Thus the revised device spec was that all devices must support Lollipop as well as have Bluetooth 4.1 hardware. Back in the summer when I made my first list, I found fewer than 10 devices that fit these specs; luckily the number of devices is increasing rapidly.
Now that I knew which devices I would need, the next task was to write the code for advertising & discovery, and to do some testing to see how well the API behaves with the intended usage. The resulting app can be found on GitHub.
The app’s logic got a bit complicated. I had a good number of issues with BLE connections & characteristic reading. Basically, when you design your logic, you should remember:
- Only one request of any kind can be active at any given point
- If there is an error, you must have a short break before re-attempting anything, otherwise the device might get into a state from which it cannot recover.
- With some devices, it appears that you should run everything in one thread (the UI thread might be preferred)
- Don’t ever call gatt.close() right after gatt.disconnect(). This will cause an internal error inside the API. Instead, simply call disconnect, and do the close inside the onConnectionStateChange callback
And even when remembering all of these you’d still end up getting occasional 0x85 / 133 GATT_ERROR errors, after which you should take a break (30-60 seconds should be fine) before you do anything else with BLE.
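The "one request at a time" rule and the break after an error can be captured in a small gate object that every BLE call goes through. The following is only a sketch and not the code from the app: BleRequestGate, tryBegin and finish are hypothetical names, and the current time is passed in as a parameter so the class has no Android dependencies.

```java
/** Illustrative gate: one active request at a time, plus a cool-down after errors. */
final class BleRequestGate {
    private final long cooldownMillis;
    private boolean busy = false;
    private long blockedUntilMillis = 0;

    BleRequestGate(long cooldownMillis) {
        this.cooldownMillis = cooldownMillis;
    }

    /**
     * Returns true and marks the gate busy if a new request may start now.
     * Returns false while another request is active or the cool-down is running.
     */
    synchronized boolean tryBegin(long nowMillis) {
        if (busy || nowMillis < blockedUntilMillis) {
            return false;
        }
        busy = true;
        return true;
    }

    /** Marks the active request as done; a failure starts the cool-down. */
    synchronized void finish(boolean success, long nowMillis) {
        busy = false;
        if (!success) {
            blockedUntilMillis = nowMillis + cooldownMillis;
        }
    }
}
```

A caller would ask tryBegin before each connect, read or write, and call finish from the matching callback, passing false whenever the reported status is an error such as 133. A cool-down of 30000 to 60000 milliseconds would match the break suggested above.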
The logic for discovery in the example app is to
- first find the peer via BLE scanning,
- then connect to it, after which we
- request the characteristics, find ours, and
- read the characteristic’s value
An important point here is that we should always connect to a freshly discovered peer. If we store a discovered peer for later connections, and we try to connect after it has moved out of reach, we would get a connect error, which would require us to take a break from API usage, making our discovery slower.
In initial tests I had 3 devices meeting the specs doing full discovery rounds once a minute, and determined that all worked just fine. There were errors with some connections, but retries did succeed eventually, and I could run the tests for hours without problems.
Then, to make things more interesting, I also booted up 5 additional phones which were doing only discovery (they did not have Bluetooth 4.1, and thus could not run the advertising API), and straight away I started to see more errors with the original 3 devices. After running the tests for several days, I did not find any way of making the logic work reliably.
In all tests, after just a couple of hours of running, one or two of the 3 devices doing both advertising & discovery ended up in error situations from which they never recovered. Most often the connection request to the GATT returned the 0x85 / 133 GATT_ERROR, and I did not find any nice way to get the app to recover and start succeeding at discovery.
The issue appears not to happen if we do less work on BLE, so I changed the logic with this in mind. The first thing to do is to save all discovered devices’ data, and if we see the same device again, we simply use the old data, which reduces the work we do with BLE.
With Android, the BLE address changes every now and then; the actual interval depends on the manufacturer, but it’s guaranteed to change after x seconds. The BLE address also changes every time we stop & start BLE advertising, so if we need devices to redo the full discovery, we have a nice mechanism for getting it done.
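The caching idea can be sketched as a map keyed by the BLE address. Because the address changes when advertising is restarted, a peer that wants to be rediscovered simply shows up under a new key and misses the cache. This is a simplified illustration, not the app’s code: DiscoveryCache and resolve are hypothetical names, the discovery data is represented as a plain String, and the expensive connect-and-read step is passed in as a function.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative cache of discovery data, keyed by the peer's current BLE address. */
final class DiscoveryCache {
    private final Map<String, String> dataByAddress = new HashMap<>();

    /**
     * Returns the discovery data for the given address.
     * The expensive BLE read runs only when the address has not been seen before.
     */
    String resolve(String bleAddress, Function<String, String> bleRead) {
        String cached = dataByAddress.get(bleAddress);
        if (cached != null) {
            return cached; // same address seen again: no BLE work needed
        }
        String fresh = bleRead.apply(bleAddress);
        if (fresh != null) {
            dataByAddress.put(bleAddress, fresh); // failed reads are not cached
        }
        return fresh;
    }
}
```

A real implementation would probably also expire old entries, since addresses that have rotated away would otherwise stay in the map forever.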
Then I decided that at the same time as I read the characteristic on the other device, I could also do a write and give it my discovery data. That way it would not need to connect to me to get my data, and thus it would be doing less work on BLE. I’m calling this Discovery Push via BLE.
Then I realized that I actually have some unused bytes in the BLE scan record. I only have the flags there, and the full-length UUID. Thus I had enough space left to add the 6-byte Bluetooth address to it, which would allow me to make insecure Bluetooth connections for exchanging the discovery data.
Thus I revised the advertiser code to include the Bluetooth address as service data, and in the discovery part I added code that makes an insecure Bluetooth connection to the advertising device if the BLE discovery fails. This also reduces the work we do with BLE.
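To see why the address fits, it helps to count bytes. Every field in the advertisement costs one length byte and one type byte on top of its payload. The sketch below does that arithmetic and converts a textual Bluetooth address into the 6 bytes to advertise. It assumes the service data is keyed by a 2-byte (16-bit) UUID, which is my assumption and not something I have checked against the app; AdvertisementBudget and its methods are hypothetical names.

```java
/** Illustrative byte budget for a 31-byte BLE advertisement. */
final class AdvertisementBudget {
    static final int MAX_PAYLOAD = 31;

    /** Size of one field: length byte + type byte + payload. */
    static int adSize(int payloadBytes) {
        return 2 + payloadBytes;
    }

    /** Bytes left after the flags (1 byte) and one 128-bit UUID (16 bytes). */
    static int remainingAfterFlagsAndUuid128() {
        return MAX_PAYLOAD - adSize(1) - adSize(16);
    }

    /** Size of a service data field, assuming a 2-byte UUID key (an assumption). */
    static int serviceDataSize(int dataBytes) {
        return adSize(2 + dataBytes);
    }

    /** Converts an address such as "00:1A:7D:DA:71:13" into its 6 raw bytes. */
    static byte[] addressToBytes(String address) {
        String[] parts = address.split(":");
        if (parts.length != 6) {
            throw new IllegalArgumentException("expected 6 octets: " + address);
        }
        byte[] out = new byte[6];
        for (int i = 0; i < 6; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }
}
```

Under these assumptions the flags take 3 bytes and the UUID 18, leaving 10 of the 31 bytes, and a service data field carrying the 6-byte address takes exactly 10. So it fits, but with no room to spare.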
With insecure Bluetooth connections, I could also do discovery push, similarly to what I do with BLE characteristics. The problem here is that we are using the BLE address as an identifier for the peer. With BLE we can see the address inside the callback that gets called when the write is issued. But with insecure Bluetooth, we of course only see the ‘normal’ Bluetooth address at the receiving end. And as I have not so far figured out a way of reading my own BLE address, we can’t use this method for pushing discovery, unless we can deal with the lack of a BLE address.
The following image illustrates the resulting logic for the discovery.
With the final logic, at least I could run the discovery code for hours without any of the devices going into an irreversible bad state, though I only had 3 devices acting as both advertiser & scanner, and 5 additional devices doing only scanning. Thus more testing with a higher number of devices would be needed before I would trust the discovery code. Anyway, I’m currently not working on Thali, so that work would need to be done by somebody else.