LLM:
Experimented with Claude and Gemini models. Both behaved more or less the same way, except that Gemini had a bigger context window, a lower operating cost, and slightly better multimodal understanding than Claude.
Models experimented with:
- Gemini 2.5 Pro
- Gemini 2.5 Flash
- Gemini 2.0 Flash Lite
- Claude 4 Sonnet
Among the Gemini family of models, Gemini 2.5 Flash and Gemini 2.5 Pro performed equally well, except that Gemini 2.5 Pro took more time due to its reasoning capabilities and also cost more.
Overlay:
- Initially I tried capturing the bounding boxes of selected elements such as buttons, anchor tags, links and selected divs. Each bbox had an index mapped to that element’s ID/class/label (anything unique to that element). This list of indices, along with an overlay screenshot showing the bboxes with their indices drawn over them, was provided to the model. The model was also given a special tool called click_by_index, which takes the index of the element the model wants to click. In practice the index to ID/class/label mapping was difficult to maintain, because many elements shared identifiers, which introduced ambiguity. And when a new element appeared as a result of a previous action, it became a challenge to re-index the map and provide it to the LLM again.
- After this I decided to overlay a coordinate grid over the entire screen: a graph-like grid with both the x and y axes marked in units of 0.1. This would give the model a reference for estimating coordinates when it needed them to click an element. In practice the model was not good at estimating the coordinates.
- Finally I combined the best of the two by overlaying each bounding box along with its actual computed coordinates, so that the model does not have to guess them. This yielded better results than the two approaches above.
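A minimal sketch of how the labels for the combined overlay could be computed. It is an illustration under assumptions, not the actual implementation: the `Box` type, the function names, and the choice of normalized 0 to 1 coordinates (matching the 0.1 grid units) are all hypothetical, and the boxes are assumed to come from the page in viewport pixels.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of one clickable element, in viewport pixels (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float


def center_normalized(box: Box, viewport_w: int, viewport_h: int) -> tuple[float, float]:
    """Return the box center as fractions of the viewport size, rounded to 3 decimals."""
    cx = (box.x + box.width / 2) / viewport_w
    cy = (box.y + box.height / 2) / viewport_h
    return round(cx, 3), round(cy, 3)


def build_overlay_labels(boxes: list[Box], viewport_w: int, viewport_h: int) -> list[dict]:
    """Build one label per box: the text to draw next to it and the click target."""
    labels = []
    for box in boxes:
        cx, cy = center_normalized(box, viewport_w, viewport_h)
        labels.append({
            "box": box,
            "center": (cx, cy),              # what the model can pass to a click tool
            "text": f"({cx:.3f}, {cy:.3f})",  # what gets drawn on the screenshot
        })
    return labels


# Assumed example: two elements on a 1280x720 viewport.
boxes = [
    Box(x=96, y=54, width=128, height=36),
    Box(x=608, y=342, width=64, height=36),
]
labels = build_overlay_labels(boxes, viewport_w=1280, viewport_h=720)
for label in labels:
    print(label["text"])
# prints:
# (0.125, 0.100)
# (0.500, 0.500)
```

Because every label carries its own coordinates, there is no index map to keep in sync: after an action changes the page, the boxes can simply be recaptured and the labels rebuilt.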