I went to the NeurIPS 2019 conference in December and focused on NLP and reinforcement learning (RL) topics. The former is what I do for work: analyzing call center conversations, understanding what works in customer interactions, and making suggestions to clients based on their data. The latter is a personal interest that started all the way back when DeepMind beat the world’s best Go players.
At the RL sessions, the tutorials mentioned using imitation learning for natural language understanding tasks and for generating responses to questions or chit-chat. People have had some success, but the approach has common pitfalls, such as repeating the most likely responses and producing responses that are too short. Sometimes the responses are too simple and the bots fall into a cycle of “I don’t understand what you are saying”.
One night at a social gathering, I got to meet Drs. David Silver and Richard Sutton. They briefly mentioned that I could try setting up the RL environment like a conversation and see whether the agents can learn from the conversations. And in one of the workshops related to NLG conversations, people talked about using various rewards to penalize repetitiveness and encourage the generation of varied text.
So that got me thinking. In addition to making a chatbot that mimics the interactions between a call-center agent and a customer, I can design an environment that helps me discover why people are calling. I can set up categories of actions such as “give refund”, “cancel service”, “keep the customer on the phone”, etc., and use a regret (regret = current action – best action, measured in reward), where the reward is 1 if the customer ends up happy and 0 if the customer ends up mad. This may help me find the action that, in a specific situation, leads to the outcome the client wants most.
Granted, this might not directly give me the causal reasons why the customer called, but it is more like designing a potential causal relationship graph before building the model and then testing whether the causal relationship is correct. Even if it does not give me the exact cause, if it gives me the best action to take, then at least I have a product that does what the clients want. So my goal for the new year is to design an environment where I can test this idea and see whether I can find the best type of action to take.
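To make the regret idea above a bit more concrete, here is a minimal sketch in Python. It is not something I have run on real call data: the happy-customer probabilities are made up, and the epsilon-greedy learner is just one simple way to pick actions that I am assuming for illustration. It also writes regret with the conventional sign (best action minus current action), so the number is never negative.

```python
import random

# Action categories taken from the examples in the text.
ACTIONS = ["give refund", "cancel service", "keep the customer on the phone"]

# Made-up probabilities that each action leaves the customer happy (reward 1).
# In a real setting these are unknown and would have to come from call data.
HAPPY_PROB = {
    "give refund": 0.8,
    "cancel service": 0.3,
    "keep the customer on the phone": 0.5,
}


def reward(action, rng):
    """Return 1 if the customer ends up happy, 0 if the customer ends up mad."""
    return 1 if rng.random() < HAPPY_PROB[action] else 0


def expected_regret(action):
    """Expected reward of the best action minus expected reward of this action."""
    best = max(HAPPY_PROB.values())
    return best - HAPPY_PROB[action]


def run_epsilon_greedy(steps=2000, epsilon=0.1, seed=0):
    """Learn action values from 0/1 rewards and track the accumulated regret."""
    rng = random.Random(seed)
    counts = {a: 0 for a in ACTIONS}
    values = {a: 0.0 for a in ACTIONS}
    total_regret = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: values[a])  # exploit
        r = reward(action, rng)
        counts[action] += 1
        # Incremental average of the rewards observed for this action.
        values[action] += (r - values[action]) / counts[action]
        total_regret += expected_regret(action)
    return values, total_regret


if __name__ == "__main__":
    values, total_regret = run_epsilon_greedy()
    print(values)
    print(total_regret)
```

The learned values only say which action tends to work, not why, which matches the caveat above about not getting the causal reason directly.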
To familiarize myself with the Gym framework from OpenAI, I set up my own card game environment and built a double deep Q-learning network. So here are some resources if you want to start your own:
Taking some design hints from https://github.com/zmcx16/OpenAI-Gym-Hearts
Some more help to design the environment: https://datascience.stackexchange.com/questions/28858/card-game-for-gym-reward-shaping
Other projects using gym: https://awesomeopensource.com/projects/openai-gym
Probably already programmed here: https://awesomeopensource.com/project/datamllab/rlcard
Create your own gym environments:
https://github.com/openai/gym/blob/master/docs/environments.md
https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html
https://github.com/openai/gym/blob/master/docs/creating-environments.md
https://towardsdatascience.com/creating-a-custom-openai-gym-environment-for-stock-trading-be532be3910e
https://medium.com/@apoddar573/making-your-own-custom-environment-in-gym-c3b65ff8cdaa
For the conversation task, I need an environment that mimics a conversation. A reset starts a new conversation. Render shows how the conversation has gone up to a given point. Each step is one turn of the conversation, with text randomly chosen from the pool for the same category. One downside I can see with this approach is that the training data might not produce the best solution for a customer service session; it may only find the bare minimum needed to reach the desired outcome. But I hope that, because I can design my own regret in an RL framework, I can penalize things like the length of the conversation or a bad sentiment/emotional outcome while still trying to retain the customer.
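Here is a rough sketch of what such an environment could look like. It only mirrors the Gym-style reset/step/render interface in plain Python and does not inherit from gym.Env, so it runs without Gym installed. The utterance pools, the three action ids, the outcome rule, and the per-turn penalty are all placeholders I am assuming for illustration, not real data or a finished design.

```python
import random


class ConversationEnv:
    """Toy conversation environment with a Gym-style reset/step/render interface."""

    # Hypothetical utterance pools per action category (placeholders, not real data).
    AGENT_POOLS = {
        0: ["I can refund that charge for you.", "Let me issue a refund."],
        1: ["I can cancel the service today.", "I'll process the cancellation."],
        2: ["Can you tell me more about the issue?", "Let me look into that for you."],
    }
    CUSTOMER_POOL = ["I was charged twice.", "This is not working for me.", "Okay."]

    def __init__(self, max_turns=6, turn_penalty=0.05, seed=None):
        self.max_turns = max_turns
        self.turn_penalty = turn_penalty
        self.rng = random.Random(seed)
        self.history = []

    def reset(self):
        """Start a new conversation with a customer opening line."""
        self.history = [("customer", self.rng.choice(self.CUSTOMER_POOL))]
        return self._observation()

    def step(self, action):
        """Play one agent turn; return (observation, reward, done, info)."""
        self.history.append(("agent", self.rng.choice(self.AGENT_POOLS[action])))
        agent_turns = sum(1 for speaker, _ in self.history if speaker == "agent")
        # Assumed outcome rule: a refund (action 0) retains the customer,
        # a cancellation (action 1) loses the customer, action 2 keeps talking.
        retained = action == 0
        done = action in (0, 1) or agent_turns >= self.max_turns
        # Every turn costs a little, which penalizes long conversations.
        reward = (1.0 if retained else 0.0) - self.turn_penalty
        if not done:
            self.history.append(("customer", self.rng.choice(self.CUSTOMER_POOL)))
        return self._observation(), reward, done, {"retained": retained}

    def render(self):
        """Print the conversation up to the current point."""
        for speaker, text in self.history:
            print(f"{speaker}: {text}")

    def _observation(self):
        # The observation is simply the most recent utterance.
        return self.history[-1][1]


if __name__ == "__main__":
    env = ConversationEnv(seed=0)
    env.reset()
    env.step(2)  # keep the customer talking
    env.step(0)  # then offer a refund
    env.render()
```

A sentiment or emotion score could be folded into the reward in the same place as the turn penalty, but that would need a real sentiment model, which this sketch leaves out.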
I encountered a similar situation before when I was making a chatbot using Rasa. They have a simpler RL environment, where a user can choose which route to take in a certain situation. But when I used it, the policy was too simple and did not achieve what I wanted, and in particular it did not give me a causal relationship. I hope this idea can be integrated into that framework and make it more useful.