This paper introduces SocialGPT, a modular framework that combines the perception capabilities of vision foundation models (VFMs) with the reasoning capabilities of large language models (LLMs) to recognize social relationships between people in images. Unlike previous methods, which train a dedicated network end-to-end, SocialGPT first uses VFMs to translate image content into a textual social story and then uses LLMs to reason over that story. The paper also proposes Greedy Segment Prompt Optimization (GSPO), a prompt optimization method that improves LLM performance by running a gradient-guided greedy search at the segment level. Without any additional model training, SocialGPT achieves competitive results on two datasets and produces interpretable answers, giving language-based explanations for its decisions.
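
To make the idea of a greedy search at the segment level more concrete, the following is a minimal sketch and not the paper's implementation. It assumes the prompt has already been split into segments and that a small set of candidate replacements is available for some of them. The names `greedy_segment_search`, `toy_score`, and the candidate dictionary are hypothetical. The gradient guidance that GSPO uses is not reproduced here; a plain scoring function stands in for it, so this shows only the greedy, segment-by-segment part.

```python
from typing import Callable, Dict, List, Sequence


def greedy_segment_search(
    segments: List[str],
    candidates: Dict[int, Sequence[str]],
    score: Callable[[List[str]], float],
    rounds: int = 2,
) -> List[str]:
    """Greedily swap one prompt segment at a time.

    A swap is kept only if it strictly raises the score. This is a
    hypothetical stand-in: real GSPO uses gradient information to guide
    the choice of candidates, which this sketch does not do.
    """
    best = list(segments)
    best_score = score(best)
    for _ in range(rounds):
        improved = False
        for idx, options in candidates.items():
            for option in options:
                # Build a trial prompt that differs from `best` in one segment.
                trial = best[:idx] + [option] + best[idx + 1:]
                trial_score = score(trial)
                if trial_score > best_score:
                    best, best_score = trial, trial_score
                    improved = True
        if not improved:
            break  # No single-segment swap helps any more.
    return best


def toy_score(prompt_segments: List[str]) -> float:
    """Toy objective (an assumption): count task-related keywords in the prompt."""
    text = " ".join(prompt_segments).lower()
    return float(sum(word in text for word in ("social", "relationship", "story")))


segments = ["You are an assistant.", "Read the text.", "Answer the question."]
candidates = {
    0: ["You are an expert in social relationships."],
    1: ["Read the social story carefully."],
}
result = greedy_segment_search(segments, candidates, toy_score)
print(result)
```

In a real setting the scoring function would measure task performance of the LLM on labeled examples rather than count keywords; the toy objective is used here only so that the sketch runs without any model.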