Authors: Anshu Kumar
Abstract: This paper examines the inherent bias in Large Language Models (LLMs), which are trained on digital data produced primarily by dominant social classes. The paper contends that AI technology enables ‘Algorithmic Marginalization’: the preference for standardized linguistic forms, such as formal English or Sanskritized Hindi, effectively excludes the subaltern dialects and oral traditions of marginalized groups. By misclassifying these non-standardized inputs as ‘low-quality’ or ‘inferior,’ LLMs effectively erase the experiences of a large segment of the Indian population. As a result, social scientists who rely on these AI technologies risk working with datasets that are neither accurate nor representative of the diverse segments of Indian society, since Indian historians and researchers have mostly belonged to dominant upper castes and wrote in a manner that portrayed Indian society as superior to Western societies.