Automating Literature Mining and Triangulation with AI and Knowledge Graphs

University of Bristol

About the Project

We seek a postgraduate researcher with an interest in the use of computational and data science methods for data mining of biomedical literature and evidence triangulation. You will join an interdisciplinary research team providing a rich and dynamic training environment as part of an academic-industry collaboration. This project will apply cutting-edge artificial intelligence (AI) approaches to address “information overload” in the pharmaceutical industry.

This 4-year University of Bristol PhD scholarship is fully funded by Roche, a global biotech company, a leading provider of in-vitro diagnostics and a global supplier of transformative innovative solutions across major disease areas. The project will be jointly supervised by an interdisciplinary team from the MRC Integrative Epidemiology Unit (MRC IEU) and the Department of Engineering Mathematics at the University of Bristol, as well as from Roche.

Background:

Around 1.7 million new articles were published in PubMed in 2022, with a further 47,000 pre-prints posted on bioRxiv and medRxiv in the same period. Some of these contain information relevant to drug development, but filtering out the numerous less relevant publications is a significant challenge. In this PhD project you will develop novel natural language processing tools and methods to address this challenge, working with pharmaceutical industry stakeholders to tailor these to their needs.

Project aims:

The aim of this project is to develop and apply NLP tools (including language models and other machine learning methods) and evidence triangulation to improve the efficiency and accuracy of knowledge extraction from biomedical literature. This will be achieved through 3 objectives:

  • Objective 1: improve named entity recognition (NER) and the recognition of assertions in biomedical texts.
  • Objective 2: develop approaches to integrate and triangulate literature assertions with other biomedical evidence using existing knowledge graphs and other bioinformatic, clinical or real-world evidence.
  • Objective 3: develop ways to use your tools to rank and filter both assertions and their linked publications according to pharmaceutical user requirements.

Methods:

You will apply machine learning and natural language processing methods, including traditional methods (such as Word2Vec) and modern approaches based on large language models (such as BERT and LLaMA), to develop named entity recognition methods based on biomedical terminologies (e.g. ICD, MeSH, UMLS) and ontologies (e.g. EFO, HPO, GO). You will also use existing data to develop a bespoke terminology for pharmaceutical industry users, and evaluate its performance in named entity recognition against more established terminologies and ontologies.
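As a rough illustration of terminology-based named entity recognition, the simplest baseline is a dictionary lookup over the text. The sketch below is only that baseline, not the project's method: the terminology entries, the concept identifiers and the function name find_mentions are all hypothetical placeholders, and a real system would use a full terminology and a trained model.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    start: int       # character offset where the mention begins
    end: int         # character offset just past the mention
    text: str        # surface form as it appears in the text
    concept_id: str  # identifier of the matched terminology entry


# Hypothetical mini-terminology: surface form -> concept identifier.
# The identifiers are placeholders, not real MeSH or UMLS codes.
TERMINOLOGY = {
    "type 2 diabetes": "CONCEPT:0001",
    "diabetes": "CONCEPT:0002",
    "metformin": "CONCEPT:0003",
}


def find_mentions(text, terminology):
    """Return terminology matches in `text`, preferring longer terms."""
    # Longest terms first, so "type 2 diabetes" wins over "diabetes".
    terms = sorted(terminology, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        Mention(m.start(), m.end(), m.group(0), terminology[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


mentions = find_mentions("Metformin is used to treat type 2 diabetes.", TERMINOLOGY)
for mention in mentions:
    print(mention.text, mention.concept_id)
```

A baseline of this kind is mainly useful as a point of comparison when evaluating a bespoke terminology against established ones.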

You will use a variety of data resources from biomedical knowledge bases (such as EpiGraphDB and Open Targets) and databases containing specific evidence (such as OpenGWAS, ChEMBL, DrugBank, STRING and Reactome) to investigate the triangulation of bioinformatic evidence against the assertions you extract from literature text. You will also work with several database technologies such as Neo4j and PostgreSQL, as well as web technologies such as GraphQL.
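One simple way to picture triangulation is to treat each literature assertion as a (subject, predicate, object) triple and ask which independent evidence sources contain the same triple. The sketch below assumes this triple representation; the entity names, source labels and the function name triangulate are hypothetical and do not reflect the schema of any of the resources named above.

```python
from collections import defaultdict

# Hypothetical evidence records: (subject, predicate, object, source).
EVIDENCE = [
    ("GENE_A", "associated_with", "DISEASE_X", "gwas"),
    ("GENE_A", "associated_with", "DISEASE_X", "pathway"),
    ("DRUG_B", "targets", "GENE_A", "drug_db"),
]


def triangulate(assertions, evidence):
    """Map each assertion triple to the sorted list of sources supporting it."""
    support = defaultdict(set)
    for subject, predicate, obj, source in evidence:
        support[(subject, predicate, obj)].add(source)
    # Assertions with no matching evidence map to an empty list.
    return {a: sorted(support.get(a, set())) for a in assertions}


literature_assertions = [
    ("GENE_A", "associated_with", "DISEASE_X"),
    ("GENE_C", "associated_with", "DISEASE_X"),
]
for triple, sources in triangulate(literature_assertions, EVIDENCE).items():
    print(triple, "supported by", len(sources), "source(s):", sources)
```

Exact matching of triples is a strong simplification; in practice entities would first need to be mapped to shared identifiers before sources can be compared.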

Finally, you will use existing data comprising pharma-relevant publications to develop approaches to prioritise unseen publications based on the assertions you have extracted and other knowledge you have integrated. This will involve using machine learning for both regression and classification of assertions and publications with the aim of producing a ranked subset of relevant assertions and publications for the end user. End users may range from researchers who seek to improve diagnoses or therapies to physicians who need to select the most promising treatments or clinical trials for their patients.
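To make the ranking step concrete, the sketch below assumes that an upstream model has already assigned each extracted assertion a relevance score, and ranks publications by their best-scoring assertion. The scores, publication identifiers, the choice of the maximum as the aggregate and the function name rank_publications are all illustrative assumptions.

```python
def rank_publications(assertion_scores, top_k=None):
    """Rank publications by the highest relevance score among their assertions.

    `assertion_scores` is an iterable of (publication_id, score) pairs,
    one pair per extracted assertion.
    """
    best = {}
    for pub_id, score in assertion_scores:
        best[pub_id] = max(score, best.get(pub_id, float("-inf")))
    # Highest score first; ties broken by publication id for stable output.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical scores, as if produced by a trained classifier.
scores = [("pub1", 0.2), ("pub2", 0.9), ("pub1", 0.7), ("pub3", 0.4)]
for pub_id, score in rank_publications(scores, top_k=2):
    print(pub_id, score)
```

Other aggregates, such as the mean score or the number of assertions above a threshold, could be substituted depending on what end users consider relevant.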

Candidate requirements:

We strongly encourage applications from STEM and/or health disciplines with experience in computer or data science (e.g. mathematics, statistics, computer science, life or natural sciences, economics, social sciences or another related quantitative discipline). You will need to demonstrate your ability to conduct research using computational methods, or a strong motivation to learn those methods.

Applications are sought from high performing individuals who have, or are expected to obtain, at least a 2.1 degree (or equivalent). Possession of a relevant Master’s degree or research experience would be advantageous but is not required.

We welcome applications from those with non-standard qualifications who can demonstrate knowledge, experience and skills developed in the workplace, or elsewhere, relevant to the programme of study.

How to apply:

When applying, candidates must select the Population Health PhD programme and enter supervisor names as listed under the project title for which they are applying. Please state IEU funding in the funding box. Full details on what to include in your application can be found in the Admissions Statement.

Personal statement: Please also provide a personal statement that describes your training and experience so far, your motivation for doing a PhD, your motivations for applying to the University of Bristol, and why you think we should select you. We are keen to support applicants from minority and under-represented backgrounds (based on protected characteristics) and those who have experienced other challenges or disadvantages. We encourage you to use your personal statement to ensure we can take these factors into account.

Funding Notes:

The studentship is fully funded by Roche for four years. The funding covers tuition fees for home students, a stipend at UKRI rates (currently £19,237 for 2024/25), a stipend uplift of £4,000 per year and a budget of £12,000 for research costs across the four years.

Overseas students are welcome to apply but you must pay the difference between the home and overseas fees. You must state clearly on your application how you will be paying the difference.

Deadline:

Applications for this project will close at 12pm on Thursday 3rd October. Interviews will be held in October. The anticipated start date is 13th January 2025.

Contact:

For queries regarding the project please contact Yi Liu (yi6240.liu[at]bristol.ac.uk). For queries regarding the PhD application and admission process please contact .
