Microsoft’s director of strategic engagements shows off company’s work applying NLP to genetics and other issues impacting humanity
The hall of several hundred people listened attentively as the speaker explained, in concise yet pointed detail, how big data was upending the biotech industry. She was showing off the way that one of her startups, Miroculus, had begun to use natural language processing (NLP) to match microRNA, short non-coding RNA molecules that suppress protein production and thereby regulate how genes are expressed, with the genes and diseases they affect.
“In order to have these genes observable and expressed, there are processes,” Limor Lahiani recently said on stage at Geektime’s DevFest, describing the process of developing APIs to analyze the corpus of data, manage the extraction, and then classify the information.
Lahiani is director of strategic engagements and manager of Microsoft’s Partner Catalyst program, a partnership scheme with startups and promising developers to work on cutting-edge tech projects. To date, they have largely focused on big data.
She is hunting for difference-making technologies, whether they serve a philanthropic purpose or pioneer new trends in technology.
Miroculus has built an API tasked with entity extraction, pointing out and segmenting data from tens of thousands of papers, then graphing the data.
She described the entire process, pointing out various startups that have worked on different elements of the project: scheduling, delta querying, document processing, classifying, connecting relationships, then graphing the data.
At the heart of their work is building the NLP classifier, which segments language based on a “bag of words,” syntax and word embedding.
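To make the “bag of words” idea concrete, here is a minimal sketch that uses only Python’s standard library. It is not the project’s actual classifier, which reportedly relies on scikit-learn for this step; the Naive Bayes model, the training sentences and the “relation”/“other” labels are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercase the text, split it into tokens, and count each token."""
    return Counter(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts, add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            bag = bag_of_words(text)
            self.word_counts[label].update(bag)
            self.vocab.update(bag)
        return self

    def predict(self, text):
        bag = bag_of_words(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Start from the log prior, then add one log likelihood per token.
            score = math.log(n_docs / total_docs)
            for word, count in bag.items():
                p = (self.word_counts[label][word] + 1) / (total_words + len(self.vocab))
                score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training sentences and labels, for illustration only.
train_texts = [
    "miR-21 expression regulates the target gene in tumor cells",
    "the microRNA suppresses gene expression in breast cancer",
    "patients were recruited at three hospitals over two years",
    "the survey was mailed to participants in the spring",
]
train_labels = ["relation", "relation", "other", "other"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("this microRNA regulates gene expression")
```

Word order is thrown away entirely, which is why a bag of words is typically paired with syntactic features and word embeddings, as the classifier described here is.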
Altogether, Partner Catalyst’s collaborating startups have built an API on the back of training machine learning models to cluster information and detect anomalies (unsupervised learning), and predict results and extrapolate conclusions (supervised learning) based on the data presented.
A number of open-source tools handle different segments of the big data/NLP project: spaCy covers syntactic tagging, machine learning library scikit-learn handles the ‘bags of words,’ and open-source topic modeling toolkit Gensim handles word embedding. Python library TextBlob is used to break down sentences.
Most importantly, they also use GNAT to match miRNA with genes in the National Center for Biotechnology Information (NCBI) database.
Some of those achievements include picking out previous research thought relevant to certain diseases or genetic markers, as well as connecting microRNA markers to genes known to cause said diseases (such as the connection of microRNA-146a to BRCA1, the latter a gene long connected to breast cancer in women).
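The step of connecting microRNA mentions to genes and graphing the result can be sketched roughly as follows. This is a standard-library illustration under loud assumptions: the regular expression, the four-entry gene lexicon (a stand-in for a real gene normalizer such as GNAT) and the sample sentences are all invented here, and counting sentences where two names appear together only suggests, rather than proves, a biological relationship.

```python
import re
from collections import defaultdict

# Matches mentions such as "microRNA-146a", "miR-21" or "miRNA-155".
MIRNA_PATTERN = re.compile(r"\b(?:microRNA|miRNA|miR)-(\d+[a-z]?)\b", re.IGNORECASE)

# Hypothetical mini-lexicon mapping surface forms to one canonical symbol.
# A real pipeline would rely on a gene normalizer instead.
GENE_LEXICON = {"brca1": "BRCA1", "brca-1": "BRCA1", "tp53": "TP53", "p53": "TP53"}
_names = sorted(GENE_LEXICON, key=len, reverse=True)
GENE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _names) + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_graph(documents):
    """Return {miRNA: {gene: number of sentences mentioning both}}."""
    graph = defaultdict(lambda: defaultdict(int))
    for doc in documents:
        for sentence in split_sentences(doc):
            # Normalize every mention to one spelling before counting.
            mirnas = {"miR-" + m.lower() for m in MIRNA_PATTERN.findall(sentence)}
            genes = {GENE_LEXICON[g.lower()] for g in GENE_PATTERN.findall(sentence)}
            for mirna in mirnas:
                for gene in genes:
                    graph[mirna][gene] += 1
    return graph


# Invented sentences, for illustration only.
docs = [
    "MicroRNA-146a binds BRCA-1 transcripts. The cohort was small.",
    "We observed that miR-146a and BRCA1 levels were inversely correlated.",
    "miR-21 was elevated, while TP53 was unchanged in the same samples.",
]
graph = build_graph(docs)
```

The nested dictionary acts as a weighted edge list: each microRNA is a node, each gene it co-occurs with is a neighbor, and the count is the edge weight a graphing layer could draw.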
Microsoft’s outreach to problem solvers
Partner Catalyst has five main centers for its team: one in Israel, one in the UK, and three in the US. They have open-sourced a lot of their code on GitHub.
“I’m part of a global team. A lot of companies Microsoft is working with are really engaging together to solve a problem together. What are the new problems that are out there?” Lahiani asks.
Lahiani has been with Microsoft since 2008 and served as a senior or principal software engineer for Innovational Labs, Bing Mobile IL, App Recommendations, and Cortana.
She says she prefers working with startups and innovation-specific teams instead of product teams. Why? Product teams have to focus all their creativity in a box, whereas startup teams usually look to make something new and unchallenged, something marketable based on its uniqueness.
“We’re kind of jumping from one problem to another, interesting and innovative problems that can be learned and applied to others.”
The new job gave her the chance to retrace her steps to work with Miroculus. She has a long list of startups in her Rolodex in Israel, including computer vision company Percepto and defense and AI company Roboteam. There are more, but they are still operating under the radar.
“I’m fascinated with problems that actually matter. I really like to have an emotional connection to the problem or the team, [something] meaningful to people. I’m trying not to do just another ad click,” Lahiani says.
From the smart era to the intelligence era
Recruiting teams to donate time toward these projects isn’t hard, she argues. If they don’t have the time, they will know right away. Otherwise, they see the altruistic value, or brand value, in being part of such a project.
“It’s not me convincing them. It’s a win-win situation. If we find a mutual interest we engage. I really believe it’s about the people themselves.
“We always have a skin in the game, that they’re also invested. That their people are also working with us. We want to learn not just how to solve the problems,” she notes.
She points to chatbots: not such a sexy topic to the layman, but a big deal in the world of language processing.
“We are learning to speak naturally with software. If we used to unlearn our natural language when we had search engines, now we’re kind of relearning we can speak in a more natural way to access information.”
Their system is built specifically to handle genetic data, DNA, and diseases. They only gather information from peer-reviewed articles, so no fuzzy news reports interfere with the data.
“The next step is creating a confidence level based on some features of the research itself,” she says. “By the way, for example, if you’re analyzing law documents, case law, they’re trusted. No one fakes any data.”
It’s a curious parallel though, as the New York Times recently showed how the sugar industry had bribed researchers for decades to shift the blame for health problems in the US onto fat. That might undermine an entire corpus of research built off certain assumptions. If that were to happen, it would take a shakeup of the algorithm to correct it.
So too have legal experts been found to have biases or been in the pockets of defendants, later disqualifying much of their case law.
While the AI isn’t yet sophisticated enough to pick out news indicating that certain sources might be suspect or unreliable, it would not take much for data scientists to adjust their algorithms to take these things into account.
“The way you’re modeling the data represents the trustworthiness of a judge for example, you need to retrain the model. He’s not that trustworthy. I have to retrain the data,” she asserts.
Data scientists invest the bulk of their time in feature learning, retraining models until the day AI takes the next big, deep step.
But even ahead of that, Lahiani is certain we have entered a new kind of information age, one that moves from simply dealing with ‘smart’ innovations toward one where we can begin to extrapolate wisdom.
“We’re moving from the information era to the intelligence era where we take the data we have collected for decades . . . and build models to really understand it; to make the internet intelligent and to consume it in an intelligent way.”