Julian Chan is the firm’s Lead Data Scientist. His experience and expertise span developing practical strategies and effective economic and financial solutions as well as working with big data. Here, Dr. Chan talks about data science, big data, and why we approach data-heavy matters with a different mindset.
Q. What is data science?
A. There’s no exact definition of data science. It’s a combination of statistics, computer science, and subject matter expertise (finance and economics, in our context). Data science is about solving problems and seeking answers with data. Bates White has been using data to analyze problems and find solutions for a long time. But data science has become more popular in recent years because of the availability of large amounts of data (and the insights it can provide to help inform litigation strategies), improvements in computing power, and the accessibility of tools. For example, cloud infrastructure and software today allow us to analyze data sets that are too big or too complex to analyze using traditional hardware. Similarly, data science tools allow us to broaden the definition of data to include things like text and other documents.
Q. Does Bates White approach analyses that involve big data any differently?
A. No and yes. No in the sense that the main services we provide are similar regardless of the size of the data: we provide expert opinions using the data and help clients extract valuable insights from it. Yes in the sense that the methods we use to analyze big data are different. For example, it is infeasible to analyze 1 terabyte (TB) of data with Stata (a common tool for economic consulting analysis), but it is not only feasible but quite efficient to process 1 TB of data using big data tools like R and Spark. Advancements in machine learning and artificial intelligence allow us to process more complicated data such as text and images, and that new technology also allows us to solve more complex problems for clients.
The volume of data does play a role in planning work on a case: it influences the best tools and software to use. When we have 50–100 gigabytes, we need to start considering different tools than we use for smaller amounts of data, because older tools can slow down our analysis with larger data sets. It’s like shoveling snow: you’re going to use a bigger tool to clear a football field than to clear your front sidewalk.
Q. Can you expand on that a bit?
A. The computational difficulty grows faster than the data itself; if the data size doubles, it is often more than twice as hard to process. As I said, in data-heavy matters, the end goal is the same as with “regular” matters: learning valuable information from the data to solve our clients’ problems. What’s different in these matters is that we have to consider different or additional tools to analyze the big data as efficiently as possible. That includes determining the best tools and procedures, spotting potential problems ahead of time, and even budgeting: understanding how much time we need for computer run time and processing, what kind of technical support we need, and how to make sure our clients understand the cost implications of these choices. When we have a lot of data, a small improvement can make a big difference in our analysis and in the cost of completing that analysis.
For example, we have adopted data science tools like Spark and Python on our matters to analyze data much more efficiently than we could using traditional tools. With Spark, we can process hundreds of TB of data quickly and efficiently, whereas more traditional tools would crash or bog down just trying to open data of this size. These improvements give us more time to work on the analysis, improve the quality of the work, and deliver value for our clients.
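To give a sense of what that looks like in practice, the short sketch below shows a minimal, hypothetical PySpark job; the file paths and column names are invented for illustration and are not drawn from any actual matter.

    # Minimal PySpark sketch: summarize a data set too large for a single machine.
    # The paths and the "customer_id" / "amount" columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("large-data-summary").getOrCreate()

    # Spark reads the data lazily and spreads the work across the cluster,
    # so the full data set never has to fit in one machine's memory.
    df = spark.read.parquet("s3://example-bucket/transactions/")

    summary = (
        df.groupBy("customer_id")
          .agg(
              F.count("*").alias("n_transactions"),
              F.sum("amount").alias("total_amount"),
          )
    )

    summary.write.mode("overwrite").parquet("s3://example-bucket/customer_summary/")

The same few lines run essentially unchanged whether the input is a few gigabytes or many terabytes; Spark decides how to partition the data and distribute the work across the cluster.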
Q. Your role as Lead Data Scientist in the firm is unique. How do you leverage your expertise in this role?
A. Besides economics and econometrics (statistics), I’ve always been interested in computer science, which allows me to look at big data problems from a different angle. That combination of economics, statistics, and computer science knowledge helps me ask the right economic questions and identify the right statistical model and computer science tools to answer the questions.
As an economist, I work with the teams to provide economic opinions to clients. As a data scientist, I enjoy helping other economists with computational challenges, identifying the appropriate computer science methods to overcome big data obstacles. It’s not just a technical perspective. It requires that I understand what kinds of problems we are seeing and what the client wants, so I can help the team find the right approach to solve the problem and plan accordingly.
In this role, I also lead the Data Science Committee, which helps the firm in different ways, especially by identifying and understanding next-generation data science tools and diffusing knowledge about them. We have found that our employees are able to build on their existing data analysis skills and quickly develop expertise with the data science tools. One sign of this has been the progression over the past couple of years: I used to have to work on effectively all of the firm’s big data matters, but as we take on more and more of these cases, I can pass my experience and knowledge on to other team members, who are then well positioned to lead the next ones.
Q. Can you give a couple of examples of matters where you actually used data science tools?
A. Of course. We recently worked on a matter involving an app that tracked location data. We received well over 20 TB of data and decided to use Spark to process and analyze it. The team also optimized the workstream so that the longer run time of our code wouldn’t become a bottleneck for the analysis. Together, these choices gave us more time to work on the analysis, improve the quality of the work, and deliver value for the client.
A slightly older matter that used these big data tools is the ResCap case. In that matter, the scale of the data, which covered approximately 2 million loans, was huge. The analysis required running simulations over time, so each loan became 30 years of data; the simulations were then run 10,000 times, resulting in trillions of observations. Distributing the computation across a cluster of computers allowed us to significantly speed up the simulations and statistical computing. That made it possible to run a complicated analysis efficiently and to meet the court deadline and the client’s budget.
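As a rough, simplified illustration of the general idea (not the actual ResCap model or code), distributing a loan-level simulation across a cluster with PySpark might look something like the sketch below; the loan fields, default probability, and simulation logic are placeholder assumptions.

    # Simplified sketch of a distributed loan-level Monte Carlo simulation with PySpark.
    # The loan data and default logic are hypothetical placeholders, not the case model.
    import random
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("loan-simulation-sketch").getOrCreate()
    sc = spark.sparkContext

    N_DRAWS = 100   # the real analysis used far more simulation draws
    N_YEARS = 30    # each loan is simulated over a 30-year horizon

    def simulate_loan(loan):
        """Estimate one loan's default rate by simulating its 30-year path many times."""
        loan_id, annual_default_prob = loan
        defaults = 0
        for _ in range(N_DRAWS):
            for _year in range(N_YEARS):
                if random.random() < annual_default_prob:
                    defaults += 1
                    break
        return (loan_id, defaults / N_DRAWS)

    # In practice the loan portfolio would come from the case data; this is a toy example.
    loans = [("loan-%07d" % i, 0.01) for i in range(10_000)]

    # Spark splits the portfolio across the cluster so each worker simulates a slice
    # of the loans in parallel, rather than one machine doing all the work serially.
    results = sc.parallelize(loans).map(simulate_loan).collect()

Because each loan’s draws are independent, the work parallelizes cleanly, and adding machines to the cluster can shorten the run time roughly in proportion.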
Q. Do you have any concluding thoughts?
A. The value of data science—what I do—is in picking the right tools for the data and the situation. The more options I can offer, the better the team can perform for the client. The technical aspect is important, but it’s not the most important consideration. As a data scientist and economist at Bates White, I hope my “data scientist” role can keep the team from having to worry about technical aspects so they can focus on what they do best.