The Role of Big Data in Trade Surveillance and Market Compliance

Aired: Tuesday 22 November 2016 | 14:00 GMT

Today’s European financial markets hardly resemble the ones from 15 years ago. The high speed of electronic trading, the explosion in trading volumes, the diverse range of instrument classes and a proliferation of trading venues pose massive challenges.

With all this complexity, market abuse patterns have also become egregious. Banks are now shelling out millions of euros in fines for market abuse violations. 

In this webinar, we discuss how compliance teams are fighting back with Big Data and trying to stay out of regulatory hot water. Rapid response to suspect trades means compliance teams need to access and visualize trade patterns, real-time and historical data, and be able to efficiently perform trade reconstruction at any point in time.

Join Hortonworks and Arcadia Data in this webinar, where we cover the use cases at a Top 25 Global Bank that now has a deep forensic analysis of trade activity.

In-depth expert presentations by:

  • Paul Isherwood, Head of Platform Development at Lloyds Banking Group
  • Vamsi K Chemitiganti, GM - Financial Services at Hortonworks
  • Shant Hovsepian, Co-Founder and CTO at Arcadia Data

Duncan Paul: Hi. Good morning. Good afternoon. I'd like to thank you all for taking the time to register and attend today's webinar on real-time trade surveillance in financial markets, brought to you by Arcadia Data and Hortonworks.

My name is Duncan Paul. I run Arcadia Data's business here in EMEA. I'm based in London and I'll be your host for today's proceedings. I'm [00:00:30] thrilled to introduce a great panel who will contribute rich business, technical, and practical experience to today's webinar. First, Paul Isherwood, Head of Platform Development, Financial Markets at Lloyds Banking Group. Paul's day-to-day role is to ensure that the financial markets platform serves all stakeholders, from front office through to support functions, including the compliance function, and he is therefore responsible for ensuring that [00:01:00] both the current and anticipated future needs pertaining to trade surveillance are met and fit for purpose for all parties.

Next, Vamsi Chemitiganti, General Manager - Financial Services at Hortonworks. And, finally, Shant Hovsepian, who is Arcadia Data's founder and CTO.

The format for today's session will follow this agenda. First, Paul will outline [00:01:30] the business challenge manifested by the regulatory burden imposed on financial service operators in the marketplace and how this is driving operational changes in terms of data architecture. And then, describe opportunities for real business transformation.

Next, Vamsi will focus on the demand drivers for Big Data in capital markets and why the financial services industry is investing in a more flexible, cost effective data management and analytics platform based [00:02:00] on Hadoop for the scale and the variety of sources that you can now harness.

After that, Shant Hovsepian and Vamsi will both introduce the Arcadia Data and Hortonworks joint visualization and application solution and walk through how this technology is applied in a customer use case.

If you have any questions for any of the panelists, then please do submit them via the chat window. Time has been set aside at [00:02:30] the end for questions, and any questions that aren't picked up during today's webinar will be replied to following the session. So rest assured, we will get to all the questions whether live or otherwise.

So please allow me to introduce, Paul Isherwood.

Paul Isherwood: Thanks Duncan. Thanks for the introduction. I appreciate the invite from Arcadia Data and Hortonworks to speak today. I want to kick off the session maybe with just a slightly deeper view as to who I am, what my team does, so Duncan, maybe you can roll forward [00:03:00] to the first slide. That's great.

So platform development, kind of what do we do and who are we? So really we're a group of, what I would call, commercial technologists that sit within the main business function, reporting into the financial markets COO. So in terms of what financial markets does and where it sits within the commercial bank, we form part of the overall commercial bank, which has several entities. So it's a combination of financial [00:03:30] markets, the transaction banking franchise, our CB lending franchise, and the capital markets business. Financial markets is centered in London, but has a presence in both the U.S. and APAC. And we're very much a FICC business. So product and trading capability in rates, credit, FX, money market and repo, in both cash and derivatives across those product spectrums, with a cross-asset-class sales force cross- [00:04:00] selling to our customer base.

So in terms of what really the PD, the platform development, responsibilities are, it's really our job to understand the CB and financial markets strategy in terms of product, client, and geographies, and I guess translate that strategy into platform evolution, whether that be people and skills, process and culture, or technology. And to ensure that we have the right programs and budget in place, aligned to delivering the uplift in capability [00:04:30] we need.

So on a day-to-day basis we work across front office in terms of trading and sales, the support functions, tech and change, and other LBG elements, whether that be the wider group's move to digital, to deliver capability and maximize the budget and synergies that we have available to us.

That gives a little bit of an introduction as to who I am and maybe gives a flavor of some of the opinions and perspectives I have going forward into the presentation. So Duncan, if you could maybe roll forward to the next slide.

[00:05:00] So, regulation, regulation, regulation has kind of been the watchword for the past couple of years, and the word cloud that we've got here is just kind of a non-exhaustive list of regulatory regimes that we're being faced with at the moment. And to kind of give a bit of background, over the last five years financial institutions have incurred losses from [00:05:30] rogue trading incidents and have been investigated and fined over allegations across a range of market abuses, looking at the interbank and FX market manipulation. I think globally to date, and this is probably an out-of-date figure, there's been circa $19 billion worth of fines, and if you're looking at kind of the UK space, the FCA alone has issued in excess of 1.4 billion pounds worth of fines relating to these two issues between 2013 and 2015.

So the [00:06:00] current landscape is that the regulator is increasingly expecting banks to monitor communications and trading activity to help identify and prevent future instances of market abuse. In terms of how this has manifested, there's a raft of regulation which has been issued by a number of regulatory authorities, with MAD/MAR and MiFID II kind of being a couple that we'll maybe focus on today, bringing those two examples to light and how that kind of relates [00:06:30] to financial markets businesses and trade surveillance. And really the kind of key watchword that the regulators are looking to kind of refocus on is transparency. That's kind of the one word that you would use to articulate what the regulators are looking to gain.

With that in mind, let's take a look at the example of the European regulations and how that's had an effect on us recently, and some of the regulations which are kind of coming to light [00:07:00] over the next couple of months and years. So if we take a look at, kind of, MAR and MiFID II, MAR essentially was effective July 2016, with some provisions actually not taking effect till after the MiFID II date, MiFID II kind of taking effect as of January 2018. Both pieces of regulation actually existed in a prior form, so the Market Abuse Directive was adopted in '03 and the Markets in Financial [00:07:30] Instruments Directive was effective as of November '07. But I think it's safe to say that in those times, the regulatory environment was a little bit more laissez-faire in nature. And probably more skewed towards the cash equities business in terms of its slant.

But post the financial crisis, OTC and derivatives really have come very much into sharp focus, so these regulations were very much beefed up. And these current iterations are really looking to, I guess, satisfy [00:08:00] some key objectives. And that's really around strengthening investor protection, reducing the risks of disorderly markets, reducing systemic risks, increasing the efficiency of financial markets, and reducing unnecessary costs for participants of those markets.

So in terms of how that's really manifested itself, with kind of MAR and MiFID II, there's a number of new obligations that are already kind of hitting financial markets. So there's a real need now to increase the storage of data, whether it be orders, executions, [00:08:30] or prices that you need to hold. And certainly for a significant period of time. I think it's a minimum of five years and potentially up to seven years for regulatory investigations. Price transparency, in terms of issuing prices to clients who are of a similar class. The SI regime, which has come into play with MiFID II, and trading on venues, MTFs and OTFs. And certainly a higher focus on communications monitoring, whether that be voice or e-comms.

And [00:09:00] one of the other key things is the ability to really reconstruct any particular event at any point in time. So that's really, kind of, encapsulated within the trade reconstruction elements of MiFID II. And it's pretty similar in some respects to the market abuse frameworks that firms now have to introduce. And that's typically classified as STORs, so the suspicious transaction and order reports that you need to issue to regulators in the event of any market abuses that you see. [00:09:30] And actually, the final point is really the regulatory response times. I think that it's pretty clear now that the regulators are expecting a much shorter turnaround time in terms of receiving responses to regulatory requests. I think the CFTC are kind of leading the way there with 72-hour response times in certain instances to regulatory requests. So it's very clear that firms need to have access to the data. [00:10:00] And they need to have the right frameworks in place to respond in a very, kind of, agile way, without delay, to any regulatory requests.

So in terms of moving on to the next slide please, Duncan. So how do we look to respond to these challenges? And, as I say, transparency is really the key word that the regulators are going to bring to bear here. So just kind of think here about what data do we need to capture to reconstruct an event at a moment in time? And really this [00:10:30] is talking about capturing as much data as we possibly can.

So if we were looking to reconstruct a point in time in our financial markets in terms of what our trading business was doing, you'd need to be able to capture your current risk positions, what prices you're currently streaming or responding to on an RFQ, what your current P&L position was, what orders you are working, what news events are out in the market, etc, etc.
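As a rough illustration of the point-in-time reconstruction described above, here is a minimal Python sketch. It assumes each data type (risk positions, streamed quotes, and so on) is kept as its own timestamp-ordered stream and that a snapshot is simply the latest value of every stream at or before the requested time. The class name `EventStream`, the function `reconstruct`, and the sample values are hypothetical and are not taken from the webinar.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class EventStream:
    """Timestamp-ordered history for one data type (positions, quotes, P&L, ...)."""
    name: str
    times: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def record(self, ts: datetime, value) -> None:
        # Insert in timestamp order so late-arriving events are still placed correctly.
        i = bisect_right(self.times, ts)
        self.times.insert(i, ts)
        self.values.insert(i, value)

    def as_of(self, ts: datetime):
        # Latest value recorded at or before ts, or None if nothing is known yet.
        i = bisect_right(self.times, ts)
        return self.values[i - 1] if i else None


def reconstruct(streams, ts):
    """Snapshot of every stream as it stood at time ts."""
    return {s.name: s.as_of(ts) for s in streams}


# Illustrative (made-up) data for a single morning.
risk = EventStream("risk_position")
quotes = EventStream("streamed_quote")
risk.record(datetime(2016, 11, 22, 9, 0), {"EURUSD": 5_000_000})
risk.record(datetime(2016, 11, 22, 9, 30), {"EURUSD": 1_000_000})
quotes.record(datetime(2016, 11, 22, 9, 15), {"EURUSD": (1.0621, 1.0623)})

snapshot = reconstruct([risk, quotes], datetime(2016, 11, 22, 9, 20))
early = reconstruct([risk, quotes], datetime(2016, 11, 22, 8, 0))
```

In a real platform the streams would live in the data lake rather than in memory, but the as-of lookup is the same idea: every source is queried for its state at one common timestamp.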

So it really means that the architecture, the data architecture [00:11:00] that you need to build, has really got to be able to scale for an enormous amount of data. And very, very various types of data coming into the stack. And it's really got to be quite dynamic in terms of being able to evolve with market practices. And how those different data stores start to come to bear. And then we can start to use, I guess, those data inputs to really kind of understand, and draw correlations, and weightings to how these various events kind of impact how we execute [00:11:30] our business.

And that kind of lends itself to, it speaks to, the machine learning capabilities that we're coming to later on in the deck that the Big Data architecture's going to give to us. So really one of the things we were thinking of here at Lloyds is, you know, how do we really start to understand what is abusive and what is, kind of, not abusive behavior?

So how do we create, I guess, a baseline of normal distribution, should we say, in terms of what is and isn't abusive? So, if we look at certain [00:12:00] asset classes, particularly exchange-traded products, it's pretty easy and pretty understandable for us to be able to create a kind of normal distribution of volumes in trading activity. But I guess in the OTC space, it's a slightly more difficult and challenging arena to start creating those views on the market. And of course, you've got to start looking through different lenses. So, how does our trading business function when we've got flat risk? How do we operate generally in terms of price distribution, [00:12:30] and who are we communicating with, in the event of us carrying a lot of risk and us trying to reduce that risk back out to the market? How do we behave? And what are our communication patterns around that? And in referencing the communication patterns, can we create a social media style kind of graph model of who our traders normally talk to in various instances, to maybe identify where there's some abnormal communication patterns?

So there's a real kind of challenge there in that particular space. Because that, [00:13:00] of course, ranges from email communications and chats to voice as well. So when you start to create a kind of normal distribution baseline, that certainly will evolve over time and, of course, that also needs to respond to the conditions in the market that we're dealing with and any kind of news events that we're handling at the time.
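The graph idea for communication patterns can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: messages are reduced to (sender, recipient) pairs, the "graph" is a count of how often each pair has been seen, and a contact is considered unusual when it has little or no history. The function names, the `min_history` parameter, and the participants are all hypothetical.

```python
from collections import Counter, defaultdict


def build_contact_profile(messages):
    """Count how often each sender contacts each counterparty.

    messages: iterable of (sender, recipient) pairs drawn from email, chat or voice logs.
    """
    profile = defaultdict(Counter)
    for sender, recipient in messages:
        profile[sender][recipient] += 1
    return profile


def unusual_contacts(profile, new_messages, min_history=1):
    """Return new (sender, recipient) pairs seen fewer than min_history times before."""
    flagged = []
    for sender, recipient in new_messages:
        seen = profile.get(sender, Counter())[recipient]
        if seen < min_history:
            flagged.append((sender, recipient))
    return flagged


# Made-up history and one day's traffic.
history = [
    ("trader_a", "sales_1"),
    ("trader_a", "sales_1"),
    ("trader_a", "broker_x"),
    ("trader_b", "sales_2"),
]
profile = build_contact_profile(history)
today = [("trader_a", "sales_1"), ("trader_a", "trader_at_other_bank")]
flagged = unusual_contacts(profile, today)
```

A production model would also condition on context, for example the risk the desk is carrying or the news flow at the time, as the speaker notes, rather than on raw contact counts alone.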

So you start to create a kind of normal view of what your business is. And then you can start to understand, well, in terms of what are considered abusive behaviors, [00:13:30] whether it be an abusive squeeze, or creating an artificial kind of floor or ceiling to the market, or potentially colluding, you can then start to identify from that normal distribution where we are kind of spotting the abnormalities. In the OTC space, it's pretty difficult, in some respects, to really kind of start to paint that holistic picture. But it's a challenge the regulator has set, so we need to start to be able to understand, [00:14:00] in certain situations, for example, in the swap space when we're responding to a quote, what is our behavior in the futures market at the time? Can that start to drive up the curve to the detriment of the client?
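One simple way to express the "baseline, then spot the abnormalities" idea is a z-score against each instrument's own history. The sketch below is an assumption-laden illustration, not the bank's method: it treats daily volumes as roughly normally distributed, uses a fixed threshold of three standard deviations, and the instrument names and volumes are invented.

```python
from statistics import mean, stdev


def volume_zscores(history, today):
    """z-score of today's volume per instrument against its own history."""
    scores = {}
    for instrument, volume in today.items():
        past = history.get(instrument, [])
        if len(past) < 2:
            continue  # not enough history to form a baseline
        mu, sigma = mean(past), stdev(past)
        if sigma == 0:
            continue  # flat history: a z-score is undefined
        scores[instrument] = (volume - mu) / sigma
    return scores


def flag_abnormal(scores, threshold=3.0):
    """Instruments whose volume sits more than `threshold` deviations from baseline."""
    return sorted(name for name, z in scores.items() if abs(z) > threshold)


# Made-up daily volumes.
history = {
    "BUND_FUT": [100, 110, 90, 105, 95],
    "EUR_SWAP_10Y": [20, 22, 18, 21, 19],
}
today = {"BUND_FUT": 104, "EUR_SWAP_10Y": 60}

scores = volume_zscores(history, today)
flagged = flag_abnormal(scores)
```

As the speaker points out, the baseline has to evolve with market conditions, so in practice the history would be a rolling window and the threshold would be tuned per asset class.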

So there's a lot of information that we need to be able to join together, stitch together, both pre and kind of post any form of execution capability. The next point, in terms of first line of defense: trade surveillance solutions are often kind of pinned toward [00:14:30] the compliance end of the spectrum.

But I think that certainly something that's worthwhile considering is that this isn't just a compliance job. The front office really is a key stakeholder in this space. The front office has always been accountable for running its business responsibly. And in doing so, the front office assumes accountability, acting as a first line of defense for the firm and stopping potential issues at source. And certainly with the Senior Managers Regime, [00:15:00] front office staff may actually be personally liable for wrongdoings. So there's a vested interest in the front office to really buy into the whole trade surveillance solution because, at the end of the day, they have a huge amount of knowledge and are key stakeholders and a source of information to identify abusive and non-abusive behavior. But in saying that, there's clearly a fine line in that the front office can't mark its own homework, so it's a case of compliance and front office working together to really understand [00:15:30] where there's genuine information being fed back from front office, to avoid false positives being triggered and flagged, versus, you know, compliance understanding where there's genuine abusive behavior.

And in terms of bringing in, sort of, business SMEs and other data, you know, the one key thing that, I guess, everyone on the call will kind of understand is that there's an enormous amount of data that is now required to be captured. And your [00:16:00] data architecture really should allow for, I guess, a set of data scientists, shall we say, to have access to and usage of all that data and the data types. And that really starts to maybe challenge some of the data governance policies, in that you essentially need kind of superuser-style access to the data to maximize the learnings.

And I think that on the subject of data scientists, it's a real kind of hot topic, a hot job at [00:16:30] the moment in terms of exposure in the market. In terms of, really, what a data scientist means, maybe in this space, I think it's not just necessarily someone with tech experience. It's really someone who's got a good business background and is really applying the technologies, and using these technologies to be able to solve the business problem.

And there's maybe one thing that is worth considering. We do have, certainly within Lloyds, and I'm sure within the other organizations that [00:17:00] are represented on the call, quant research and quant dev teams. Possibly an avenue that's worth exploring is to think about how you maybe repurpose some of that sort of capability to look at moving toward providing a data science modeling capability as well as pricing and risk management models.

And that actually then feeds potentially quite nicely into any form of model governance structures and processes that you have. I think that's an area where, certainly pertaining to owners, [00:17:30] there needs to be a clear understanding of, you know, what the models are doing, the algorithms that you're using, how they're performing, and how efficient they are. And make sure that you've got the right level of governance and compliance around those models as well. In terms of bringing process to the data, as I said, there's an enormous amount of data, so we need to kind of think about, almost, the genesis of the Big Data world, where MapReduce was really thinking about bringing the process to the data rather than shifting data around.

I think the financial institutions have got a great history of moving [00:18:00] hundreds and hundreds of millions of rows of data from point A to B to C, whether that be from front office risk through to market risk, through to P&L, and finance, etc, etc. So really we need to kind of adopt this mantra of bringing the process to the data. Because it's just too much to move around, really.

In terms of machine learning, and how we can kind of leverage that, I mean, I think machine learning has really added a new kind of dimension to Big Data. I mean, Big Data, kind of on its own, is [00:18:30] kind of a bit like being thrown a telephone directory and being told to go make sense of that. But really the advent of machine learning capabilities starts to take it to a new dimension.

And I guess, really, just to kind of give a really brief overview, we have a kind of graphic on the side. But really the kind of machine learning capabilities are split into kind of a supervised and unsupervised model paradigm. In the supervised world, you kind of know the inputs and the outputs. The desired outputs are kind of provided. They're understood. And [00:19:00] the model is trained such that, when presented with a new set of inputs, it will generate a reasonable prediction. In the unsupervised space, there kind of is no, you know, teaching element to it, or labeling of data, and consequently you're leaving the model to find patterns or discover groupings from the input data, I guess, in isolation.

So in terms of where that kind of leads us, in terms of trade surveillance and what models you look to leverage, I think it's fair to say that we probably start off with [00:19:30] the simpler end of the spectrum. So we're probably looking at kind of the supervised model end of the spectrum, and probably looking at regression analysis such as linear and logistic regression, or maybe some classification algorithms such as Naive Bayes or support vector machines.
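To make the supervised end of the spectrum concrete, here is a small logistic regression written in plain Python and trained by batch gradient descent. It is a teaching sketch only: the two features (think of them as a normalized cancel ratio and a normalized order size), the labels, the learning rate, and the epoch count are all assumptions chosen for illustration, and a real project would more likely use an established library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this labeled example.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict_proba(weights, bias, x):
    """Estimated probability that x belongs to the positive (flagged) class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Made-up training set: label 1 means a past alert was confirmed as suspicious.
rows = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.8, 0.9], [0.9, 0.7], [0.85, 0.8]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic(rows, labels)
p_suspicious = predict_proba(weights, bias, [0.9, 0.8])
p_benign = predict_proba(weights, bias, [0.1, 0.1])
```

The labeled outcomes are what make this supervised: the desired outputs are provided up front, and the model is asked for a reasonable prediction on inputs it has not seen.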

And then as you get more experience, you can maybe start to move into the unsupervised view of the world, you know, leveraging some kind of clustering, k-means or expectation-maximization, or principal [00:20:00] component analysis, which really kind of sits in the dimensionality reduction space as far as algorithms go. But I think the key kind of thing is, you know, start small, start kind of at the easy end of the spectrum. And really it's a case of, you know, train, learn, fail, and iterate, really. But ideally fail small and fail fast.
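For the unsupervised side, the sketch below uses plain k-means as one example of a clustering technique: no labels are supplied, and the algorithm is left to discover groupings from the inputs alone. The data points, the choice of two clusters, and the fixed random seed are illustrative assumptions.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (centroids, cluster index for each point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Made-up two-feature observations forming two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (9.0, 9.0), (9.2, 8.8), (8.9, 9.1)]
centroids, assignment = kmeans(points, k=2)
```

Which cluster is "normal" and which is worth a closer look is not something the algorithm decides; that interpretation still needs the business knowledge the speaker emphasizes.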

Then moving on to really the kind of tooling and visualization. The fact that we've got a nice, shiny new data lake is not [00:20:30] much good if you can only generate kind of ones and zeroes, because there are limited people that can really understand and access that data. So it's more around, we need really, really rich visualization capability and tooling to be able to tell the story of what we're kind of seeing in front of us at any point in time. And whether that be for, you know, a very technically savvy user or a less technically savvy user. So we can maybe roll forward on to the next slide, Duncan.

So, [00:21:00] ROI opportunities. This is kind of speaking to, yeah, we've got a lot of budget. I'm sure a lot of you on the call have, you know, multiple-zero kind of budgets attached to regulatory and regtech type initiatives. But if you're in a conversation with either ExCo or C-suite individuals, they certainly don't have any appetite for regulatory risk. And they most certainly don't like orange jumpsuits.

But really the kind of question that comes up is, how can we start delivering against this budget, and not just for regulatory [00:21:30] compliance and keeping our firm and our colleagues safe? How can we start to, you know, look at this data and generate some kind of value-add services on the back of it? So, coming back to the fact that we've collated an enormous amount of data, there's clearly some opportunities whereby, you know, we can start to give the business the opportunity to really be forensic about how their business operates. So transparency, in [00:22:00] terms of the business and how it operates, is very much, kind of, nirvana for the heads of business as well as for the regulators.

So, can we start to analyze client and customer behavior? Do we get any signals from that customer behavior? Is there anything we can draw upon from that, to potentially offer a better or different service? You know, what are the kind of market and asset class dynamics that we see, potentially in terms of order depth or liquidity profiles across certain [00:22:30] venues? And how are asset classes either correlated or co-integrated? And how does that potentially look to change, maybe, our hedging? Or our algo optimization? And also as a diagnosis tool. So if we're starting to store tick-by-tick data, you know, where do we see our prices start to diverge maybe from the rest of the pack or from the rest of the street, you know, in normal circumstances or under stressed conditions? And that can start to lead to some insights in terms of our, kind of, pricing algorithms at the baseline level. [00:23:00] And if we start to include, kind of, sentiment analysis, what can that start to, kind of, give us in terms of maybe some predictive analytic capabilities and capturing some kind of signals from unstructured data?
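The diagnosis use case, seeing where our prices diverge from the rest of the street, can be sketched as a comparison of our mid price against the median of other dealers' mids on each tick. Everything specific here is an assumption made for illustration: the use of the median as the "street" consensus, the 2 basis point threshold, and the sample ticks.

```python
from statistics import median


def divergence_bps(own_mid, street_mids):
    """Distance of our mid from the median street mid, in basis points."""
    consensus = median(street_mids)
    return (own_mid - consensus) / consensus * 10_000


def divergent_ticks(ticks, threshold_bps=2.0):
    """ticks: iterable of (timestamp, own_mid, [street mids]).

    Returns (timestamp, divergence in bps) where |divergence| exceeds the threshold.
    """
    out = []
    for ts, own_mid, street_mids in ticks:
        d = divergence_bps(own_mid, street_mids)
        if abs(d) > threshold_bps:
            out.append((ts, round(d, 2)))
    return out


# Made-up tick data: our mid, then the mids seen from three other dealers.
ticks = [
    ("09:00:00.100", 1.06220, [1.06218, 1.06221, 1.06223]),
    ("09:00:00.350", 1.06290, [1.06222, 1.06224, 1.06226]),
]
outliers = divergent_ticks(ticks)
```

Run over stored tick-by-tick history, the same check can be split by normal and stressed market periods to see whether a pricing algorithm drifts only under certain conditions.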

And again, because we have got all this data, we can start to really offer some real-time BI/MI tools to the heads of trading, sales, and risk managers, really getting them to understand on a minute-by-minute or hour-by-hour basis, you know, what their business is doing, who they're selling product to, where [00:23:30] we're seeing maybe deviations from, you know, the product mix in terms of, you know, what our strategy is. So, it kind of really starts to give an indication of what the businesses are doing, because we have a lot of that data available and in place for the users. And then we can really start to think about, if we do have access to this data, you know, is there actually a new kind of service opportunity here?

So, can we start to think about, you know, data as a service to our customers, whether that be [00:24:00] providing customers access to data, maybe on a monetized basis, or giving them an idea of, you know, how they're performing versus their peers? And potentially for financial markets to use this data as a vehicle for generating customer flow and monetizing that data out to our customer base.

And I think that finally, the real kind of thing is that, if we start to put in place [00:24:30] this data lake infrastructure, we actually have a fantastic educational resource. So, if we're starting to collate and be able to record, and providing the right tools and provisioning on top of this data, it's really facilitating, you know, the business really kind of diagnosing itself, whether that be a graduate coming to the desk or a more junior trader, or even a more senior trader, in terms of how we've operated on a day-by-day basis. It really starts to give a fantastic educational resource [00:25:00] as to what we've done, why we've done it, and when we've done it. And actually, should we look at maybe doing things differently?

So in the interest of time, I think that sums up my part of the presentation. Duncan, I'll hand it back over to you to take us through the next slides.

Duncan Paul: Thanks very much, Paul. That was fantastic. We've actually just received a question from the group. What I'll do is, I'll just use this really to get everyone's minds working on potential questions that they can [00:25:30] ask. But the question here is: what organizational model should firms adopt to ensure that roles and responsibilities for trade surveillance are understood and that tools, in effect, keep pace with market development? So I'll throw that one out there. We'll keep it to the end. And then I'll move forward now with Vamsi.

Vamsi Chemitiganti: Thanks, Duncan. And thanks, Paul. That was pretty enlightening to hear [00:26:00] the side of a practitioner. And what I'll do is take a few minutes to kind of first take a step back within the larger banking industry, and talk about what we're seeing out there as Hortonworks, working with large global banks like Lloyds and a whole slew of others, from a use case and a focus area standpoint. And then I'll spend a minute or two on capital markets before we start, and then we can dive deeper into [00:26:30] technology after that.

So, from an industry standpoint, I would say banking is probably at the forefront of, you know, harnessing Big Data to drive better customer insights, to manage risk better, and to also meet the regulatory mandate as we've been discussing. From my standpoint, when we work with banks in the industry, obviously it's such a massive vertical. But there's subverticals in the large vertical.

And the focus for today is capital markets, but in addition to that, [00:27:00] in areas like retail banking, credit card providers, payment networks, corporate banking and lending, wealth and asset management, and even stock exchanges and areas like hedge funds, there's a massive usage of Big Data technologies starting to show up more and more. And the focus areas, no matter what segment of banking you operate in, as shown on the right side, are increasingly common use cases as we work with banks.

So, in the areas of risk management, using Big Data [00:27:30] technologies to essentially pull data and process data to do market risk calculations, credit risk calculations, Basel risk, FRTB reporting, liquidity risk reporting, etc, etc, are very common use cases. We also see an increased focus on using Big Data for compliance-based workloads, so things like, not just financial reporting, but also anti-money laundering compliance is a big use case, because you're talking [00:28:00] about, essentially, onboarding terabytes of data that span a few years of transactions. And then being able to run a lot of wide and deep analysis on the data to look for any money laundering or any such compliance violation that may be out there, to file your suspicious activity reports.

One of the other interesting points about where Big Data's being adopted is that it's not just in a defensive posture. So areas like compliance, risk, fraud, they're what I like to think of as defensive in the sense that [00:28:30] the banks are, essentially, trying to protect their consumers and themselves from bad actors, whether they be external or internal. But there are also areas like digital banking, where banks are trying to do more things from the perspective of providing services that span multiple channels to customers, and also to offer a seamless customer experience.

We're starting to see more and more usage of Big Data around the whole digital transformation keyword. As well as [00:29:00] things around providing a single view of the customer so that you can do both offensive and defensive things, like understanding client benchmarking, doing things like assigning risk rules and metrics, but also trying to understand what are the segments that this customer might be falling in, whether institutional or private. And then what new products would they be in the market for? And what are the next best actions that the agent or the salesperson can take to ensure a better and smoother customer [00:29:30] experience and higher lifetime value to the bank itself? What is a common thread across all of these areas is predictive analytics. And, really, predictive analytics, data science, and machine learning did exist before the advent of Big Data, but the ability to mine larger data sets and the ability to run analytics in place on the data, as Paul pointed out, is a huge advantage that banks are seeing with the advent of Big Data. So, Duncan, if you move to the [00:30:00] next slide.

The three key value drivers. Obviously, you hear a lot about this in banking, specifically around Big Data. Obviously you have the volume challenge, where the data sets are now larger. And these are data sets that are newer types of data. So, some of the things that Paul mentioned: being able to track chat messages, being able to take call data records, and to be able to analyze them. When you take all of this newer [00:30:30] data, newer types of data and huge volumes of data, data that's either at rest or in motion, produced as a byproduct of regular transactions, you have the need to apply more powerful analytics to the data. And these analytics can basically range from in-memory analytics to complex event processing, or analytics that are performed on data at rest, on years' worth of historical data.

So, one of the key things that Big Data and Hadoop have brought into the market [00:31:00] is the ability to apply any kind of analytics to any kind of data. So you're not just talking about traditional batch mode analytics that you'd see in an EDW or ETL technologies, but you're starting to see more and more usage of real-time. And also being able to run models on data in real-time and to commingle that with years' worth of historical data to, you know, predict fraud or create a business report or basically, [00:31:30] you know, understand what the credit risk buckets are that your traders can trade in. So again, we're really seeing the opening up of the whole range of business analytics and capabilities that can be run on this heterogeneous data that's inside a Hadoop data lake.
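A toy version of commingling real-time data with a historical baseline is sketched below in Python. It keeps a sliding time window over a live stream of order timestamps and compares the count in the window with an average that is assumed to have been computed offline from historical data. The class name, the window length, the historical average, and the alert multiple are all hypothetical; a production system would use a stream processing engine rather than an in-memory deque.

```python
from collections import deque


class RollingOrderRate:
    """Counts orders in a sliding time window and compares the count with a
    historical per-window average (assumed to be computed offline)."""

    def __init__(self, window_seconds, historical_avg, multiple=5.0):
        self.window = window_seconds
        self.limit = historical_avg * multiple
        self.times = deque()

    def observe(self, ts):
        """Record an order at time ts (in seconds); return True on a breach."""
        self.times.append(ts)
        # Drop events that have fallen out of the window.
        while self.times and self.times[0] <= ts - self.window:
            self.times.popleft()
        return len(self.times) > self.limit


# Made-up stream: a quiet period followed by a burst of four orders in 1.5 seconds.
monitor = RollingOrderRate(window_seconds=10, historical_avg=1, multiple=3)
alerts = [monitor.observe(t) for t in [0, 4, 8, 20, 20.5, 21, 21.5]]
```

The real-time part is the window over incoming events; the historical part is the baseline it is judged against, which is the combination the speaker describes.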

And the other trend has been, as banks move towards adopting more of the Big Data techniques, the operative principle from a cost standpoint is to use open source technologies, [00:32:00] more and more of these, to essentially move to a model where you get commodity compute on commodity server infrastructures. And using that to basically drive down costs as well while increasing computing capacity. Duncan, next slide.

So taking a little bit of a deeper dive into the capital markets space itself, so, obviously, banking, and within it, capital markets continues to generate insane amounts of data. So these data producers [00:32:30] are ranging from news producers to electronic trading partners [inaudible 00:32:35], to stock exchanges, etc. And when we start categorizing the different use cases or value drivers where banks are finding tremendous applicability of Big Data techniques, we have areas like the front office. So being able to apply a single view of a client across multiple trading desks, and then being able to apply, from a 360 degree [00:33:00] perspective, how you can market to a client as one entity across different channels. That is key to optimizing profits. Because it helps you cross-sell in a landscape that's increasingly competitive. So with the passage of [inaudible 00:33:13] in North America, we're seeing more and more reliance from large firms on flow-based trading rather than prop trading. So basically having a view that can provide you with a real-time profitability analysis as well as a risk analysis [00:33:30] for institutional clients is key.

The second area that is somewhat hidden in terms of, not really client facing, but also being able to create your trade strategies and your new trade models and being able to test them on years' worth of historical data is key in the sense that, here, banks are not just looking to realize the benefits of volume and velocity and variety but also are looking [00:34:00] to do things in real quick time compared to what they were able to do with older technologies. And other ideas are like sentiment analytics, being able to leverage social media feeds, and other market data feeds to drive trading strategies in an automated manner is starting to be more and more popular, especially in areas like commodities like oil and precious metals as well. And then as we discussed there is increasingly a focus on pulling in all the data, and across capital markets standpoint, [00:34:30] to basically do better risk data aggregation and better risk reporting as well as surveillance reporting.

And finally, the one area that I think is very interesting and still evolving is the ability to look at your data in a newer way that you haven't been able to before. So, data monetization: being able to perform any kind of analytics on the data that are deeply mathematical, deeply statistical, and then looking to use all of this data to reimagine the way you're doing [00:35:00] things, or to create new products, has always been an interesting area in retail banking, but in capital markets it's one of the future areas of exploration going forward. Now, with that being said, I wanted to turn it over to my friend Shant here to dive a little deeper in the technology and to talk about the Arcadia Data and Hortonworks solution.

Shant Hovsepian: Thank you very much, Vamsi. Thank you everyone for joining the call. This is Shant from Arcadia Data. And Paul and Vamsi did a great job of [00:35:30] motivating a lot of the business and technological challenges that are faced right now in this new regulatory climate. But I want to spend some time talking about our company, Arcadia. And how we go about solving the data visualization and business intelligence problems that are also related in this space. And really driving and improving the time to insight. So as a quick background about Arcadia Data, we started the company to help create business value from Big Data. We're very much an enterprise-focused company. Lots of [00:36:00] customers in financial services, insurance, across the board. But specifically focusing on global 2000 companies where you have hundreds of concurrent users with high SLAs analyzing extremely large data sets. And the whole value proposition that we bring in is to give you high performance and scalable visualization on your Big Data with absolutely no data movement. Next slide please, Duncan.

[00:36:30] So, let's talk about the challenges of trade surveillance in a Big Data world. Specifically, I'd like to call out the wisdom of ESMA when they defined MiFID II as well as MAD/MAR. They specifically call out data sources that are required to be present in reporting and a lot of the calculations involved from a risk perspective. So, the regulators are explicitly saying, I need this trade book data, I need this specific electronic communication [00:37:00] data from within a specific time window of when the trade happened.
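As an illustration of the time-window requirement just described, here is a minimal Python sketch that picks out the communications falling within a window around a trade. The record layout, the `communications_near_trade` name, and the five-minute window are all assumptions made for this example; they are not taken from the regulation or from either vendor's product.

```python
from datetime import datetime, timedelta

def communications_near_trade(trade_time, communications, window=timedelta(minutes=5)):
    """Return the communications whose timestamp is within +/- window of trade_time.

    The five-minute default is purely illustrative, not a regulatory figure.
    """
    return [c for c in communications if abs(c["timestamp"] - trade_time) <= window]

# Hypothetical sample records, invented for illustration only.
trade_time = datetime(2016, 11, 22, 14, 0, 0)
communications = [
    {"channel": "chat", "timestamp": datetime(2016, 11, 22, 13, 57, 0)},
    {"channel": "voice", "timestamp": datetime(2016, 11, 22, 14, 4, 0)},
    {"channel": "email", "timestamp": datetime(2016, 11, 22, 14, 30, 0)},
]

matched = communications_near_trade(trade_time, communications)
for c in matched:
    print(c["channel"], c["timestamp"].isoformat())
```

In practice the same filter would more likely be expressed as a query against the cluster than as an in-memory list scan; the sketch only shows the shape of the window test.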

So when you're going about trying to solve this trade reconstruction problem to do trade surveillance, you need a massive number of data sources that needs to come into the system. So if you look at the slide right now, and you read it kind of from the bottom up, you see the various operational data sources that you are required to maintain and manage.

A lot of that gets loaded into a Hadoop environment these days. It's, as Vamsi did a great job of articulating, a very scalable [00:37:30] and powerful solution, especially the Hortonworks data platform for both at rest and in-memory data sets. And it's also extremely cost effective and scalable. But a lot of what happens in existing deployments and situations, and Paul alluded to this, is a culture in capital markets of data movement. You have an extract that goes into a different system, potentially a data warehouse. Eventually there's a datamart like system. There's a visualization, or a cubing server on top of that. There's essentially a BI visualization tool, [00:38:00] and eventually your end users.

And what happens with all this data movement is you create various levels of data traps where physically you have multiple copies of the same logical data. So this ends up leading to lots of data summarization and you end up having a fidelity loss on your Big Data and that's extremely important to point out as regulators are also now requiring very high fidelity access to data. Just doing analysis on aggregates is no longer sufficient. It also leads to an environment where it's very hard to collaborate. [00:38:30] And you also have a very high security risk because, as we mentioned, the logical semantic data is exactly the same, however, you have all of these physical copies. So now you need to secure and maintain enablements, or entitlements, excuse me, for those various different data sources. And it ends up being a very large and complex system that has a lot of management and operational overhead. Next slide, please.

So how does Arcadia Data help change this? [00:39:00] Next. Arcadia is a converged analytics platform. What this means is a lot of the complexity and layers that you saw on the previous slide, Arcadia converges into a single web-based software platform that gets deployed directly on the Hortonworks data platform. So it sits right next to the data and ensures there's no data movement. This gives you the ability to visualize historical and real-time data in a single platform. You get a closed loop navigation all the way through [00:39:30] granular data. So you're not stuck just visualizing some recent aggregates. You can do very fast ad hoc and iterative analysis which is essential for lots of regulatory and compliance requirements where you're really trying to do a trade reconstruction of a big point in time block order or you're trying to sort of redraw a swap transaction that may have happened. You really need to kind of go back and look at the data. Try to understand what was in the trader's mind and do some investigative analysis. And as well as having a distributed BI and analytics engine [00:40:00] that can really leverage the power of the Hortonworks infrastructure to do your high performance computing and leverage all the hardware that's there. Next slide.

So what makes Arcadia different? It's a powerful and simplified architecture. Having an on-cluster solution is essential for combining with these large data problems, integrating with machine learning capabilities, and really giving the users and the business users really direct access to the data. And leveraging [00:40:30] the investment in the infrastructure that's already there. You can explore quickly and directly.

There's really a balance here that needs to happen between the ability to do ad hoc, iterative analysis looking at raw transaction level or electronic communication data as well as the ability to do high concurrency data modeling. So the ability to visualize and explore directly but also do modeling and cubing in memory for a fast concurrent access is necessary. And then, of course, as we illustrated time and time again throughout [00:41:00] this webinar, the need for advanced analytics. This isn't the data that you're used to analyzing. These aren't the questions that you're required to answer anymore. The regulators, especially, have created a mandate for very fine grained and high cardinality data sets that really require some advanced analytical techniques.

So Arcadia makes it very easy for the end users in a point and click manner to do a lot of advanced analysis. Everything from combining real-time data sources with free text and search [00:41:30] based data, on unstructured data, and then being able to do fast things like event analysis, behavior based segmentation, correlation, forecasting, regression, fit models, a lot of advanced visual functionality. At this point, next slide, Duncan. I'm gonna hand it off, back to, nope. Sorry.

Just a little bit about the Arcadia and Hortonworks integration. As you see here, Arcadia runs right in the Hortonworks data platform. You can connect directly to the Hadoop cluster. Share, collaborate [00:42:00] the visualizations. You get high performance access to HDP leveraging a lot of the great security functionality that's in place both with Apache Atlas and Ranger. The ability to ensure, since there's no data movement, ensure secure and reliable data access. And also the functionality now that you can deploy in the cloud, on-premise, or in hybrid environments. Arcadia has the application layer on top, as well as component that runs directly next to the cluster to help drive a lot of the analysis. Next slide. And I'm gonna hand it over [00:42:30] back to Vamsi right now to talk a little bit about the use case that we have, that we see with our customers around doing the real-time trade surveillance.

Vamsi Chemitiga: Thanks, Shant. So, so some of the themes that Paul and Shant alluded to are, from a historical banking data architecture standpoint, prior to Big Data are still extremely valid for a vast number of institutions. The current landscape in banking, if you will, there's a lack of centralization [00:43:00] of data. And this leads to repeated data duplication. So if you take an application, typically say a risk data aggregation application as an example, you'll see that the massive degree of data is duplicated from system to system.

That leads to multiple inconsistencies on the data from the summary level as well as the transaction levels. And because of the fact that the data is independently sourced from different systems, there are issues that deal with [00:43:30] data governance, so namely things like data lineage. And what makes it challenging for groups that do analytics is the ability to not just trust the data, but also to be able to have end to end traceability of the data from the report back to the source system. So the other issue that may not, in general, get a lot of highlighting is the fact that there's not just duplication at the data level itself.

There's a lot of duplication at the analytics [00:44:00] scale as well. Because of the fact that different groups within banks perform different risk reporting functions, for instance, credit risk, market risk are typically different groups. Not just the feeds and ingestion of the data themselves but also the calculators also end up being duplicated, you know, in terms of different development frameworks, different programming languages, etc, etc. So in a lot of ways this inhibits an ability to share data easily. And inhibits an ability to have clean data available at quick time. [00:44:30] And also to be able to run reports that are faithful to all the analytics that have been done on them and the sources themselves. So the reconciliation process is typically a large effort, and also presents significant gaps in the quality of the data. So Duncan, next slide.

So from the standpoint of market surveillance specifically. We discussed the fact that the regulations themselves have broad ramifications across a variety of key business functions. These include compliance, [00:45:00] surveillance, compensation policies, etc. But the biggest obstacle, in my mind, is technology. So looking at a large compliance and surveillance architecture, there are a few key business requirements that could be distilled from these business mandates, that drive the downstream IT requirements.

The first one is the need to store a large amount of heterogeneous data. So both MiFID II and MAR mandate the need to perform trade monitoring analysis, not [00:45:30] just on real-time data but also historical data that also potentially spans a few years. So a lot of this data itself is from a range of business systems. Your trade data, valuation data, position data, reference data, [inaudible 00:45:44], market data, client data and other front, middle, and back office data as well as some of the newer types of data that Paul alluded to. Things like voice, chat, and other internal communication.

So to sum up, the first key business requirement that drives the technology architecture [00:46:00] is the ability to store a range of cross asset data. Almost any kind of instrument in a cross format manner structured, unstructured data including voice, and also cross venue data, exchange data, OTC data, etc. And also to be able to store this data with a high degree of granularity.

The second requirement is the data storage itself. All of this data needs to be available for an audit period of about five years. So this implies not just being able to store the data, but also putting in place capabilities [00:46:30] that ensure strict governance of the data and also audit trail management. And also for the recordkeeping requirements, it's important to be able to pull the data in and answer a regulatory query, and depending on the geo, as Paul said, it's probably three days, or 72 hours, in the European Union. In Canada, it's about a week. In North America, in the US, it's about five days. So being able to access data that's in storage, [00:47:00] and to be able to pull that in and to run some analytics on the data to provide a report is key. And the other requirement is to basically be able to do real-time surveillance and monitoring of the data. So, the need to collect data that's freshly produced as a result of trading operations, and to be able to monitor the data within around five seconds, I think, is the time frame that ESMA has called out in the MiFID II paper. [00:47:30] You have to be able to ensure that you can track a trade as it moves from an order to a settled trade. And to be able to look for any patterns in this that imply market abuse violations is key.

And finally, being able to run business rules on the data. So, business rules are essentially an if-then-else construct which looks for a pattern in the data, in the trade data. And to be able to look for any pattern that breaches a threshold is key. To be able to provide that [00:48:00] mode of a programmatic interface. And finally, I think Paul spent a good bit of time educating us about the need to use a variety of supervised and unsupervised learning approaches to be able to do behavior modeling and segmentation.
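The if-then-else rule described above can be sketched in a few lines of Python. This is only an illustration: the `Trade` fields, the `large_notional_rule` name, and the 10,000,000 threshold are hypothetical and chosen for the example, not drawn from any regulation or product.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trade:
    trader_id: str
    instrument: str
    notional: float

def large_notional_rule(trade: Trade, threshold: float = 10_000_000.0) -> Optional[str]:
    """A static business rule: if the notional breaches the threshold, raise an alert."""
    if trade.notional > threshold:
        return f"ALERT {trade.trader_id} {trade.instrument} notional above threshold"
    else:
        return None

# Hypothetical trades, invented for illustration only.
trades = [
    Trade("T1", "XYZ", 2_500_000.0),
    Trade("T2", "XYZ", 25_000_000.0),
]

# Keep only the trades for which the rule fired.
alerts = [a for a in (large_notional_rule(t) for t in trades) if a is not None]
for a in alerts:
    print(a)
```

A real surveillance rule set would hold many such rules with thresholds that vary by desk and instrument; the point here is only that each rule is a fixed condition evaluated against trade data.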

But this is key because one of the themes we're seeing out in banking, not just in compliance, but also in areas like AML and fraud detection, is that the people that are coming into fraud are [00:48:30] deeply trained in how financial systems work and regulations work. So they're able to fool any system that uses a rules-based engine approach because rules are themselves static. So what we have to do here is then use machine learning to basically look for any behavioral pattern of a trader that connotes some kind of an outlier behavior that's not usually expected. And that could potentially raise a, you know, a suspicious report that [00:49:00] needs to be investigated. And finally, what organizations are looking to, around using Big Data is to basically do multiple use cases. Not just compliance, or risk data aggregation but also to be able to use all this data that's been collected to drive things around a single view of a client that can show all of the client's positions across multiple desks, what their risk position is, what their KYC score is, and what the next best action for that client is, etc, etc.
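To make the contrast with a static rule concrete, here is a minimal Python sketch that flags a trader's latest activity when it departs sharply from that trader's own history. It uses a plain z-score as a simple statistical stand-in for the machine learning approaches mentioned above, not an implementation of them; the `is_outlier` name, the threshold of 3, and the sample volumes are assumptions for illustration.

```python
from statistics import mean, stdev

def is_outlier(history, latest, z_threshold=3.0):
    """Flag latest if it lies more than z_threshold standard deviations
    from the mean of the trader's own history (needs at least two points)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No variation in the history: any different value is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Hypothetical daily traded volumes for one trader, invented for illustration.
daily_volumes = [100, 110, 95, 105, 90, 100]

print(is_outlier(daily_volumes, 102))
print(is_outlier(daily_volumes, 400))
```

Unlike a fixed threshold, the baseline here moves with each trader's behaviour, which is the property that makes behavioural approaches harder to game than static rules.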

So from a design standpoint, [00:49:30] what we're proposing here, all three of us, is essentially being able to look at a Big Data architecture that can ingest from tens of millions to billions of market events spanning a range of financial instruments. And this data can be ingested using any of the different approaches that a bank might have in an existing way, or to be able to use some of the Big Data architecture provided tools like Sqoop, Kafka, Flume, etc. But then being able to add new business [00:50:00] rules on the data. And then use languages like Python or R to basically perform exploratory data analysis on the data. And also to be able to add more capabilities as different patterns of things like front-running, pumping and dumping, [inaudible 00:50:21] stuffing, etc, etc are all mentioned out there that the regulation also does a great job of writing about. And being able to use these advanced techniques to help [00:50:30] front compliance analysts. So with that in mind, let me just pass it off to Shant for a minute to talk about the compliance toolbox from a UI visualization standpoint.

Shant Hovsepian: Thanks, Vamsi. So, we learned a lot about the great business motivation and drivers for this as well as some of the technology challenges. But what does it all look like? This is the compliance officer toolbox. It gives you the ability to do very fast attribute filtering. This is essential even in situations where you can pre-compute all attributes. But when you also have to do some ad hoc discovery of these attributes, so being able to drill [00:51:00] up and down equity classes, or deep dive into trading desks, as well as the ability to always drill down, click some content to get to the raw data. Look at exactly what your P&Ls look like at the moment. Or what the RFPs or RFQs look like in the situation. And really all this is to go and build a complete picture of trade history quickly across all the markets and exchanges and do order flow reconstruction you can see from the visualization perspective. Next slide, please.

[00:51:30] So to sum up, what are the essential properties that you need for successfully meeting a lot of the regulatory requirements? You need to be able to inspect and issue ad hoc queries with very fast filtering across multiple attributes. Incorporate unstructured data, this is the electronic communication data that's essential from a regulatory requirement to recreate a true point in time picture of trader activity. So being able to leverage free text search engines. Things like Solr provided in the Hortonworks data platform as well as the ability to combine historic and [00:52:00] real-time data visually to correlate current activities with historic trends. Malicious traders, you know, pump and dump, are great situations. As well as then, the ability to embed static and interactive visualizations into case management applications.

A lot of times these regulatory requirements, they boil down to investigating a situation that may happen, and you really need to provide proof and evidence, and no one wants to see an Excel spreadsheet these days, but the ability to go back and give your users or the management an interactive visualization [00:52:30] that they can use to trace the evidence and understand why a case was open and closed. The ability to do email alerting on metric changes. Iterative analysis on subsets of derived data without the need to extract to a spreadsheet. So being able to do iterative analysis without having to dump out to fresh files and re-upload them into the system. And then quickly retrace activity around, you know, large block transactions in a point and click manner. Next slide.

So, just, to sum up for everybody. [00:53:00] Paul, Vamsi, and myself presented a lot of the challenges and data situations where you can leverage Big Data for regulatory requirements. To learn more I encourage you to check out some of the resources that we have online. And, you know, Vamsi manages a great blog. There's the Hortonworks blog about financial services as well as our blog over there. I'm going to hand it back to Duncan to field some of the questions that we got from the audience.

Duncan Paul: Thanks a lot, Shant. So, let me go [00:53:30] back to that original question that we had in for Paul which related to the organizational model. What organizational model should firms adopt to ensure that roles and responsibilities for trade surveillance are understood and that the tools continue to keep pace with market developments?

Paul Isherwood: OK. Thanks for the question. I think that we kind of touched upon this in the presentation in terms of, I think that there's a, certainly a joint responsibility and stakeholders in terms of front office and compliance are really [00:54:00] operating in a collaborative fashion to understand what really would constitute an alert for an abusive behavior and what not. I think from an organizational perspective, what that really demands is that you've got a group which is both business and technology where, that can sit, in many instances potentially within a front office-like capacity but with a clear, kind of, dotted line to compliance around what is it that we're finding in our data? What are the models that we're creating? How is this being represented in, [00:54:30] both within the data lake space, and also within the visualization capabilities to really kind of drive forward the maximizing the usage of both the data lake capabilities and visualization tools. So that really becomes almost like a group of, I guess, data scientists which can really, I guess do two things.

One, start to really provide you the underlying kind of infrastructure and capabilities, and maximizing that infrastructure. And then, two, [00:55:00] as sort of act as kind of data champions and start to articulate and understand potentially where there's some, maybe, business value add services we'd consider solving to this data as well. And I think that, from a, kind of, general data governance perspective, it's just gonna be a realization from CEOs and kind of data organizations that, you know, there will be instances where you do kind of need that superuser capability to really drill down into multiple sources of data, whether that be voice, e-trade, etc, etc [00:55:30] to really maximize, kind of, the value proposition from a trades analyst perspective. And also again, from a potential client value add perspective as well.

Duncan Paul: Fantastic. I'm sure Shant and Vamsi won't mind if I put like, what looks to be the last question back to you again, Paul. And this one is more to do with your experience, you know, what's your experience in transitioning your IT and business teams to a Big Data solution in lieu [00:56:00] of, or in combination with data warehousing platforms?

Paul Isherwood: OK. I think that there's a, I guess, a couple angles there. I think that the experience has been very positive. I think that if, you know, you start to tell the business that we're really capturing the data in its raw form. So going back to that lambda architecture approach where you take the data in, in its raw form and you have the ability to, you know, create models on the top of that data [00:56:30] but usually find actually that you may be missing some insights, then you can go back and actually change the models and change how you're analyzing that data. You're not normalizing data out at the point of ingest. That was a fantastic kind of cheer from the business to a certain extent there.

And I think that, but the other angle is that really, from a technology perspective is that there's been a lot of investment in, kind of, traditional data warehousing platforms. I think that it's safe to say that if you come in with a picture of, well, this is a silver bullet for everything. Then maybe you're [00:57:00] not gonna get heard with the respect that you deserve. So I think that really it's a case of, what I would advise someone in terms of taking a Big Data or data lake type platform into their organization is, really start to look at where are the examples and what are the use cases where the real advantage of this technology can be truly exposed?

So I think that from a business perspective, whole hearted kind of understanding of the power of the platform and what it gives to us. And then the fact [00:57:30] that we can capture multi-years worth of data that just give us a huge amount of insight and ability to regression test what our businesses are doing. And I think that from a technology perspective there's a clear kind of view of that, you know, that the Big Data technologies really have a key element to play in this space in terms of their, the dynamism that they can bring to your technology environment. And there's a huge amount of movement in this space and the tooling available really is, [00:58:00] you know, a great capability for the front office.

Duncan Paul: That's fantastic. And I just wanted to wrap up there and apologize to any of the other outstanding questions. We'll pick those up directly and come back to you. But just as a thank you to Paul. Thank you very much Vamsi and Shant for the insight and the contribution. I've thoroughly enjoyed it. And what just remains is for me to thank everybody for their attendance today. [00:58:30] And we look forward to speaking with you afterwards. And the recording will go out on our respective websites in due course. Thank you very much.

Vamsi Chemitiga: Thanks everyone.

Shant Hovsepian: Thank you everybody.