Brute force is, as the name suggests, an approach to solving problems that relies on sheer weight of computation rather than clever algorithms. Computers are capable of phenomenal computational power, so why not just deploy it to keep guessing until you get the right answer? This approach can be very powerful; however, it is no silver bullet. As programmers, you still have a job to do: identifying and defining constraints that can narrow down the state space.
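To make the idea concrete, here is a small illustrative sketch (my own example, not from any particular project): finding a Pythagorean triple with a given perimeter. A naive brute force over all three sides is roughly a billion guesses; encoding the constraint that the third side is determined by the other two shrinks the state space by three orders of magnitude.

```python
# Brute force: find a Pythagorean triple (a, b, c) with a + b + c = 1000.
# Naive approach: three nested loops over (a, b, c) -- about 10^9 guesses.
# The constraint c = perimeter - a - b eliminates one loop entirely,
# shrinking the state space to roughly 10^6 candidates.

def find_triple(perimeter):
    for a in range(1, perimeter):
        for b in range(a, perimeter - a):
            c = perimeter - a - b  # constraint removes the third loop
            if a * a + b * b == c * c:
                return (a, b, c)
    return None

print(find_triple(1000))  # -> (200, 375, 425)
```

The brute force still does the guessing; the constraint just makes the number of guesses manageable.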
Looker, the business intelligence tool, integrates with version control tools like Git to allow teams to work together on building models. The default setup is for each developer to have their own branch, and for end users to see models from the main production branch. For small projects and changes this works fine; however, its limitations become apparent quite quickly. For example, changes to existing models that require UAT are very difficult to deal with.
We use SymmetricDS to transfer data from Azure to Redshift. For a number of unglamorous reasons, we run a scheduled full reload three times a day to bring our Redshift database into line with Azure. One issue we have is knowing when the full reload has finished (and hence when to merge staging into the main tables). To solve this I wrote a state machine in F# which keeps track of the state of SymmetricDS.
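The original is in F#, but the core idea can be sketched in a few lines of Python. Note that the state and event names below are illustrative, not SymmetricDS's actual event stream: the point is that the reload life cycle becomes explicit states, and the merge only fires on reaching the finished state.

```python
# Hedged sketch of the idea (state/event names are made up, not SymmetricDS's):
# track the reload life cycle as explicit states, and only trigger the
# staging merge once the machine reaches FINISHED.

from enum import Enum, auto

class ReloadState(Enum):
    IDLE = auto()
    RELOADING = auto()
    FINISHED = auto()

# (state, event) -> next state; unknown pairs leave the state unchanged
TRANSITIONS = {
    (ReloadState.IDLE, "reload_requested"): ReloadState.RELOADING,
    (ReloadState.RELOADING, "all_tables_loaded"): ReloadState.FINISHED,
    (ReloadState.FINISHED, "reload_requested"): ReloadState.RELOADING,
}

class ReloadTracker:
    def __init__(self):
        self.state = ReloadState.IDLE

    def handle(self, event):
        self.state = TRANSITIONS.get((self.state, event), self.state)
        if self.state is ReloadState.FINISHED:
            self.merge_staging()

    def merge_staging(self):
        print("merging staging tables into main")

tracker = ReloadTracker()
tracker.handle("reload_requested")
tracker.handle("all_tables_loaded")  # reaches FINISHED and triggers the merge
```

Because illegal transitions are simply ignored, a stray or duplicated event from the replication log cannot push the tracker into a nonsense state.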
I like my architecture to be as decentralized as possible. Systems that communicate through message passing and message queues allow components to be more easily managed, maintained and scaled. In this spirit, the various parts of my data integration tool communicate primarily through a central message queue. While this is great for the system, a message queue can be hard for people to interact with. Often it requires digging into system settings (MSMQ), web interfaces (SQS) or using programming languages.
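As a minimal sketch of the pattern (using Python's in-process `queue` module in place of a real broker like MSMQ or SQS): the producer and consumer below share nothing except the queue, so either side can be swapped out or scaled without the other knowing.

```python
# Minimal sketch of components decoupled by a central queue.
# In production this would be MSMQ, SQS, etc.; here the in-process
# queue.Queue stands in for the broker.

import queue
import threading

bus = queue.Queue()

def producer():
    for i in range(3):
        bus.put({"event": "file_loaded", "id": i})
    bus.put(None)  # sentinel: no more messages

def consumer(results):
    while True:
        msg = bus.get()
        if msg is None:
            break
        results.append(msg["id"])

results = []
worker = threading.Thread(target=consumer, args=(results,))
worker.start()
producer()
worker.join()
print(results)  # -> [0, 1, 2]
```

The inconvenience described above is exactly this: a human wanting to peek at `bus` has no friendly way in, which is what motivated building a nicer interface.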
We use the business intelligence tool Looker to give users access to our data. Generally, it does a good job of providing a nice front end for the data, but we have experienced difficulties with the caching of views. Looker works by having a set of “views”, which are essentially abstractions of SQL queries. Each view draws on data from a combination of SQL tables and other views. In our setup we have a lot of hierarchical links between views, in the spirit of DRY.
The presentation will be done through the medium of Jupyter notebooks. Below are different notebooks at varying stages of completeness. I recommend trying to complete everything yourself as we go along, but if you get stuck or lost then you can jump to a more completed workbook. Data: download here. Designed to be roughly similar to DCM path-to-conversion data. Each user has an id and between one and two rows.
In a previous post we talked about how model averaging allows us to combine models to produce estimates that are better than those of any individual model. There I stated that the actual mechanics are needlessly complicated to do by hand and that we should outsource them to a package. We use pyBMA to do this for survival data in Python. The module is based on the R package BMA.
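For pyBMA's own API, its documentation is the right reference; but as a hedged illustration of the mechanics being outsourced, here is the standard BIC approximation to posterior model weights in plain numpy (the BICs and coefficient estimates below are made-up numbers, not output from any real fit).

```python
# Hedged sketch of the core of Bayesian model averaging (not pyBMA's API):
# each candidate model gets a posterior weight proportional to exp(-BIC/2),
# and the averaged estimate is the weighted sum of per-model estimates.

import numpy as np

def bma_weights(bics):
    bics = np.asarray(bics, dtype=float)
    rel = np.exp(-0.5 * (bics - bics.min()))  # subtract min for numerical stability
    return rel / rel.sum()

# Illustrative (made-up) BICs and per-model estimates of one coefficient
bics = [100.2, 101.7, 108.3]
betas = np.array([0.50, 0.62, 0.48])

w = bma_weights(bics)
print(w.round(3))                # weight concentrates on the lowest-BIC model
print(float(np.dot(w, betas)))   # the model-averaged estimate
```

The appeal is that the averaged estimate hedges across models instead of betting everything on one specification; packages like pyBMA handle the hard part, which is enumerating and fitting the candidate models.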
MCMC is essential to fitting Bayesian models where there are no closed-form solutions. Without it, it would not be feasible to move past simple models using conjugate priors. While it is possible to learn to use it purely as a black box, doing so risks limiting one’s understanding in the long run. Moreover, it provides a great deal of intellectual satisfaction. Given there are already excellent resources for learning more about MCMC, I’m not sure I can provide more value than directing you to the following links.
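Before following those links, it may help to see how little code the core idea requires. Here is a minimal random-walk Metropolis sampler (a toy sketch, not production MCMC) drawing from a standard normal target:

```python
# Minimal random-walk Metropolis sampler -- a toy sketch of the core MCMC
# loop, not a production sampler. Target here is the standard normal.

import numpy as np

def metropolis(log_target, n_samples, x0=0.0, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x, lp = x0, log_target(x0)
    for i in range(n_samples):
        proposal = x + step * rng.normal()
        lp_proposal = log_target(proposal)
        # accept with probability min(1, target(proposal) / target(x))
        if np.log(rng.uniform()) < lp_proposal - lp:
            x, lp = proposal, lp_proposal
        samples[i] = x
    return samples

log_std_normal = lambda x: -0.5 * x * x  # log density up to a constant
draws = metropolis(log_std_normal, 20000)
print(draws.mean(), draws.std())  # roughly 0 and 1
```

Note that the target only needs to be known up to a normalizing constant, which is precisely why the method unlocks models without closed-form posteriors.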
Two common problems in statistics are model selection and robustness testing. A big issue is that the usefulness of a test statistic is conditional on the model being correctly specified. This means that we cannot rely on simple tests. One broad category of solutions to this problem is to try a number of different models and somehow compare them. In practice this normally involves either sequentially adding or removing covariates and seeing how the overall model performance changes.
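One simple version of "add covariates and compare" is greedy forward selection scored by AIC. The sketch below (synthetic data, numpy only) keeps adding the covariate that most improves AIC until nothing helps:

```python
# Hedged sketch of sequential covariate selection: greedy forward selection
# on a linear model, scoring each candidate set with AIC computed from the
# residual sum of squares. Data is synthetic; only columns 0 and 2 matter.

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)

def aic(cols):
    design = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    k = design.shape[1]
    return n * np.log(rss / n) + 2 * k  # lower is better

selected, remaining = [], [0, 1, 2, 3]
best = aic(selected)
while remaining:
    scores = {c: aic(selected + [c]) for c in remaining}
    c, s = min(scores.items(), key=lambda kv: kv[1])
    if s >= best:
        break
    selected.append(c)
    remaining.remove(c)
    best = s

print(sorted(selected))  # the informative covariates (0 and 2) should appear
```

The well-known caveat applies: greedy stepwise procedures can overfit and the order of inclusion matters, which is part of why the post goes on to discuss better alternatives.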
To fit the Cox Proportional Hazards model we have to maximize the partial likelihood of beta. Thankfully, we don’t have to face up to this somewhat menacing optimization ourselves, because we can use the great package lifelines, written by Cam Davidson-Pilon. I recommend the tutorial on the lifelines website here on how to do it, but I have included a code snippet below inspired by that tutorial because I want to add
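To see what lifelines is maximizing on our behalf, here is a hedged numpy sketch of the negative log partial likelihood itself, for a single covariate with no tied event times and entirely made-up data (lifelines uses proper Newton-Raphson; the grid search below is only for illustration):

```python
# Hedged sketch of the Cox partial likelihood (what lifelines maximizes
# for us), for one covariate, no tied event times, and toy data.

import numpy as np

def neg_log_partial_likelihood(beta, times, events, x):
    order = np.argsort(times)  # risk sets are easier to form on sorted data
    times, events, x = times[order], events[order], x[order]
    eta = beta * x
    nll = 0.0
    for i in range(len(times)):
        if events[i]:                 # only observed events contribute terms
            risk_set = eta[i:]        # everyone still at risk at time t_i
            nll -= eta[i] - np.log(np.exp(risk_set).sum())
    return nll

# Tiny made-up data: times, event indicators (0 = censored), one covariate
times = np.array([5.0, 8.0, 3.0, 12.0, 7.0])
events = np.array([1, 1, 1, 0, 1])
x = np.array([0.0, 1.0, 1.0, 0.0, 1.0])

# Crude grid search standing in for lifelines' Newton-Raphson
grid = np.linspace(-3, 3, 601)
beta_hat = grid[np.argmin([neg_log_partial_likelihood(b, times, events, x)
                           for b in grid])]
print(beta_hat)  # the grid-search estimate of beta
```

Even on five observations the bookkeeping around risk sets and censoring is fiddly, which is exactly why handing the real optimization to lifelines is the right call.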