Bayesian Machine Learning


Bayesian machine learning uses Bayes' theorem to estimate the posterior distribution over a model's parameters given the observed data.

Parameter Estimation

Parameter estimation is a common application of Bayesian machine learning. Given a model with some unknown parameters \theta , we use Bayes' theorem to estimate the probability distribution p(\theta|x) of these parameters given the data x.

To make the posterior easy to calculate, we often select a conjugate prior, so that the posterior belongs to the same family of distributions as the prior.
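As a concrete sketch (a toy Beta–Bernoulli example of my own, not from the lecture): with a Beta(a, b) prior on the success probability \theta of a coin and observed flips x, conjugacy means the posterior is again a Beta distribution, so updating is just arithmetic on the parameters.

```python
# Sketch: Beta-Bernoulli conjugacy. With a Beta(a, b) prior on theta and
# n flips containing k heads, the posterior is Beta(a + k, b + n - k).
def beta_bernoulli_posterior(a, b, data):
    k = sum(data)            # number of successes (heads)
    n = len(data)
    return a + k, b + n - k  # parameters of the Beta posterior

# Prior Beta(2, 2); observe 7 heads in 10 flips.
a_post, b_post = beta_bernoulli_posterior(2, 2, [1] * 7 + [0] * 3)
print(a_post, b_post)  # -> 9 5
```

No integration is needed at all, which is exactly why conjugate priors are convenient.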

CMU lecture video:

Maximum a Posteriori (MAP) Estimation

Calculating the posterior distribution can involve complex integrals that are not directly tractable. MAP estimation is a method to simplify this calculation. Instead of computing the whole posterior distribution, we estimate \theta as the point with maximum posterior probability, which is equivalent to finding the mode of the distribution. The problem then becomes an optimization problem: find the value of \theta that maximizes the posterior probability of the model.
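A minimal sketch of MAP as optimization, using a Beta(9, 5) posterior as an example (my choice, for illustration). Since the posterior is conjugate here, the mode also has a closed form, which lets us check the optimizer; the grid search below stands in for a real gradient-based method.

```python
from math import log

def neg_log_posterior(theta, a, b):
    # Unnormalized negative log density of a Beta(a, b) posterior.
    return -((a - 1) * log(theta) + (b - 1) * log(1 - theta))

# Closed-form mode of Beta(a, b) for a, b > 1: (a - 1) / (a + b - 2).
a, b = 9, 5
closed_form = (a - 1) / (a + b - 2)  # = 8 / 12 = 2/3

# The generic route treats MAP as optimization: minimize the negative log
# posterior (here by simple grid search over (0, 1)).
grid = [i / 10000 for i in range(1, 10000)]
theta_map = min(grid, key=lambda t: neg_log_posterior(t, a, b))
```

Both routes agree at \theta \approx 2/3; the optimization route is the one that generalizes when no closed form exists.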

Although MAP is easier to implement, it also has some problems.

Problem 1: the mode can be an atypical point in the distribution. In a heavily skewed or multimodal distribution, for example, the mode can lie far from most of the probability mass, so it represents the distribution poorly.

[Figure: an example distribution whose mode is an atypical point]

Problem 2: the MAP estimate is not invariant to reparameterization. That is, given a transformation y=f(x), the mode of y does not necessarily equal f(x_{mode}).
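A small numeric sketch of this non-invariance (my own example), assuming Z \sim N(\mu, \sigma^2) and the transformation X = e^Z : the mode of Z is \mu , but the mode of X is e^{\mu - \sigma^2} , not e^{\mu} .

```python
from math import exp, log, pi, sqrt

mu, sigma = 0.0, 1.0

def normal_pdf(z):
    return exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def lognormal_pdf(x):
    # Density of X = exp(Z) for Z ~ N(mu, sigma^2), by change of variables.
    return normal_pdf(log(x)) / x

# Mode of Z is mu = 0, so exp(mode of Z) = 1.0; but the mode of X = exp(Z)
# is exp(mu - sigma^2) ~= 0.368. Locate it numerically:
grid = [i / 10000 for i in range(1, 50000)]
x_mode = max(grid, key=lognormal_pdf)
print(exp(mu), round(x_mode, 3))  # 1.0 vs ~0.368
```

The change-of-variables Jacobian (the 1/x factor) is what shifts the mode, and MAP ignores it.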



Helpful Websites/Blogs

Here is a list of websites and blogs that I find helpful for learning about different topics in Computer Science.

The Morning Paper


The Morning Paper: a short summary every weekday of an important, influential, topical or otherwise interesting paper in the field of computer science.

A good way to keep up with what’s going on in various research areas in Computer Science.


Metacademy

Metacademy is built around an interconnected web of concepts, each one annotated with a short description, a set of learning goals, a (very rough) time estimate, and pointers to learning resources. The concepts are arranged in a prerequisite graph, which is used to generate a learning plan for a concept. Here are the learning plan and graph for deep belief nets.

It is a good place to quickly learn about a specific topic without the need to go through a whole textbook. Currently, the website focuses more on topics related to machine learning and artificial intelligence.


Frenetic: A Network Programming Language

With the development of software-defined networking (SDN) architectures, the network becomes programmable. For example, OpenFlow allows programmers to specify forwarding rules at the controller, and the controller installs these rules on the switches. Unfortunately, network programming interfaces such as OpenFlow and NOX can be difficult to use.

“… while OpenFlow and NOX now make it possible to implement exciting new network services, they do not make it easy.”

These interfaces offer only low-level abstractions. While programming, programmers must reason about switch-level rules, each of which often interleaves logic from multiple tasks.


To Make Network Programming Easier

The goal of this paper is to simplify network programming by designing a higher-level programming language. It starts by identifying three main problems with the existing programming interfaces.

Problem 1: Anti-Modular

Here is an example of the anti-modular design of current network programming interfaces.

[Figure: composing a repeater with a web-traffic monitor, from the Frenetic slides]

When we integrate the code of a repeater and a web-traffic monitor, we need to consider how these functions influence each other at the switch level and carefully arrange the order and priority of each rule.

Problem 2: Two-Tiered Model

The current SDN architecture has a two-tiered model. When a packet arrives at a switch, the switch checks whether it has an existing rule for how the packet should be forwarded. If so, it forwards the packet without notifying the controller. Otherwise, it forwards the packet to the controller. This design has a drawback: the controller does not see all the packets in the network. If we want to monitor all the packets, we have to inspect and modify rules at the switch level.
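The fast-path/slow-path split can be sketched as toy Python (class and method names here are illustrative, not OpenFlow's actual API):

```python
# Toy sketch of the two-tiered model: the controller only ever sees the
# first packet of a flow; later packets match an installed rule locally.
class Controller:
    def packet_in(self, switch, packet):
        action = f"forward toward {packet['dst']}"
        switch.rules[packet["dst"]] = action  # install rule on the switch
        return action

class Switch:
    def __init__(self, controller):
        self.rules = {}                # match -> action, installed by controller
        self.controller = controller

    def handle(self, packet):
        if packet["dst"] in self.rules:
            # Fast path: forward locally; the controller never sees this packet.
            return self.rules[packet["dst"]]
        # Slow path: punt to the controller, which may install a rule.
        return self.controller.packet_in(self, packet)

sw = Switch(Controller())
sw.handle({"dst": "10.0.0.2"})   # first packet goes up to the controller
sw.handle({"dst": "10.0.0.2"})   # subsequent packets are handled by the switch
```

This is exactly why a "monitor everything" task cannot live purely at the controller.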

[Figure: the two-tiered SDN model, from the Frenetic slides]

Problem 3: Network Race Conditions

When a switch S sees a packet p1 that it does not know how to forward, it asks the controller for help with the following steps:

  1. S forwards p1 to the controller C
  2. C computes the forwarding rules r for p1 and sends r to S
  3. S installs the rules r received from C
  4. When S receives a new packet p2 that matches r, it uses r to forward p2
However, since the switches and the controller form a distributed system, race conditions can occur. For example, step 4 can happen before step 3 because it takes time for r to be transferred from C to S. In this case, S would forward p2 to C because it has not received the rules for p2 yet.
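The race can be sketched as a toy event trace (illustrative names only, not real controller code):

```python
# Toy illustration of the race: the rule r for flow f1 is "in flight" from
# controller C to switch S, so a second packet arrives before r is installed.
installed_rules = set()        # rules currently installed at switch S
punted_to_controller = []      # packets S had to send up to C

def switch_receive(pkt):
    if pkt["flow"] in installed_rules:
        return "forwarded by switch"
    punted_to_controller.append(pkt["name"])
    return "sent to controller"

switch_receive({"flow": "f1", "name": "p1"})  # step 1: p1 punted to C
# ... C computes rule r (step 2), but r is still in transit to S ...
switch_receive({"flow": "f1", "name": "p2"})  # step 4 before step 3: p2 punted too!
installed_rules.add("f1")                     # step 3 finally completes
switch_receive({"flow": "f1", "name": "p3"})  # now S handles the flow alone
```

The controller ends up seeing p2 even though a rule for it already exists logically, which a correct program must tolerate.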

Separation of Reading and Writing

All three of these problems in current network programming can be attributed to the lack of a clean separation between reading (i.e., monitoring network conditions) and writing (i.e., specifying forwarding rules). Frenetic solves them by decomposing the language into two tasks: monitoring and forwarding.

Frenetic provides a declarative query language for classifying and aggregating network traffic as well as a functional reactive combinator library for describing high-level packet-forwarding policies.
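A toy illustration of this read/write split (illustrative Python of my own, not Frenetic's actual combinators): monitoring is a declarative query over observed packets, while the forwarding policy is stated separately, so neither has to be rewritten when the other changes.

```python
# Reading: a declarative query -- a predicate plus an aggregation.
def query(packets, where, aggregate):
    return aggregate(p for p in packets if where(p))

# Writing: a forwarding policy -- here a repeater that floods everything,
# stated independently of any monitoring logic.
def policy(packet):
    return "flood"

packets = [
    {"srcport": 80, "size": 1500},
    {"srcport": 22, "size": 100},
    {"srcport": 80, "size": 500},
]

# Monitor: total bytes of web traffic.
web_bytes = query(packets, lambda p: p["srcport"] == 80,
                  lambda ps: sum(p["size"] for p in ps))
# Forward: every packet handled by the policy, untouched by the query.
actions = [policy(p) for p in packets]
print(web_bytes, actions)  # 2000 ['flood', 'flood', 'flood']
```

In Frenetic itself a runtime compiles both halves down to switch-level rules, resolving the ordering and priority conflicts from Problem 1 automatically.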




Network Virtualization in Multi-tenant Datacenters

This is a paper about network virtualization written by researchers at VMware and UC Berkeley.

What is network virtualization?

No one has proposed a formal definition for it yet, so the paper gives its own explanation:

… a network virtualization layer allows for the creation of virtual networks, each with independent service models, topologies, and addressing architectures, over the same physical network.

Sound familiar? Network virtualization is just the network analogue of a hypervisor. While a hypervisor allows for the creation of virtual machines with isolated hardware resources, network virtualization allows a similar separation of network architecture and resources.

The following slide shows the analogy between a hypervisor (the paper calls it a server hypervisor to distinguish it from the network hypervisor) and a network hypervisor. The network hypervisor is implemented on top of standard IP connectivity and offers independent L2, L3, and L4–L7 services for each logical network.


[Figure: server hypervisor vs. network hypervisor, from the paper's slides]

Why do we need network virtualization?

Today, it is difficult for a single physical topology to support the configuration requirements of all of the workloads of an organization.

Since the workloads of different applications in a data center vary greatly, an organization must build multiple physical networks to meet these different requirements. If all of these network topologies could instead be virtualized on top of the same physical network, it would be much more convenient.

How to implement a network hypervisor?

To abstract the network, we need to deal with two aspects of abstraction: packet abstraction and control abstraction.

For packet abstraction, we need to reproduce the packet structure as if the packet were generated in a real physical network.

This abstraction must enable packets sent by endpoints in the MTD to be given the same switching, routing and filtering service they would have in the tenant’s home network. This can be accomplished within the packet forwarding pipeline model.

The control abstraction is more challenging because we must allow tenants to define their own logical datapaths, which is not supported by a real physical network. In NVP, this is implemented with flow tables.

Each logical datapath is defined by a packet forwarding pipeline interface that, similar to modern forwarding ASICs, contains a sequence of lookup tables, each capable of matching over packet headers and metadata established by earlier pipeline stages.

In NVP, these forwarding pipelines are implemented at each server (hypervisor). That is, the whole network structure is implemented in software, and each server runs a replica of that software! Before leaving its host server, each packet goes through the whole forwarding pipeline. After that, the packet is tunneled directly to the destination server.
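A sketch of such a pipeline (the table contents and field names below are invented for illustration): each stage matches on packet headers plus metadata written by earlier stages, just like the sequence of lookup tables the paper describes.

```python
# Toy logical datapath: a sequence of lookup tables. Each table is a list of
# (match, action) pairs; the first matching entry fires and may write
# metadata that later stages can match on.
def run_pipeline(packet, tables):
    metadata = {}
    for table in tables:
        for match, action in table:
            if match(packet, metadata):
                action(packet, metadata)
                break
    return metadata

# Stage 1 (logical L2): tag packets on VLAN 10 as belonging to tenant A.
l2_table = [
    (lambda p, m: p["vlan"] == 10,
     lambda p, m: m.update(tenant="A")),
]
# Stage 2 (logical ACL): drop tenant A packets from a blocked source;
# otherwise tunnel toward the destination's host server.
acl_table = [
    (lambda p, m: m.get("tenant") == "A" and p["src"] == "10.0.0.9",
     lambda p, m: m.update(verdict="drop")),
    (lambda p, m: True,
     lambda p, m: m.update(verdict="tunnel to destination host")),
]

meta = run_pipeline({"vlan": 10, "src": "10.0.0.9"}, [l2_table, acl_table])
print(meta)  # {'tenant': 'A', 'verdict': 'drop'}
```

Because the whole pipeline runs at the sending host, the physical fabric only ever sees the resulting tunnel traffic.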

[Figure: the forwarding pipeline running at each server, from the paper's slides]

Link for video:


Principles of Data Reduction

Key Definitions:

  • The Sufficiency Principle: sufficient statistics, ancillary statistics

T(x) is a sufficient statistic for \theta if T(x) contains all the information in the sample x that is relevant for estimating the parameter \theta.
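For example (a standard illustration, not from the chapter's wording): for i.i.d. Bernoulli(\theta) observations, the likelihood factors as

p(x_1, \dots, x_n | \theta) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{T} (1-\theta)^{n-T}, where T = \sum_{i=1}^n x_i ,

so by the factorization theorem the likelihood depends on the data only through T, making T = \sum_i x_i sufficient for \theta.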

  • The Likelihood Principle

All of the information about \theta in the data is contained in the likelihood function L(\theta | x) (a controversial principle).

  • The Equivariance Principle


Properties of a Random Sample

Key Definitions:

independent and identically distributed random variables (i.i.d.)

sampling from an infinite population, sampling from a finite population (with replacement, without replacement)

sampling distribution (the distribution of a statistic)

order statistics #Def. 5.4.1 p.226

weak law of large numbers (WLLN) #Theorem 5.5.2 p.232

strong law of large numbers (SLLN) #Theorem 5.5.9 p.235
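The WLLN can be illustrated with a quick simulation (a toy sketch of my own, not from the textbook): the sample mean of i.i.d. Uniform(0, 1) draws concentrates around the true mean 0.5 as n grows.

```python
import random

random.seed(0)  # reproducible draws

def sample_mean(n):
    # Mean of n i.i.d. Uniform(0, 1) draws; true mean is 0.5.
    return sum(random.random() for _ in range(n)) / n

# The deviation |X_bar_n - 0.5| shrinks (in probability) as n grows.
for n in (10, 1000, 100000):
    print(n, abs(sample_mean(n) - 0.5))
```

The SLLN strengthens this to almost-sure convergence rather than convergence in probability.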