A deep take on countability, cardinality and ordering

I’ve been teaching CMSC250, Discrete Mathematics, over the past year at UMD CS. Last semester, I wrote a more philosophical than mathematical post on Countability, Cardinality and Ordering, which I’m repeating here for the community’s sake.


After our ordinality lecture last Tuesday, I had a student come to me and tell me that they were not sure how to think about ordinality: they understood the relationship between cardinality and size, since it is somewhat intuitive even for infinite sets (at least to them!), but ordinality still appeared esoteric. That’s 100% natural, and in this post I’ll try to stray away from the math and explain how I think about countability, cardinality and ordinality intuitively. This post has exactly zero things to do with the final, so if you want to limit your interactions with this website to the exam-specific, you may stop reading now.

Before we begin, I would like to remind you of a definition that we had presented much earlier in the semester, I believe during an online quiz: A set S is dense if between any two elements of it, one can find another element. Note something interesting: only ordered sets can be qualified as dense or not! Technically, we had not presented the notion of an ordered set when we discussed dense sets, but it is intuitive enough that people can understand it.

Countability

We say that any enumerable set is countable. Enumerable, mathematically, means that we can find a bijection from the non-zero naturals to the set. Intuitively, it means “you start from somewhere, and by sequentially making one step, no matter how long it takes, you are guaranteed to reach every single element of the set in finite time”. Whether this finite time will happen in one’s lifetime, in one’s last name’s lifetime, or before the heat death of the universe, is inconsequential to both the math and the intuition. Clearly, this is trivial to do for either the non-zero naturals or the full set of naturals: you start from either 1 or 0, and then you make one step “forward”.

However, we also saw in class that it is possible to generalize this to the full set of integers: we start from 0 and then start hopping around and about zero, making bigger hops every time. Those hops are our steps “forward”.
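
To make the hopping concrete, here is one explicit formula for it (my own indexing; the lecture may have used a different but equivalent one), giving a bijection from the non-zero naturals to the integers:

LaTeX: f(n) = \begin{cases} \frac{n}{2}, & n \text{ even} \\ -\frac{n-1}{2}, & n \text{ odd} \end{cases}

so that LaTeX: f(1) = 0,\ f(2) = 1,\ f(3) = -1,\ f(4) = 2,\ f(5) = -2,\ \dots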

Those results are probably quite intuitive to you by now, and I feel that the reason for this might be that both LaTeX: \mathbb{N} and LaTeX: \mathbb{Z} are non-dense sets. There are no naturals or integers between LaTeX: n and LaTeX: n+1 (LaTeX: n \in \mathbb{N}  or LaTeX: n \in \mathbb{Z} ).

Let’s stray away from LaTeX: \mathbb{Q}  for now and fast-forward to LaTeX: \mathbb{R} . We have already seen the mathematical reason that the set of reals is uncountable: Cantor’s diagonalization. But what’s the intuition? Well, to each their own, but here’s how I used to think about it as a student: suppose that I start from zero, just to make things easier with respect to my intuitive understanding of the real number line (I could’ve just as well started with LaTeX: -e^8).

Then, how do I decide to make my step forward? Which is my second number? Is it 0.1? Is it -0.05? No matter which number I pick as my second, am I not leaving infinitely many choices in between, rendering it necessary that I recursively look into this infinite interval? Note that I have not qualified “infinite” with “countably infinite” or “uncountably infinite” yet. This was my personal intuition, as a Discrete Math student about 11 years ago, for why LaTeX: \mathbb{R}  is uncountable: even if you assume that you can start from 0, there is no valid way for you to reach the second element in the sequence of reals! Therefore, such a sequence cannot possibly exist!

But hold on a minute; is it not the case that this argument can be repeated for LaTeX: \mathbb{Q} ? Sure it can, in the sense that between, say, LaTeX: 0 and LaTeX: \frac{1}{2}, there are still infinitely many rationals. It is only after we formalize the math behind it all that we can say that this is a countable infinity and not an uncountable one, as is the case for the reals. But still, we have to convince ourselves: why in the world does the fact that every one of these infinitely many numbers can be expressed as a ratio of integers make that infinity smaller than that of the reals?

Here’s another intuitive reason why we will be able to scan every single one of these numbers in finite time: everybody, open the slide where we prove to you that LaTeX: \mathbb{Q}^{>0}  is countable using the snaking pattern. Make the crucial observation that every one of the diagonals scans fractions for which the sum of the numerator and the denominator is constant! The first diagonal scans the single fraction (LaTeX: \frac{1}{1}) where the sum is 2. The second one scans the fractions whose numerator and denominator sum to 3 (LaTeX: \frac{1}{2},\ \frac{2}{1}). In effect, the LaTeX: i^{th} diagonal scans the following fractions:

LaTeX: \{ \frac{a}{b} \mid (a,b \in \mathbb{N}^{\geq 1}) \land (a + b=i+1)\}

For those of you that know what equivalence classes are, we can then define LaTeX: \mathbb{Q}^{>0}  as follows:

LaTeX: \mathbb{Q}^{>0} = \bigcup_{i \in \mathbb{N}}\{ \frac{a}{b} \mid (a,b \in \mathbb{N^{\geq 1}}) \land (a + b=i+1)\}

Let’s see this in action…

LaTeX: \mathbb{Q}^{>0} = \{ \color{red}{\underbrace{ \frac{1}{1}}_{i=1}}, \color{blue}{ \underbrace{\frac{1}{2}, \frac{2}{1}}_{i=2}} , \color{brown}{\underbrace{\frac{1}{3}, \frac{2}{2}, \frac{3}{1}}_{i=3}}, \dots \}

Note that, essentially, with this definition we have defined a surjection from LaTeX: \mathbb{N^{\geq 1}} \times \mathbb{N^{\geq 1}} onto LaTeX: \mathbb{Q}^{>0} (it is not quite a bijection, since, e.g., LaTeX: \frac{1}{1} and LaTeX: \frac{2}{2} are the same rational, but a surjection from a countable set is all we need). We know that LaTeX: \mathbb{N}^{\geq 1} \times  \mathbb{N}^{\geq 1} is countable, so we now know that LaTeX: \mathbb{Q}^{> 0}  is also countable! 🙂
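
If you want to see the snaking pattern spelled out programmatically, here is a tiny MATLAB sketch (my own illustration, not something from the slides) that prints the first few diagonals; the LaTeX: i^{th} diagonal lists the fractions whose numerator and denominator sum to LaTeX: i+1:

% Print the first few diagonals of the snaking enumeration of Q^{>0}.
for i = 1:4
    for a = 1:i               % numerator runs from 1 to i
        b = i + 1 - a;        % denominator chosen so that a + b = i + 1
        fprintf('%d/%d ', a, b);
    end
    fprintf('\n');            % one line per diagonal
end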

Let’s return now to the original challenge that we (I?) are faced with: we have selected 0 as our first element in the enumeration of both LaTeX: \mathbb{Q}  and LaTeX: \mathbb{R}  (the latter enumeration is assumed to exist), and no matter which our second element is (say it’s LaTeX: \frac{1}{2}), we have infinitely many elements of both sets between 0 and LaTeX: \frac{1}{2}. But now we know that those infinities are different: in the case of LaTeX: \mathbb{Q} , we know for a fact that we will reach every one of those fractions whose decimal values are in LaTeX: (0, 0.5). In the case of LaTeX: \mathbb{R} , there is no such enumeration: any enumeration we define will still leave an… uncountably infinite gap between any two elements in “sequence”.

Remember how in our lecture on Algebraic and Transcendental numbers, we gave only three examples of numbers in LaTeX: TN, yet the fact that LaTeX: TN is uncountable while LaTeX: ALG is countable guarantees that there are “many more” Transcendental numbers than Algebraic? The same thing applies here with the rationals and irrationals: given any interval of real numbers LaTeX: (r_1, r_2), there are many more irrationals than rationals inside that interval... If you define a system of whole numbers (integers), there are many more quantities that you will not be able to express as a ratio of integers. That’s why, back in the day (circa 300 B.C.), when Euclid proved that LaTeX: \sqrt 2 is not expressible as such a ratio LaTeX: \frac{a}{b} (or, more accurately, that LaTeX: 2 cannot be expressed as the square LaTeX: \frac{a^2}{b^2}), his result was so unintuitive; those Hellenistic people did not have rulers. They did not have centimeters or other accepted forms of measurement. The only things they had were shoestrings, or planks of wood which they put in line and “saw” were the same length, and then they measured everything else as a ratio of such “whole” lengths.

Cardinality

Recall something that we said when we were discussing the factorial function and its combinatorial interpretation when applied to positive integers. Bill’s explanation of why LaTeX: 0!=1 was purely algebraic: if it were LaTeX: 0, then, given the recursive definition LaTeX: n!\:=\:n\:\cdot\left(n-1\right)! for LaTeX: n\:\ge1, every LaTeX: n! would be LaTeX: 0, rendering it a pretty useless operation. My explanation was combinatorial: we know that if we have a row of, say, LaTeX: n marbles, there are LaTeX: n! different ways to permute them, or LaTeX: n! different orderings of those marbles. When there are no marbles, so LaTeX: n=0, there is only one way to order them: do nothing, and go watch Netflix.
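
To see Bill’s algebraic point in a single line, plug LaTeX: n=1 into the recursive definition: since LaTeX: 1!=1, we need

LaTeX: 1! = 1 \cdot 0! = 1 \;\Rightarrow\; 0! = 1,

whereas defining LaTeX: 0!=0 would propagate through the recursion and make every LaTeX: n! equal to LaTeX: 0.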

Let’s stick with Bill’s interpretation for a moment: the fact that some things need to be defined in order to make an observation about the real world work. In this case, the real world is defined as “algebra that makes some goddamn sense”. My explanation is more esoteric. You could say: “What do you mean there’s only one way to arrange zero things? I don’t understand, if there are zero things and there’s nothing to do, shouldn’t there be, like, 0 ways to arrange them?”. So, let’s stick with Bill’s interpretation to explain something that I attempted to explain to a group of students after our first lecture this semester: Why do negative numbers even exist?

Here’s one such utilitarian explanation: because without negative numbers, Newtonian Physics, with its tremendous applications in the real world, would not work. That is, the model of Newtonian Kinematics with its three basic laws, which has been empirically shown to describe very well things that we observe in the real world, needs the framework of negative numbers in order to, well, work. So, if you’re not OK with the existence of negative numbers, you had better also be able to describe to me a framework that explains a bunch of observations of the real world in some way that doesn’t use them. For example, you probably all remember the third law of Newtonian motion: for every action LaTeX: \color{red}{\vec{F}}, there exists an equal and opposite reaction LaTeX: \color{red}{-\vec{F}}.

Recall that force is a vector quantity, since LaTeX: \vec{F} = m \cdot \vec{a}, and the acceleration LaTeX: \vec{a} is clearly a vector quantity, being the second derivative of the position LaTeX: \vec{x}.

The only way Newton’s third law of motion can work is if LaTeX: \vec{F} + (-\vec{F}) = \vec{0}. This is only achievable if the two vectors have the same magnitude but exactly opposite directions. No other way. Hence the need to be able to write, component by component:

LaTeX: \vec{F} = (F_x, F_y, F_z),\ \ -\vec{F} = (\color{red}{-}F_x, \color{red}{-}F_y, \color{red}{-}F_z),\ \ ||\vec{F}|| = ||-\vec{F}|| = m\,||\vec{a}||

and the necessity for negative numbers becomes clear. Do you guys think the ancient Greeks or Egyptians cared much for negative numbers? They were building their theories in terms of things they could touch, and things that you can touch have positive mass, length, height…

Mathematics is not science. It is an agglomeration of models that try to axiomatize things that occur in the real world. For another example, ZFC Theory was developed in place of Cantorian Set Theory because Cantorian Set Theory can lead to crazy things such as Russell’s Paradox. Therefore, ZFC had to add more things to Set Theory to make sure that people can’t do crazy stuff like this. If we discover contradictions with the real world given our mathematical model, we have to refine our model by adding more constraints to it. Fewer constraints, more generality, potential for more contradictions. More constraints, less generality, fewer contradictions, but also more complexity.

So when discussing the cardinality of LaTeX: \mathbb{N}  and LaTeX: \mathbb{Z}  and finding it equal to LaTeX: \aleph_0, we are faced with a problem with our model: the fact that LaTeX: \color{magenta}{\mathbb{N} \subset \mathbb{Z}} (I have used the notation of proper subset here deliberately). Now, I just had a look at our cardinality slides, and it is with joy that I noticed that we don’t use the subset / superset notation anywhere. That’s gonna prove a point for us.

So, back to the original problem: intuitively understanding why the hell LaTeX: \mathbb{N}   and LaTeX: \mathbb{Z}  have the same cardinality when, if I think of them on the real number line, I clearly have LaTeX: \mathbb{N} \subset \mathbb{Z}:

LaTeX: \underbrace{\dots, -4, -3 , -2, -1, \underbrace{0, 1, 2, 3, 4, \dots}_{\mathbb{N}} }_{\mathbb{Z}}

The trouble here is that we have all been conditioned from childhood to think about the negative integers as “minus the corresponding natural”. This conditioning is not something bad: it makes a ton of sense when modeling the real world, but when comparing cardinalities between infinite sets, that is, sets that will never be counted entirely in finite time, we distance ourselves from the real world a bit, so we need a different mathematical model. To that end, let’s build a new model for the naturals. Here are the naturals under our original model:

LaTeX: 0, 1, 2, 3, \dots

These digits that we have all agreed to use have not been around forever. The ancient Greeks used lowercase versions of their alphabet: LaTeX: \alpha, \beta, \gamma, \delta , \epsilon, \sigma \tau ', \zeta,\ \dots\ \omega  to name a total of 25 “digits”, while the Romans used a subset of their alphabet “stacked” in a certain way: LaTeX: I, II, III, IV, V, VI, \dots, X, XI\dots . These “stacked” symbols cannot really be called digits the way that we understand them, especially since new symbols appear long down the line (LaTeX: C, M, etc.). The digits we use today we actually owe to the Arabic Renaissance of the early Middle Ages.

The point is that I can rename every single one of these numbers in a unique way and still end up with a set that has the exact same properties (e.g. closure of operations, cardinality, ordinality) as LaTeX: \color{red}{\mathbb{N}}. This is formally captured by the Axiom of Replacement. So, let’s go ahead and describe LaTeX: \mathbb{N}  by assigning a random string to every single number, assuming that no string is used twice:

LaTeX: foo, bar, otra, zing, tum, ghi,\dots

Which corresponds to our earlier

LaTeX: 0, 1, 2, 3, 4, 5,\dots

Cool! Now the axiom of replacement clearly applies to LaTeX: \mathbb{Z}  as well, so I will rewrite

LaTeX: \dots, \color{blue}{-5, -4, -3, -2, -1,}\ \color{magenta}{0, 1, 2, 3, 4, 5,}\dots

into:

LaTeX: \dots, \color{blue}{qwerty, forg, vri, zaq,  nit,}\ \color{magenta}{bot, ware, yio, bunkm, ute, kue,}\dots

Call these “transformed” sets LaTeX: \mathbb{N}_{new} and LaTeX: \mathbb{Z}_{new} respectively. Under this encoding, guys, I believe it’s a lot more obvious that LaTeX: \mathbb{N}_{new} \not\subset \mathbb{Z}_{new} in the general case. LaTeX: \mathbb{N}_{new} \subset \mathbb{Z}_{new} under these random encodings is so not-gonna-happenish that its probability is not even axiomatically defined. Therefore, now we can view LaTeX: \mathbb{N}  and LaTeX: \mathbb{Z}  as infinite lines floating around space, lines that we have to somehow put next to each other and see whether we can line them up exactly. If you tell me that even under this visualization, the line that represents LaTeX: \mathbb{Z}_{new} is infinite in both directions, whereas that of LaTeX: \mathbb{N}_{new}  has a starting point (0), then I would tell you that I can effectively “break” the line that represents LaTeX: \mathbb{Z}_{new} in the middle (0) and then mix the two lines together according to the mapping that corresponds to:

LaTeX: 0, 1, -1, 2, -2, 3, -3, \dots

Now we no longer have the pesky notation of the minus sign, which pulls us to scream “But the naturals are a subset of the integers! Look! If we just take a copy of the naturals and put a minus in front of them, we have the integers!”. We only have two infinite lines that start from somewhere and extend infinitely, and it is up to us to find a 1-1 and onto mapping between them. That is, it is up to us to find such a mapping between:

LaTeX: foo, bar, otra, zing, tum, ghi,\dots

and

LaTeX: bot, ware, nit, yio, zaq, bunkm\dots

(Note that I re-ordered the previous encoding LaTeX: \dots, \color{blue}{qwerty, forg, vri, zaq,  nit,}\ \color{magenta}{bot, ware, yio, bunkm, ute, kue,}\dots according to the “hopping” map into  LaTeX: \color{magenta}{bot}, \color{magenta}{ware}, \color{blue}{nit,} \color{magenta}{yio}, \color{blue}{zaq}, \color{magenta}{bunkm},\dots .)

Under this “visual”, you guys, it makes a lot of sense to try to estimate if the two sets have the same cardinality and, guess what, they do 🙂

Not much else to say on this topic, everyone. We can have a bunch of applications of the axiom of replacement to prove, for example, that the cardinality of the integers, LaTeX: \aleph_0, is also the cardinality of LaTeX: \mathbb{N} \times \mathbb{N}, of LaTeX: \mathbb{Q} , etc. It is only when we start considering sets such as LaTeX: \mathbb{R} , \mathcal{P}(\mathbb{N}) and LaTeX: \{0, 1 \}^\omega  that this idea that we can be holding two infinite lines in space fails.
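
For the LaTeX: \mathbb{N} \times \mathbb{N} case mentioned above, one standard witness (not necessarily the one from lecture) is the Cantor pairing function, which maps every pair of naturals to a unique natural:

LaTeX: \pi(a, b) = \frac{(a+b)(a+b+1)}{2} + b

This is a bijection from LaTeX: \mathbb{N} \times \mathbb{N} to LaTeX: \mathbb{N}, and it walks the grid of pairs diagonal by diagonal, much like the snaking pattern we used for LaTeX: \mathbb{Q}^{>0} earlier.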

Ordinality

There’s not much to say here except that the easiest way to understand how an ordering differs from a set is to consider an ordering exactly as such: an order of elements! Think in terms of “first element less than second, less than third, less than…”. The simplest way possible. It is then that we can prove rather easily that LaTeX: \omega \prec \zeta \prec \eta .

Things only become a bit more complicated when considering the ordering LaTeX: \omega + \omega:

LaTeX: 0 < \frac{1}{2} < \frac{3}{4} < \frac{5}{6} < \dots <1 < \frac{4}{3} < \frac{3}{2} < \frac{8}{5} <\dots

Please note that this ordering is clearly not the same as LaTeX: \eta, the ordering of LaTeX: \mathbb{Q} . Between the first and the second element, for instance, there are countably infinitely many rationals: LaTeX: \frac{1}{100}, \frac{2}{5}, \dots , \frac{3}{7}\dots  which are not included in the ordering.

Finally, realize the meaning of “incomparable” orderings: a pair of orderings LaTeX: \alpha, \beta  will be called incomparable if, and only if:

LaTeX: (\alpha \npreceq \beta) \wedge (\beta \npreceq \alpha).

So please realize that this is not the same as saying, for instance, LaTeX: \beta \nprec\alpha .

I think this is all. I am bothered when I can’t explain something well to a student, so I thought I’d share my views on countability in case this makes the subject easier to grasp.



Piazza sucks.

I’m in Academia. Well, at least the part of Academia that’s still related to actual teaching.

The vast majority of my collaborators at UMD as well as at other institutions use Piazza to host their courses. It’s easy, it’s fast and, at the very least, it looks like it has close to 100% uptime (written at a time when our UMD fork of Instructure’s Canvas has been down for hours).

However, active development on Piazza has effectively stopped since early 2015. During Fall 2016, I would look at the Android Play Store for updates to the Piazza app, and for a long time the latest one dated back to February 2015. Right now, it appears that certain patches have been made as recently as Feb 7, 2017, but the app is still atrocious. This is just an example of the many issues that surround Piazza.

The most major issue, for which I just submitted a bug report, is that there is no filesystem consistency on Piazza. If you upload a file to the “Resources” tab and post links to it in discussion topics, then, when you make a change to the file, all the existing links become stale: they point to fixed locations in an Amazon S3 filesystem. When the number of links to a file grows, this becomes a huge problem.

Furthermore, the only way to password-protect your page right now (of importance to any person who needs a private discussion forum) is to actually send an e-mail to team@piazza.com with your password choice, which of course is then stored in cleartext in your e-mails. My request was handled immediately, but what happens if you need to password-protect your page during a weekend? I use Piazza to communicate with my TAs, and the information conveyed is often sensitive (thoughts on midterms, recitation topics, rubrics). I don’t want students snooping around (it’s already happened once this semester).

Piazza is perfect for communication: students love it, because other students can immediately offer responses. In contrast, nobody ever uses the Canvas “Discussion” feature. However, in departments where the average course registration is in the hundreds (like ours), moderating such a huge forum requires TAs dedicated to doing only that. It’s not impossible, but it’s hard. Ensuring that solutions to problems under active submission don’t leak is tough. A student can take down a post within minutes, yet a PDF with solutions can already have reached a good portion of the class.

But all of these things are simple issues of technical decisions and design. They could be addressed through either bug reports or (a)synchronous brainstorming sessions using tools like Confluence or HipChat. What really bugs me is how the Piazza team doesn’t seem to care any more about the product, shifting their entire focus to making it yet another recruitment platform, which they call “Piazza Careers”. Seriously? That’s what students need? Another recruitment platform? I’m guessing that the Piazza team had some sort of shift in their venture capital and it was required of them to transform the entire platform into another recruiting platform.

It’s a real bummer. Blackboard has been a disaster (link) and Canvas has too many issues to discuss in a blog post. Active Learning services like TopHat are breakable when people use their phones. I recently had a student pretty much admit to me during my office hours that they never attended the lectures of a certain math course, but had a friend of theirs text them the TopHat code and the proper answers to the questions. The clickers and accompanying software offered by Turning Technologies have multiple connectivity issues, even though those issues mostly have to do with the ELMS-CANVAS integration and not the software or the clicker device itself.

We need reliable educational software. Not another recruitment platform. Pooja Sankar, an alumna of CS UMD and the founder of Piazza, ought to be the first person to recognize this.

 

 

I had my Discrete Math students critique Charlie the Unicorn, and here is how some responded.

Title says it all. This summer I’m teaching CMSC 250, “Discrete Structures” (really, this is a misnomer; I have no idea why we don’t call it “Discrete Mathematics”), to undergraduate students in the Department of Computer Science at UMD. As one of the requirements of the course, I had them review the epic saga of Charlie the Unicorn and submit a short essay. Now, I knew these kids were bright and had a sense of humor, yet once again they surpassed all expectations.

Here are anonymous excerpts of what was handed to me:

The “Charlie the Unicorn” series has taught me about the dangers of the world we live in today. Life isn’t always rainbows and unicorns and I’m pretty glad it isn’t. That world seems messed up. There are tons of two-faced people out there and it is important to read through them or else they will get you to think you are the banana king and steal your stuff.

Why yes indeed, you never know when that might happen.

The Pink and Blue Unicorns are sociopathic robbers who are unable to distinguish reality from fantasy, as well as being able to force their fantasies onto others through either hypnosis or hallucinogenic drugs. It is obvious that these two unicorns are a threat to society and need to be put into an insane asylum and be rendered unable to create their fantasy worlds.

Ouch! So much for second chances.

While Charlie is being manipulated, they continually make fun of him and steal his belongings. These acts seem to be unprovoked and only cause them enjoyment; they gain no real reward from these acts. Each situation they get Charlie into results in a catchy song followed by the immediate death of the performer.

My favorite part of Charlie the unicorn was the part when Charlie is convinced that he is the banana king. It’s probably true that if you levitate and shine and light on someone you could probably convince them of anything.

In an attempt to make sense of this video, the only conclusion that I could come to was that this is what Jason Steele, the creator of Charlie the Unicorn, experienced while higher than a kite. I would imagine that his stoner hallucinations were best manifested in a video where he and his friends were portrayed by unicorns, so that is exactly what Steele created.

Some people were more introspective than others:

Charlie the Unicorn is a politically-themed satire lambasting both Democratic and Republican politicians alike.  In the video, Democrats are symbolized by the blue horse and Republicans, the red.  The third horse, Charlie, represents the average citizen, with his white color additionally connoting the average citizen’s relative innocence and naïveté in politics.  The blue and red horses—henceforth referred to as “the purple horses”—employ fanciful promises and extreme enthusiasm to slowly goad the white horse—who is initially reluctant—into travelling to Candy Mountain with them.  This journey represents an ordinary citizen being stirred out of political apathy by the campaigning of a compelling politician spouting ideals, hopes, and promises of a better tomorrow.  However, the motivations of the purple horses were not so noble or selfless;[…]

While others actually hinted towards inductive reasoning / rule learning:

Each adventure involves an annoying commute to the destination with the pink and blue unicorn, arriving at the destination, receiving a song, having the singer blow up, and then Charlie somehow being put into danger. From this pattern, we can build an implication relationship which Charlie quickly learned. If Charlie goes on an adventure with the pink and blue unicorn, then he will be put in danger. As far as then fourth chapter of their adventures, this rule has been valid. But we do not know for sure if it will apply for future episodes.

Or human persuasion techniques:

To me the fact that there are 3 unicorns was interesting. People tend to believe when more than 3 people start believing some idea. For example if 3 people points to the sky in the middle of the road, other people start looking at the sky since people think there must be reason the 3 people are pointing to the sky. It is called the “Power of 3”.

This person, along with the person who provided the politically themed comments, seemed to be the ones closer to what the Internet believes the videos to be about:

One thing I did find interesting throughout all the episodes is that no matter how evil the things were Pink and Blue unicorn did to Charlie were (like taking his kidney), he went on every single adventure with them. After losing my kidney or my belongings by hanging out with my friends I wouldn’t want to hang out with them anymore. I don’t know if they’re necessarily Charlie’s friends to begin with which makes me question his decisions to follow them even more. In the last episode, Pink and Blue unicorn tried to take his life, but starfish came and rescued Charlie. I honestly could not stop laughing when starfish told Charlie that he was a star and then when Charlie made the wish, starfish’s eyes burned out. I was questioning why starfish was so in love with Charlie in the third episode, but good thing he was a starfish or else Charlie wouldn’t have lived. YOLO. I wonder why Pink and Blue unicorn were able to take everything away from Charlie except for his life. Was the creator trying to tell us something there? Whatever, I’m not going to think too much into it. A+, 10/10 would watch again.

Finally, if you’re interested in finding what the Internet thinks these videos are about, (a) You have a serious problem and (b) Here you go:

Thirty Seconds Flat

Yesterday night I watched the 1995 Michael Mann crime epic Heat for the umpteenth time. It is my understanding that the movie’s not particularly appreciated, and it’s definitely not among Mann’s most well-known titles. Critics and movie-goers tend to think of The Insider or The Last of the Mohicans, or maybe his Miami Vice work in the 80s, as his defining directorial moments. Regardless, for me the movie has attained an artistic status that elevates it beyond that of a motion picture and up to par with, I don’t know, the Sistine Chapel perhaps. Think hard before you label this as sacrilege.

Chances are that if you’ve heard about the movie, you know about the diner scene, where Al Pacino and Robert de Niro, playing career cop and criminal respectively, are pitted against each other, face to face, standing firm about who they are and what they’re looking to do. It’s a marvelous scene, and if you haven’t watched it, you should.

My favorite scene, however, happens before that one. Unsurprisingly, one cannot easily find a YouTube link to it. Backdrop: L.A. Homicide collectively decide on a night out with their wives. Vincent Hanna (Pacino) dances with his wife Justine Hanna (Diane Venora), both of them tipsy. At some point, Pacino gets paged (pagers, I know, right?) and his oversight is requested at a murder scene loosely connected to the main plot. “This better be earth-shattering”, he says.

A couple of hours later, Pacino arrives back in the dining area, now occupied by just Justine and another couple at a different table. Justine, obviously distraught, begins the following dialogue, which I will recite from memory, so excuse any minor discrepancies:


– I guess the earth shattered.

– So why didn’t you let Bosko take you home? (Bosko is another cop in the unit.)

– I didn’t want to ruin their night too!

– …

– So what happened?

– Honey, you don’t wanna know.

– I’d like to know what’s behind that grim look on your face!

– I don’t do that, you know that. Come on, let’s go.

– You never told me I was gonna be excluded.

– I told you when we first hooked up, honey, that you would have to share me with all the bad people and ugly events on this planet.

– And I bought into that sharing, because I love you. I love you fat, bald, money, no money, driving a bus, I don’t care. But you have got to be present, like a normal guy, some of the time. This is not sharing. This is leftovers.

– Oh, I see, so what I should do is come home and tell you: “Hey baby, guess what. I just walked out of a crime scene where this junkie asshole fried his baby in a microwave because it was crying too loud. So let me share that with you. And in sharing, we will somehow.. ummm.. cathartically dispel all of this heinous shit.” Right? Wrong.


 

This is your life. There’s a fire inside you, and it’s raging on and on day and night. It encodes what you want, what you’re looking at, and what you’re after. And the closer you get to it, the harder it burns, the harder your brain is telling you to quit and look for safety. Vince Hanna got into 3 marriages in order to lie to himself that he cares about things beyond his work. Neil McCauley (De Niro) attempts a serious relationship for the first time ever, but he knows the drill: “If you wanna be making moves on the street, never get attached to anything or anybody that you can’t walk out on thirty seconds flat after you spot the heat coming round the corner.” 

Maybe you have the opportunity to be with somebody you have feelings for, and one night you wake up, see them occupying the other side of the bed and go screw a stranger in the local dive bar.

Maybe you’re close to having the job of your dreams but you cower out and stay in your current job because it affords you safety.

Maybe you’re not asking for that hot girl’s number because you fear what will happen if she agrees to it.

Maybe you’re in a one-year long relationship you knew had practically ended a month into it.

Maybe you’re close to being one year sober, and because humanity has agreed to the Westernized division of 365 days per annum, you get shitfaced on the 364th night.

Maybe, maybe, maybe.

It doesn’t really matter. If you don’t do it, if you don’t try your utmost to touch that fire, you will always regret it. And that’s a slow death that is far worse than anything I can imagine.

 

As new data arrives, the covariance matrix takes notice.

The problem

I recently read a paper on distributed multivariate linear regression. This paper essentially deals with the problem of when to update the global multivariate linear regression model in a distributed system, where the observations available to the system arrive at different computer nodes, at different times and, usually, at different rates. In the monolithic, single-node case, the problem has of course been solved in closed form: for a vector of dependent variables y and a design matrix X with examples in rows, the parameter vector β can be found as per:

LaTeX: \hat{\beta} = (X^TX)^{-1}X^Ty
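
In MATLAB terms, a minimal sketch of that closed-form (ordinary least squares) solution looks like the following; this is just the textbook formula on synthetic data, not the paper’s distributed algorithm, and all the variable names are my own:

% Closed-form (normal equations) fit for a design matrix X with examples
% in rows and a vector of dependent variables y.
N = 200;
X = [ones(N, 1), rand(N, 2)];      % intercept column plus two features
beta_true = [1; 2; -3];
y = X * beta_true + 0.1 * randn(N, 1);
beta_hat = (X' * X) \ (X' * y);    % solves (X'X) beta = X'y
disp(beta_hat');                   % should be close to beta_true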

This is a good paper, and anybody with an interest in distributed systems and / or  linear algebra should probably read it. One of the interesting things (for me) was the authors’ explanation that, as more data arrives at the distributed nodes, a certain constraint on the spectral norm of a matrix product that contains information about a node’s data becomes harder to satisfy. It was not clear to me why this was the case and, in the process of convincing myself, I discovered something that is probably obvious to everybody else in the world, yet I still opted to make a blog post about it, because why the hell not.

When designing any data sensor, it is reasonable to assume that the incoming multivariate data tuples will have a non-trivial covariance structure. For example, in the case of two-dimensional data, it is reasonable to assume that the incoming data points will not all lie on a straight line (which would denote full inter-dimensional correlation in the two-dimensional case). In fact, it is tempting to assume that as more data tuples arrive, the covariance of the entire dataset tends to increase. We will examine this assumption again in this text, and we will see that it does not always hold water.

This hypothesized increase in the data’s covariance can be mathematically captured by the spectral (or “operator”) norm of the data’s covariance matrix. For symmetric matrices, such as the covariance matrix, the spectral norm is equal to the largest absolute eigenvalue of the matrix. If a matrix is viewed as a linear operator in multi-dimensional Cartesian space, its largest absolute eigenvalue tells us how much the matrix can “stretch” a vector in the space. So it gives us a sense of how “big” the matrix is, hence its incorporation into a norm formulation.
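
As a quick sanity check (a toy example of my own, not anything from the paper), MATLAB agrees that for a symmetric matrix the spectral norm and the largest absolute eigenvalue coincide, and that the matrix stretches any unit vector by at most that factor:

A = [2, 1; 1, 3];                   % a symmetric matrix
spec = norm(A);                     % spectral norm = largest singular value
lam = max(abs(eig(A)));             % largest absolute eigenvalue
fprintf('spectral norm = %.4f, largest |eigenvalue| = %.4f\n', spec, lam);
u = randn(2, 1); u = u / norm(u);   % a random unit vector
fprintf('stretch factor = %.4f (at most %.4f)\n', norm(A * u), spec);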

The math

We will now give a mathematical intuition about how the incorporation of new data at a sensor can increase the spectral norm of its covariance matrix or, as we now know, its dominant eigenvalue. For simplicity, let us assume that the data is mean-centered, so that we don’t need to complicate the mathematical presentation with mean subtraction; we will also write the covariance matrix simply as X'X, ignoring the 1/(N-1) normalization, which does not affect the argument. Let λ be the covariance matrix’s dominant eigenvalue and u a unit-norm eigenvector in the respective eigenspace. Then, from the relationship between eigenvalues and eigenvectors, we obtain:

LaTeX: (X^TX)\,u = \lambda u
LaTeX: \lambda = ||\lambda u||_2 = ||(X^TX)\,u||_2

with the second line being a result of the fact that u is assumed to have unit norm (and that λ ≥ 0, since the covariance matrix is positive semi-definite). It is therefore obvious that, in order to gauge how the value of λ varies, we must examine the 2-norm (Euclidean norm) of the vector appearing on the right-hand side of the final equals sign.

Let’s try unwrapping the product that makes up this vector, writing x_{ij} for the j-th coordinate of the i-th data point (N points, d dimensions):

LaTeX: X^TXu = \begin{pmatrix} \sum_{j=1}^{d} u_j \sum_{i=1}^{N} x_{i1}x_{ij} \\ \sum_{j=1}^{d} u_j \sum_{i=1}^{N} x_{i2}x_{ij} \\ \vdots \\ \sum_{j=1}^{d} u_j \sum_{i=1}^{N} x_{id}x_{ij} \end{pmatrix}

Now, let us focus on the first element of this vector. If we unwrap it we obtain:

LaTeX: (X^TXu)_1 = u_1\sum_{i=1}^{N} x_{i1}^2 + \left(\color{red}{u_2\sum_{i=1}^{N} x_{i1}x_{i2} + \dots + u_d\sum_{i=1}^{N} x_{i1}x_{id}}\right)

The red factors really let us know what’s going on here, since the summations in the parentheses involve precisely the entries of the covariance matrix that lie beyond the main diagonal. For data whose dimensions are perfectly uncorrelated, those off-diagonal entries are all zero; at the other extreme, when every dimension is correlated with every other, they are all non-zero. As more data arrives, these off-diagonal sums, as well as the diagonal variances, get pushed around, and the dominant eigenvalue moves with them. In the MATLAB example below, noisy points spread the cloud away from the line y = 2x, inflating the second dimension’s variance and pushing the spectral norm up. On the other hand, if new data hugs the existing structure more tightly (higher inter-dimensional correlation, less extra spread), the covariance matrix’s spectral norm can actually decrease!

The second vector element deals with the correlation between the second dimension and the rest, and so on and so forth. Therefore, the larger the values of these elements, the larger the value of the 2-norm || X’ X u ||  is going to be and vice versa.

Some code

We can demonstrate all this in practice with some MATLAB code. The following function will generate some random data for us:

function X = gen_data(N, s)
%GEN_DATA Generate random two-dimensional data.
% N: Number of samples to generate.
% s: Standard deviation of Gaussian noise to add to the y dimension.
x = rand(N, 1);
y = 2 * x + s.* randn(N, 1); % Adding Gaussian noise
X = [x,y];
end

This function will compute the covariance matrix of the input data and return the square of its spectral norm (a monotone function of the spectral norm itself, so comparisons across datasets are unaffected):

function sn = cov_spec_norm(X)
% COV_SPEC_NORM: Estimate the (squared) spectral norm of the covariance
% matrix of the data matrix given.
%   X: An N x 2 matrix of N 2-dimensional points.
% For the symmetric, positive semi-definite covariance matrix, the largest
% singular value S(1,1) equals its largest eigenvalue, i.e. its spectral
% norm; we report its square, which preserves all comparisons.

COV = cov(X);
[~, S, ~] = svd(COV);
sn = S(1,1).^2;
end

Then we can use the following top-level script to create some initial, perfectly correlated data, plot it, estimate the covariance matrix’s spectral norm, and then examine what happens as we add chunks of data, with increasing amounts of Gaussian noise:

% A vector of line specifications useful for plotting stuff later
% in the script.
linespecs = cell(4, 1);
linespecs{1} = 'rx';linespecs{2} = 'g^';
linespecs{3} = 'kd'; linespecs{4} = 'mo';

% Begin with a sample of 300 points perfectly
% lined up...
X = gen_data(300, 0);
figure;
plot(X(:, 1), X(:, 2), 'b.');  title('Data points'); hold on;
norm = cov_spec_norm(X);
fprintf('Spectral norm of covariance = %.3f.\n', norm)

% And now start adding batches of 50 noisy points.
for i =1:4
    Y = gen_data(50, i / 5); % Adding Gaussian noise 
    plot(Y(:,1), Y(:, 2), linespecs{i}); hold on;
    norm = cov_spec_norm([X;Y]);
    fprintf('Spectral norm of covariance = %.3f.\n', norm);
    X = [X;Y]; % To maintain current data matrix
end
hold off;

(Note that in the script above, every new batch of data gets an increased amount of noise, as can be seen in the call to gen_data.)

One output of this script is:

>> plot_norms
Spectral norm of covariance = 0.191.
Spectral norm of covariance = 0.200.
Spectral norm of covariance = 0.200.
Spectral norm of covariance = 0.220.
Spectral norm of covariance = 0.275.

A plot of our data.

Interestingly, in this example, the spectral norm did not change after the incorporation of the second noisy batch. Can it ever be the case that we see a decrease of the spectral norm? Of course! We already said that the covariance entries involved in the sums above, both on and beyond the main diagonal, can shrink when the incoming data is more tightly concentrated around the existing structure than the data already there. Therefore, in the following run, the first noisy batch happened to tighten the overall cloud slightly, leading to a (marginally) smaller amount of covariance (informally speaking).

>> plot_norms
Spectral norm of covariance = 0.177.
Spectral norm of covariance = 0.174.
Spectral norm of covariance = 0.179.
Spectral norm of covariance = 0.220.
Spectral norm of covariance = 0.248.

Another data plot.

Discussion

The intuition is clear: as new data arrives at a node, observing the fluctuation of the spectral norm of its covariance matrix can tell us some things about how “noisy” our data is, where “noisiness” in this context is defined as “covariance”. I guess the question to ask here is what to expect of one’s data. If we run a sensor long enough without throwing away archival data vectors, we cannot expect the spectral norm to keep increasing (at least not by a significant margin); rather, we should expect a sort of “saturation” of the spectral norm around a limiting value. This can be empirically shown by a modification of our top-level script, which runs for 50 iterations (instead of 4) but generates batches of data with standard Gaussian noise, i.e. the noise does not increase with every new batch:

% Begin with a sample of 300 points perfectly
% lined up...
X = gen_data(300, 0);
norm = cov_spec_norm(X);
fprintf('Spectral norm of covariance = %.3f.\n', norm)

% And now start adding batches of 50 noisy points.
for i =1:50
    Y = gen_data(50, 1); % Adding Gaussian noise 
    norm = cov_spec_norm([X;Y]);
    fprintf('Spectral norm of covariance = %.3f.\n', norm);
    X = [X;Y]; % To maintain current data matrix
end

Notice how the call to gen_data now adds standard Gaussian noise by keeping the standard deviation fixed at 1. One output of this script is the following:

>> toTheLimit
Spectral norm of covariance = 0.203.
Spectral norm of covariance = 0.341.
Spectral norm of covariance = 0.388.
Spectral norm of covariance = 0.439.
Spectral norm of covariance = 0.535.
Spectral norm of covariance = 0.635.
Spectral norm of covariance = 0.677.
Spectral norm of covariance = 0.744.
Spectral norm of covariance = 0.818.
Spectral norm of covariance = 0.842.
Spectral norm of covariance = 0.881.
Spectral norm of covariance = 0.913.
Spectral norm of covariance = 0.985.
Spectral norm of covariance = 1.030.
Spectral norm of covariance = 1.031.
Spectral norm of covariance = 1.050.
Spectral norm of covariance = 1.097.
Spectral norm of covariance = 1.148.
Spectral norm of covariance = 1.154.
Spectral norm of covariance = 1.186.
Spectral norm of covariance = 1.199.
Spectral norm of covariance = 1.280.
Spectral norm of covariance = 1.318.
Spectral norm of covariance = 1.323.
Spectral norm of covariance = 1.325.
Spectral norm of covariance = 1.344.
Spectral norm of covariance = 1.346.
Spectral norm of covariance = 1.373.
Spectral norm of covariance = 1.397.
Spectral norm of covariance = 1.447.
Spectral norm of covariance = 1.436.
Spectral norm of covariance = 1.466.
Spectral norm of covariance = 1.466.
Spectral norm of covariance = 1.482.
Spectral norm of covariance = 1.500.
Spectral norm of covariance = 1.498.
Spectral norm of covariance = 1.513.
Spectral norm of covariance = 1.518.
Spectral norm of covariance = 1.518.
Spectral norm of covariance = 1.499.
Spectral norm of covariance = 1.492.

It’s not hard to see that after a while the value of the spectral norm tends to fluctuate around 1.5. Under the given noise model (Gaussian noise with standard deviation = 1), we cannot expect any major surprises. Therefore, if we were to keep a sliding window over our incoming data chunks, and (perhaps asynchronously!) estimate the standard deviation of the spectral norm’s values, we could maybe estimate time intervals during which we received a lot of noisy data, and act accordingly, based on our system specifications.
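
Here is a rough sketch of that last idea; the window length, the threshold, and the flagging rule are all my own arbitrary choices, and the snippet simply reuses the gen_data and cov_spec_norm functions from above:

% A minimal sliding-window monitor over the spectral norm readings.
W = 10;                    % window length, in batches
K = 3;                     % flag deviations beyond K standard deviations
X = gen_data(300, 0);
window = [];               % the W most recent spectral norm readings
for i = 1:50
    Y = gen_data(50, 1);
    X = [X; Y];
    sn = cov_spec_norm(X);
    % Compare the new reading against the last W readings.
    if numel(window) == W && abs(sn - mean(window)) > K * std(window)
        fprintf('Batch %d looks unusually noisy (value = %.3f).\n', i, sn);
    end
    window = [window, sn]; % append the newest reading
    if numel(window) > W
        window = window(2:end);   % drop the oldest reading: slide forward
    end
end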