Tuesday, October 21, 2014

The Dangers of Machine Learning

Analytical approaches to mining big datasets are just tools in the hands of the people implementing them.  Sometimes they make things easy, other times they are misused and lead to poor results.  And still other times an approach takes on a life of its own and appears to be a panacea for any big data challenge.  Machine learning is one of those mystified tools that often falls into this last category, and without a fundamental understanding of the underlying structural relationships, it can be costly and ineffective.
The issue is tied to the same principle behind causality versus correlation.  It is a basic axiom that correlation does not imply causation; just because the crime rate in NYC has dropped over the last 10 years while the price of a candy bar has increased does not mean that higher candy prices cause less crime.  We know that because any underlying, structural reason for those two trends to be related is tenuous at best.
Yet, many people think they can just naively apply machine learning to data sets to mine for interesting results.  When left unconstrained, machine learning techniques will often find far stronger signals with nonsensical relationships than with truly causally connected events.  And even with refinement of the input to your models, i.e. your feature space, the results are still based on correlations.
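
To make that concrete, here is a minimal sketch on entirely made-up noise data (not our data or models) showing how an unconstrained search over many features will always "discover" a strong-looking correlation, even when nothing is causally related:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 50, 1000

X = rng.normal(size=(n_samples, n_features))  # 1,000 "features" of pure noise
y = rng.normal(size=n_samples)                # a "target" that is also pure noise

# Correlate every feature with the target and keep the strongest one
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
best = int(np.argmax(np.abs(corrs)))

print(f"Feature {best} correlates with the target at r = {corrs[best]:.2f}")
# With this many candidates, |r| typically lands around 0.4-0.5 by chance alone.
```

None of those features means anything, yet a naive search will happily report one of them as a strong signal.
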
Take it from an industry where analytics and quantitative analysis play a huge role – finance.  In 2008, when the industry was permeated with systematic investment players executing economic models based on machine learning techniques, something unexpected hit – a massive deleveraging in the market.  By and large, the winners in that situation were those who understood that structural dynamic and what it implied, not the players naively applying statistical models trained on the past few years.
What can we learn from this?  It is not that machine learning is bad, but that it is a tool, and in order to not be caught with an empty bank account you have to apply it alongside a fundamental understanding of the domain.  And if you are going to start somewhere, spending the time to carefully and logically construct a basic set of fundamental assumptions will often yield less risky performance at first.
Ultimately, extracting a signal from a noisy dataset takes a variety of approaches.  It takes inductive approaches like machine learning and deductive approaches like hypothesis testing.  Mastering one piece only means you have part of the solution, and if you rely on that alone you are likely setting yourself up for failure.

Saturday, June 22, 2013

Congratulations to Colorado AdTech company SpotXchange

Looks like Colorado's own SpotXchange has been recognized as part of the Ernst & Young Entrepreneur of the Year awards.

http://www.ey.com/US/en/Newsroom/News-releases/News_2013-Award-winners-in-the-Mountain-Desert-region
Technology Services: Mike Shehan, CEO & Co-Founder, and Steve Swoboda, COO, CFO & Co-Founder, SpotXchange, Inc. (Westminster, CO)


Congratulations to Mike and Steve and to the entire SpotXchange team. It takes a great leader to run a successful company, but it also takes an amazing, hard-working team backing the leadership. I have a few friends and former co-workers there whom I greatly respect, and I know the team there is top-notch.

Their online video ad exchange is excellent, and they have been one of the easiest partners to work with while building out our video product at The Trade Desk. As we expand our video rollout to Australia and other Asia-Pacific markets, we find SpotXchange has a strong presence there and look forward to growing together worldwide.

I, of course, have a special interest in Colorado AdTech companies, having opened the Boulder office for The Trade Desk and hired a team of engineers here. I've been amazed at the quality of AdTech in the Boulder-Denver area. We have everything from top-tier agencies, exchanges, data and analytics companies, DSPs, and search marketing companies, to representation from the likes of Google and Microsoft advertising teams. While it's not New York, this area's concentration of quality AdTech companies is impressive and growing.

Tuesday, June 11, 2013

Micro Data is more important than Big Data

The term "Big Data" seems to be all the rage these days. Everyone uses it in someway to describe what their software company does. I often goad friends that work at startups claiming to be Big Data companies because they are handling the twitter stream or they are running X thousand transactions per second. I tell them to give me a call when their company gets up to scale a bit more.

In the online advertising world, especially with programmatic buying of ads, we have to handle nearly a million requests per second (and that is just the start of our scaling), make decisions in 5ms, track everything that happens across the globe, and be able to report it to our clients within 5 minutes of it happening. With that kind of scale, our budgeting systems have to be accurate within seconds or you could overspend by thousands of dollars. When you start to find that Amazon and Rackspace cloud environments can't handle your systems due to network speeds, you may have achieved a decent level of scale (not kidding, we actually took down an entire Rackspace data center at one point).

So with the large number of transactions that we handle and the massive amount of data that we store and process every second, what is it that's important to us, and what do we care about with our "Big Data"? There's not really much you can do with the mass of data as a whole, except maybe donate it to some research university to use in their studies, or keep it all somewhere and pay massive storage costs every month. No, it's not the Big Data that matters; what really matters is the "Micro Data" within that large data set, and without the large amount of data it's not really possible to find those micro data trends.

In online advertising, the more micro the trend is, the more valuable it is. If we know that every left-handed race car driver in eastern Iowa is guaranteed to buy your product, then you are going to be willing to pay a lot of money to show that one person an ad (we of course care about a somewhat larger audience than that one guy). If we can find millions of micro trends that are valuable to advertisers, then we can really help guide their advertising budgets and make really good decisions about how much to spend on any given ad request among the million requests we see each second.

I don't claim to have coined the term micro data. I first heard about it from a friend over beers one evening. He is a researcher at the University of Colorado and works in a research lab with Big Data in the name. One of the bodies of data they use in their research is US Census data, which contains massive numbers of data points on every household in the United States. He told me that there is nothing interesting about saying that the average annual salary of each household in the US is $X, or that the average family in the country has 1.8 children and 2.3 dogs. Those statistics are meaningless to all but politicians who want to use meaningless data for whatever purpose they need. Instead, these researchers look at the micro trends in the data to understand things better. It's much more meaningful to know the average or median income of a specific block in downtown Boulder, or the average number of people living in each household on a single block in South Boulder.
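
To make the distinction concrete, here is a toy sketch with made-up household records (not real census data and not our data); the macro and micro statistics differ only in what you group by:

```python
import pandas as pd

# Hypothetical household records -- illustrative only, not real census data
households = pd.DataFrame({
    "block":  ["downtown-12", "downtown-12", "south-03", "south-03", "south-03"],
    "income": [98000, 41000, 72000, 69000, 75000],
    "people": [1, 4, 2, 3, 2],
})

# The "Big Data" statistic: one overall average, not very actionable
print(households["income"].mean())

# The "Micro Data" statistics: per-block medians and household sizes
print(households.groupby("block").agg(
    median_income=("income", "median"),
    avg_household_size=("people", "mean"),
))
```

The same idea scales up: the value is in slicing the full dataset finely enough that each slice is meaningful while still containing enough observations to be trustworthy.
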

That type of micro data helps in the advertising world as well since that is how advertisers want to be able to control their spending on ad buys. We spend a huge amount of our time looking for and processing "Big Data" to discover the "Micro Data" within that is so much more interesting and valuable.

Thursday, May 5, 2011

AlterEgo 2-Factor Authentication

File this under the "Why didn't I think of that" category. MailChimp just released a super cool new product called AlterEgo that enables 2-factor authentication for web apps using an iPhone application.

If you've done anything with production systems, then you've probably used 2-factor auth and found it to be a pain, especially if you've had to add it to some software you're building. This product looks to bring a simplified version of 2-factor auth to web apps, providing a really nice additional layer of security to the apps that need it. I'd love to hear from anyone who has used this product, either as an end user or as an integrator.
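
I don't know how AlterEgo generates its codes under the hood, but as a generic illustration of the kind of second factor a service like this manages for you, here is a minimal TOTP sketch (RFC 6238) using only the Python standard library; the secret shown is a throwaway example:

```python
import base64, hashlib, hmac, struct, time

def totp(secret_b32: str, digits: int = 6, period: int = 30) -> str:
    """Compute the current time-based one-time password for a base32 secret."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int(time.time()) // period          # 30-second time step
    msg = struct.pack(">Q", counter)
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                    # dynamic truncation (RFC 4226)
    code = (struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
    return str(code).zfill(digits)

print(totp("JBSWY3DPEHPK3PXP"))  # compare against the code shown on the phone
```

The point of a product like AlterEgo is that you get this second factor without having to build, distribute, and support any of it yourself.
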

Monday, April 4, 2011

Centripetal Product Review of Devolutions Remote Desktop Manager

This is a continuation of a series of product reviews of the products that make our cloud infrastructure work and make it workable. The product we'd like to show off today is one of those products that has made managing a large server farm a whole lot less painful.

Our deployment consists of numerous dedicated and cloud-based servers in Rackspace datacenters in Texas, Chicago, and London, as well as Amazon cloud datacenters in California, Virginia, Ireland, and Singapore. We run Microsoft Windows based servers as well as Red Hat and Gentoo based Linux systems. Managing this type of infrastructure has always been left up to Ops teams in my past jobs, teams with lots of people and suites of pricey tools at their disposal. But as a small startup we needed a way for a couple of key members of our team to access all our servers from one place in an easy fashion. Devolutions Remote Desktop Manager has been a lifesaver for us. It is a simple tool that organizes all of our remote servers in one place, with built-in configuration to work across all of them with ease.

Among all of its features, the ones that have come to be the most useful to me are the ability to manage all of our systems, regardless of OS or location, from one console; the central password store integrated into the tool; and the embedded, tabbed view of multiple remote sessions at once.

Manage Everything


One of the beautiful things about running a startup in this day and age is the vast set of resources available through cloud-based deployments. A small startup company like ours has the ability to deploy hundreds of servers across all geographic locations in the world with the click of a mouse. Managing these can all be done from the comfort of our offices overlooking the beach in Southern California. However, managing this many servers can get to be quite a burden, so any tool that can ease that burden is a blessing. With a disparate set of systems in our deployment we needed a tool that would allow us to view everything together in one place, and Remote Desktop Manager has that ability. The Dashboard view within the tool allows me to look at all of my servers. I organize them by data center and region and then stick all of the Windows servers and Linux servers in there together. Opening a remote session is as easy as double-clicking on a machine. The configuration of each machine is where I set up what that session is: Windows RDP, Linux SSH console, or whatever else. This ability is rarer than one might think in this type of tool.




Central Credential Store

With this many servers, managing passwords is a serious pain. In addition, as a small company with just a couple of people who know how to access most things, we needed a way for that knowledge to be accessible in case something happens to key people. Remote Desktop Manager has the ability to store credentials for all types of remote sessions. This can be managed centrally by our one full-time operations person while still allowing access to the others who need it occasionally. There seem to be a lot of different options for storing these credentials, ranging from higher-security choices to things as simple as a centrally accessed Access database, SQL Server, XML file, or even Dropbox storage.



Tabbed View

One of my favorite features of Remote Desktop Manager is the ability to have multiple sessions open at once and view them all within the RDM program in a tabbed view. This makes opening a bunch of servers and going back and forth between them a breeze. The name of the server appears in the tab at the top, as opposed to just opening multiple RDP sessions where I have no idea what is what. This is a godsend when I need to compare config files across servers, monitor event logs, or install a new release across multiple servers at once. I can still open the sessions up in full screen if I need to, but I don't find myself doing that a whole lot lately.


Summary

All in all, Devolutions Remote Desktop Manager has been a great tool. The only thing I have found myself wanting is an automated way to reach out to my Amazon or Rackspace account to pull in new servers, as opposed to manually entering remote connection information, but that is a minor issue compared to the pain that this tool has removed from my day-to-day life of jumping around our remote, distributed cloud infrastructure. Thanks to the guys over at Devolutions for an awesome product.
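
For what it's worth, the kind of automation I have in mind would look roughly like the sketch below, which lists running instances via the Amazon API so their hostnames can be added to RDM by hand; the Python boto3 library, region, and tag names here are illustrative, not something RDM provides:

```python
import boto3

# List running EC2 instances so their connection details can be copied
# into Remote Desktop Manager (manually, for now).
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
        print(tags.get("Name", "unnamed"), instance.get("PublicDnsName", ""))
```
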

Tuesday, March 22, 2011

Centripetal Product Review of CloudBerry S3 Explorer Pro

Part of building, running, and maintaining a Software as a Service application like ours is having great tools at your disposal to make things easier. I wanted to highlight one of those tools that I've become particularly dependent on in my day-to-day work for keeping things running smoothly. The tool is CloudBerry S3 Explorer Pro by CloudBerry Labs.

We manage our complete backend logging, monitoring, and reporting systems with Amazon S3 at the core. There are lots of tools we run on top of that, from custom-built applications to Hadoop and everything in between. We currently have tens of millions of individual files ranging in size from 100 bytes to over 1 GB. In Amazon's S3 environment we manage over 100 buckets spanning all Amazon worldwide regions, each containing complex directory structures that define content and date partitioning of the data in the files. We also run a lot of different portions of our applications through the Amazon CDN, which is seamlessly integrated into S3. With all the applications and code we have dedicated to S3-specific functionality, you'd think that we would never have a need to actually look at the raw S3 structure and browse around for files, but it has become a daily need to go in for one reason or another. When I need to go directly to S3 for something, CloudBerry S3 Explorer Pro has been my tool of choice. It is indispensable when looking for individual files while debugging, doing copy or move jobs, or scripting more complex file jobs.

Debugging

A typical day finds me debugging an issue for one of our business team members. Many times I find that I need to go directly to the transaction logs of some server to determine exactly what happened. We store all of these in S3 in order to run transaction reports across them with Hadoop and other tools. CloudBerry Explorer has made the job so much easier because I can look at my different buckets through the lens of a typical filesystem that I am used to, browse around, download and open files with a click of the mouse, and even make quick updates when I need to. CloudBerry gives me a beautiful user interface for working with some of the more advanced features of S3 like ACLs, bucket policies, CloudFront distributions, external buckets, and more, and it makes these features much more accessible than the straight Amazon S3 API for quick tasks. If I had to work within the confines of the Amazon API, or even the Amazon web interface, for this kind of debugging, I would be entirely hamstrung and my life would be in shambles.

Copy or Move Jobs

Another thing that I find myself doing a lot is moving files around within S3. We use specific naming conventions for S3 files to denote the working state of a file. We also use different buckets in each Amazon region to reduce cross-datacenter chatter. But there are often times when I find I just need to copy or move a whole slew of files from one place to another. Recently I actually had the need to move over 1 million files between buckets. For this I use CloudBerry. Moving and copying is a drag-and-drop task within CloudBerry, and for some of those bigger jobs (like the million files) I can use up to 100 threads to get the job done more quickly. The ease of using CloudBerry for these types of tasks has made me a little too dependent on the tool; I've actually spun up Amazon EC2 instances just to run CloudBerry on for large copy/move jobs and then torn them down afterward. That gave me more CPU power as well as even more threads working on my job.
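
If you wanted to do the same kind of bulk, multithreaded copy without a GUI, the idea looks roughly like the sketch below, which uses Python with the boto3 library and a thread pool; the bucket names, prefix, and thread count are made up, and this is not how CloudBerry itself is implemented:

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
SRC, DST, PREFIX = "my-source-bucket", "my-dest-bucket", "logs/2011/03/"  # illustrative names

def copy_one(key: str) -> None:
    # Server-side copy within S3; the object data never passes through this machine
    s3.copy_object(Bucket=DST, Key=key, CopySource={"Bucket": SRC, "Key": key})

# Page through every key under the prefix in the source bucket
paginator = s3.get_paginator("list_objects_v2")
keys = [obj["Key"]
        for page in paginator.paginate(Bucket=SRC, Prefix=PREFIX)
        for obj in page.get("Contents", [])]

# Fan the copies out over many threads, in the spirit of CloudBerry's 100-thread jobs
with ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(copy_one, keys))
```
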

Scripting 

One thing that we've recently discovered is that everything available in the CloudBerry S3 Explorer Pro version is also available as Windows PowerShell snap-ins. We've utilized these extensively to script out tasks that we find ourselves doing over and over again. While we have our own tools that use the Amazon S3 API to interact with S3 from within our applications, I've found that the CloudBerry PowerShell snap-ins are more reliable and much easier to use thanks to the scripting capabilities of PowerShell. Now each time I find myself doing something in CloudBerry, I ask whether it is something that I should script out for future use. Oftentimes I find that a few minutes spent adding new capabilities to my script toolbox using these snap-ins ends up saving countless hours down the road.

Summary

If you're using Amazon S3 for anything in your business, you need to go out and get a license for CloudBerry S3 Explorer Pro. This is one of the most useful tools that I have found; not only do I use it daily, but I bought licenses for everyone on our engineering team and they all use it pretty much daily as well. CloudBerry also makes tools for many of the other cloud-based storage solutions. Pretty cool. Thanks a lot to the guys over at CloudBerry Labs for such a great tool!


Mike Davis

Tuesday, January 25, 2011

Review All Your Invoices

The blog has been getting a little neglected lately, but we're still working on the products. We just released a new feature that allows customers to access their past invoices. We always email you your invoice as soon as your credit card is charged, but now you are able to go into the system and view all of your past invoices. Simply log in to app.centripetalsoftware.com and click on the Account tab at the top. This will show you a summary list of all of your invoices.





From there you can click on any one of the invoices to access a detailed invoice that you can print out.





This is all done by accessing payment information that is stored in our Level 1 PCI DSS compliant payment processor's system, so all of your credit card information is secure and no information will be available that could compromise your security.

Saturday, August 28, 2010

Payment Information Updating

We’ve been working on a ton of new features and have released some of them, but the blog has been getting neglected. So I thought I’d throw up a tidbit of what we’re working on. We’ve been building out the admin section of the application to allow users to better manage the information about them and their account. The most recent bit we’re about to push out is a mechanism to update payment information. One of the downfalls of a recurring service that is automatically billed is that payment information gets out of date. It’s difficult to get people to update it, especially if there is no way to update it, so we’re working on making that a little bit easier.

 

[Screenshot: Payment Information]

Friday, July 16, 2010

Carbonite Review

Carbonite (found at carbonite.com) is a backup program for your computer's files, music, videos, and more. It boasts unlimited backup capacity, completely automatic retrieval, secure and encrypted file storage, and easy file recovery. The cost is $54.95/year, with a free trial that requires no credit card.

According to their website, “The current version of Carbonite is designed for Windows XP, Windows Vista, and Windows 7. Carbonite supports both the standard 32-bit and 64-bit versions of each. Carbonite will not support older versions of Windows (Windows 98, Windows 2000, and Windows ME) or Linux operating systems. Carbonite is also available for all Intel-based Macs running OS 10.4 (Tiger), 10.5 (Leopard), and 10.6 (Snow Leopard).”


Carbonite works by installing a small application on your computer that works in the background. There are no limits on the backup storage capacity, but Carbonite does warn that, because of DSL speeds, a larger backup will take considerably longer to upload to their server.

I signed up for the free trial and have started the process as I write this review. I am running it on an Intel-based MacBook running OS X 10.6.4. The installation process was simple and straightforward; however, I was not pleased to read, after I started the program, that the initial backup could take several days.

Because I’m running this on a laptop, my computer will not stay on continuously for that amount of time. I’m not sure how this will affect the usefulness of this program. It may be better suited for a desktop computer that can stay connected to the Internet continuously in awake or sleep mode.

Once you install Carbonite, you can just let it run its course. It works in the background when your computer is idle, backing up new and changed files. All files are encrypted twice, using the same security measures banks do, and the information stays secure, accessible only to you if you need to retrieve it.

The retrieval process seems simple as well. A few clicks and your important files are brought back to your computer. You have the option of selecting which files to back up, and it's important to check the preferences of the program to customize which files Carbonite backs up by default.

As a backup system, I believe Carbonite is worth looking into and considering. However, it depends on staying continuously connected to the Internet and is limited by DSL speeds. I think it would be a great option for desktop computers, and the Carbonite Pro packages for multiple computers are worth looking into for small businesses.

*It's been one hour since I installed Carbonite on my MacBook and my initial backup is still initializing; no files have been uploaded to the Carbonite system yet. Time will tell whether I will continue with Carbonite as my backup solution. I will update this post when (or if) my backup completes.

Kjaere Friestad
- guest blogger :: San Francisco, CA :: for Centripetal Software

Monday, July 5, 2010

1% for the Planet

We recently became a member company of 1% for the Planet. We did this because we care deeply about our planet and all that God created, and we want to ensure that our business does everything possible to better the place we live. 1% for the Planet exists to build and support an alliance of businesses financially committed to creating a healthy planet. What Centripetal Software is committing to is giving 1% of sales to one of the environmental organizations listed on the 1% for the Planet website. We will donate directly to the nonprofits, not through 1% FTP. We do everything we can to ensure that the way we run our business has minimal impact on our planet, and we are committed to helping organizations that are working to make this a better place to live.