
The technology of billing - how we do it at Badoo


There are many ways to monetize your project, but all of them have one thing in common – the transfer of money from the user to a company account. In this article we will discuss how this process works at Badoo.

What do we mean by ‘billing’?

Billing for us concerns all things related to the transfer of money: for example, pricing, payment pages and payment processing, the rendering of services and promo campaigns, as well as the monitoring of all of these things.

In the beginning, as with most startups, we had no paid services at all. The first steps towards monetization took place in 2008 (well after the official site launch in 2006). We selected France as our guinea pig, and the only payment method available at the time worked via SMS. For payment processing we used the file system: each incoming request was put into a file and moved between directories by bash scripts, so its status changed as it was processed. A database was used only for registering successful transactions. This worked pretty well for us, but after a year the system became difficult to maintain and we decided to switch to using just a database.

This new system had to be built quickly, because until then we had been accepting payments in only a limited number of countries. But it had one weak point: it was designed solely for SMS payments. To this day we still have some odd leftovers of that design in our database structure, such as the fields MSISDN (mobile phone number) and short code (the short number for premium SMS) in the table of successfully processed payments.

Now we receive payments from countries all over the world. At any given second at least a few users are trying to buy something on Badoo or through our mobile applications. Their locations are represented in this “Earth at Night” visual:

[Image: "Earth at Night" map of payment locations]

We accept payments using more than 50 payment methods. The most popular are credit card, SMS and direct billing, and purchases via the App Store and Google Play.

[Image: breakdown of payment methods]

Among them you can find such leftfield payment options as IP billing (direct payments from your internet provider account) and landline payments (where you call from your landline to confirm the payment). Once we even received a payment via regular mail!

[Image: a payment we received by regular mail]

Credit card and bank payments

All payment systems have an API and work by accepting payments from their users. Such direct integrations work well if you have only a few of them and everything runs smoothly, but working with local payment systems quickly becomes a problem. It gets harder and harder to support a lot of different APIs, for several reasons: local laws and regulations differ, a popular local payment provider may refuse to work with foreign clients, and even signing a contract can draw out the process substantially. Despite the complexity of local payment methods, though, adopting many of them has proven to be quite a profitable decision. An example of this is the Netherlands, which had not previously been a strong market for us. After we enabled a local payment system named iDEAL, however, we started to take in 30-40% more profit.

Where there is demand, there is usually someone ready to meet it. Many companies known as ‘payment gateways’ work as aggregators, unifying popular payment systems – including country-specific ones – under one single API. With such companies it suffices to perform an integration only once, and after that you get access to many different payment systems around the world. Some of them even provide a fully customizable payment page where you can upload your own CSS and JS files and change images, texts and translations. You can make this page look like part of your site and even host it on a subdomain such as “payments.example.com”. Even tech-savvy users might not realise that they have just made a payment on a third-party site.

Which is better to use, direct integration or payment gateways? First of all it depends on the specific requirements of the business. In our company we use both: we work with many different payment gateways and sometimes make direct integrations with payment systems. Another important factor in this decision is the quality of service provided by a payment system. Payment gateways often offer more convenient APIs, plus a more stable and higher-quality service than the source payment system.

SMS payments

SMS payments are very different from the other systems. In many countries, especially in Europe, they are under very strict control. Local regulators or governments can make demands regarding any aspect of SMS payments, for example specifying the exact text sent via SMS or the appearance of the payment page. You have to monitor changes and apply them in time. Sometimes the requirements can seem very strange: in Belgium, for example, you must show the short code in white on black with the price nearby. You can see how this looks on our site below.

[Image: the Belgian SMS payment page, with the short code shown in white on black next to the price]

There are also different types of SMS billing: MO (Mobile Originated) and MT (Mobile Terminated). MO billing is very easy to understand and implement: as soon as a user sends an SMS to our short number, we receive money. MT is a bit more complicated. The main difference is that the user’s funds are not deducted at the moment he or she sends the SMS, but when a message from us is received notifying them that they are being charged. With this method, we get the money only after receiving the delivery notification for this payment message.

The main goal of MT billing is to add an additional check on our side before the user sends money, preventing errors caused by user-misspelled SMS texts. With this method, the payment process consists of two phases: first the user initiates the payment, and second the payment is confirmed. Depending on the country, the payment process for MT billing follows one of these variants (a rough sketch of the PIN-based variant follows the list):

  • the user sends an SMS to our short number; we receive it and check that the text is correct, etc. We send a free message with custom text, which the user has to answer to confirm the payment. After that we send a message saying that they have been charged
  • same as above, but instead of responding directly to the free message, the user has to enter a PIN code from it on the Badoo site
  • the user enters their phone number on Badoo, and we send a free message with a PIN. The user then enters the PIN code on Badoo, and after checking it, we send the payment message
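To make the two-phase idea concrete, here is a minimal sketch of the PIN-based MT variant. It is in Python purely for illustration (our billing actually runs on PHP), and all the function names and storage are invented for the example.

# Minimal, illustrative sketch of the PIN-based MT-billing flow (not production code).
import random
import string

_pending = {}  # order_id -> PIN; stands in for real storage

def send_sms(msisdn, text):
    # Stand-in for the call to the SMS aggregator's API.
    print('SMS to %s: %s' % (msisdn, text))

def start_mt_payment(order_id, msisdn):
    # Phase 1: the user initiates the payment; we send a free message with a PIN.
    pin = ''.join(random.choice(string.digits) for _ in range(4))
    _pending[order_id] = pin
    send_sms(msisdn, 'Your payment PIN: %s' % pin)

def confirm_mt_payment(order_id, msisdn, entered_pin):
    # Phase 2: only after the PIN matches do we send the charging (MT) message.
    # The money counts as received once the aggregator reports delivery of it.
    if _pending.get(order_id) != entered_pin:
        return False
    send_sms(msisdn, 'You have been charged for your Badoo purchase.')
    return True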

For SMS payments we use only aggregators. Direct integrations with operators are not profitable, because you have to support a lot of contracts in many countries, which increasingly requires the involvement of accountants and lawyers.

Technical details

Badoo runs on PHP and MySQL, and we use the same technologies for payment processing. However, the billing application runs on separate pools of servers. These are divided into groups, such as servers that process incoming requests (payment pages, notifications from aggregators, etc.), servers for background scripts, database servers, and special groups with increased security where we process credit card payments. For card payments, servers must be compliant with PCI DSS, a security standard developed in coordination with Visa, MasterCard, American Express, JCB and Discover for companies who process or store cardholders' personal information. The list of requirements that have to be met is quite long.

As database servers we use two MySQL Percona servers working in master-master replication. All requests are processed via only one of them; the second is used for hot backup and other infrastructure duties, such as heavy analytical queries, monitoring queries and so forth.

The whole billing system can be divided into a few big parts:

  • Core - the base entities needed for payment processing such as Order, Payment and Subscription
  • Provider plugins - all provider-related functionality such as implementation of API and internal interfaces
  • Payment page - where you can choose a product and payment method

In order to integrate a new payment provider, we create a new plugin which is responsible for all communication between us and the payment gateway. Requests come in two types, depending on who initiates them: we do (pull requests) or the payment provider does (push requests). The most popular protocol for pull requests is HTTP, either on its own or as transport for JSON/XML. REST APIs, which have gained a certain degree of popularity recently, we haven't encountered very often; only new companies, or companies which have reworked their API recently, offer them - for example the new PayPal API, or the new payment system from the UK's GoCardless. The second most popular transport for pull requests is SOAP. For push requests mostly HTTP is used (either pure or as transport), and SOAP only rarely; the only company that comes readily to mind offering SOAP push notifications is the Russian payment system QIWI.
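As a rough illustration of what such a plugin boils down to, here is a hypothetical sketch in Python (not our actual PHP code); all class and method names are invented for the example.

# Illustrative sketch of a provider plugin interface (hypothetical names).
from abc import ABC, abstractmethod

class ProviderPlugin(ABC):
    @abstractmethod
    def start_payment(self, order):
        """Pull request: we call the gateway's API to initiate a payment."""

    @abstractmethod
    def handle_notification(self, http_request):
        """Push request: the gateway notifies us about a payment's status.
        Returns the body the gateway expects in response."""

class ExampleGatewayPlugin(ProviderPlugin):
    def start_payment(self, order):
        # e.g. POST JSON over HTTP to the gateway and remember its transaction id
        ...

    def handle_notification(self, http_request):
        # e.g. verify the signature, mark the order as paid, answer "OK"
        ...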

After the programming part is finished, the testing process begins. We test everything several times in different environments: the test environment, in a shot (an internal domain with only one particular task on top of the working production environment), in the build (a pre-production version of the code which is ready to go live) and in the live environment. For more details about release management at Badoo, see our blog: http://techblog.badoo.com/blog/2013/10/16/aida-badoos-journey-into-continuous-integration/.

Billing tasks have some peculiarities: we have to test not only our own code but also how it interacts with third-party systems. It's nice if the payment provider offers a sandbox which works the same as their production system, but if not, we create stubs for them. These stubs emulate a real aggregator system and allow us to do manual and automated testing. Below is an example of a stub for one of our SMS providers.

[Image: example stub for one of our SMS providers]
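A stand-alone stub can be as simple as a tiny HTTP server that always answers the way the real aggregator would. Here is a minimal, purely illustrative Python sketch; the real stubs of course mimic each provider's actual protocol.

# Minimal HTTP stub that pretends to be an SMS aggregator (illustrative only).
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeAggregator(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)        # the request our billing code sent
        print('received:', body)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'OK')               # what the real aggregator would answer

if __name__ == '__main__':
    HTTPServer(('localhost', 8080), FakeAggregator).serve_forever()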

After passing through the test environment we check how the integration works with the real system, i.e. by making real payments. For SMS payments we often need approval from local regulators, which can take a few months. We don't want to deploy semi-ready code to production, so as a solution we created a new type of environment: the external shot. This is our regular shot - a feature branch with one task - but accessible via an external sub-domain. For security reasons we create them only when needed. We send links to external shots to our partners and they can test the changes at any time. It's especially convenient when you work with partners in another hemisphere, where the time difference can be up to 12 hours!

Support and operation

After a new integration goes live, we enter the support and operation stage. Technical support occupies about 60-70% of our working time.


By support I mean primarily customer support. All the easy cases are solved by the first line of support. Our employees know many different languages and can translate and attend to customer complaints quickly, so only very complicated cases end up on the desks of our team of developers.

The second component of support is bug fixing and making changes to current integrations. Bugs appear for multiple reasons. Of course the majority are a result of human error, i.e. when something is implemented in the wrong way, but sometimes they result from unclear documentation. Once, for example, we had to use a Skype chat with a developer of a new payment system instead of documentation. At other times a payment provider makes changes on their side and forgets to notify us. One more point of failure is third-party systems: since a payment provider aggregates payment services, an error can occur not on their side but on their partner's side.

In order to solve such cases quickly we keep detailed logs. These contain all communications between us and the payment providers, all important events, errors during query processing and so on. Each query has its own unique identifier, through which we can find all the rows in the logs and reconstruct the steps of the query's execution. This is especially helpful when we have to investigate cases that happened weeks or months ago.
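The idea is simply to stamp every log line with the request's identifier, so that a single search reconstructs the whole story. A minimal sketch of the approach (illustrative Python, not our production PHP code):

# Illustrative sketch: tag every log line with the request's unique identifier.
import logging
import uuid

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
log = logging.getLogger('billing')

def process_payment_request(payload):
    request_id = uuid.uuid4().hex                    # one id per incoming request
    log.info('[%s] request received: %r', request_id, payload)
    try:
        # ... talk to the payment provider here, passing request_id along ...
        log.info('[%s] provider replied OK', request_id)
    except Exception:
        log.exception('[%s] provider call failed', request_id)
        raise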

So that’s how billing is organized at Badoo! There are still many interesting topics we plan to explore in future, such as monitoring, PCI DSS certification, and re-working bank-card payments. If you have questions or suggestions for future articles, please leave a comment for us below.


Parallel Calabash Testing on iOS


We want our users to have the same experience of Badoo on mobile regardless of which platform they use, so we have a growing battery of 450 Cucumber tests that we run against all platforms: Android, iOS and mobile web (we're working to add Windows). These 450 tests take between 8 and 12 hours to run, which makes for a very slow feedback loop - and that makes for grumpy developers and grumpy managers.

Although Calabash only supports a single thread of testing at a time, it’s easy enough to run with multiple Android devices using the Parallel Calabash project from GitHub: just plug in a few phones and/or simulators, and the tests complete in a fraction of the time. The more devices, the faster things go, turning those frowns upside-down.

Unfortunately, it’s long been understood that running multiple iOS tests on a single MacOS host is impossible: you can only run one simulator at a time, the ‘instruments’ utility can only control one simulator at a time, and because of this anyone automating iOS - like Calabash - has made reasonable assumptions that reinforce the situation.

A little knowledge is sometimes a useful thing

It was obvious that we could run these tests in parallel on one host by setting up virtual machines, but that’s quite a heavyweight solution. After some simple experiments (which I conducted because I didn’t know any better) it became clear it would be possible to run the tests on the same host as two users, each with a desktop: one normal log-in, the other via TightVNC (because Apple’s Screen Sharing doesn’t allow you to log into your own machine).

A little checking around on the web revealed that with Xcode 6.3, instruments quietly gained the ability to run multiple devices at once from the command line. We haven't noticed any announcement from Apple about it, and we've found that only Xcode 7 makes it reliable for us, but it's a welcome change all the same.

With this insight, I set about adding iOS support to the Android-based Parallel_Calabash project on GitHub - not the most elegant solution, but eminently practical.

The long and winding road

Initially it was optimistic tinkering: copying the application directory for each ssh user and tweaking the plists to set the Calabash endpoint to different ports. Then followed a substantial amount of trawling around inside Calabash-iOS to understand its internals and find out why only one simulator was resetting, why they were pretending to be real phones, and various other issues.

I worked out a trivial Calabash monkey-patch to resolve the resetting issue and puzzled through the assumptions causing the rest.

Then there was a similar amount of hair-pulling to arrange for the test user accounts to each have an active desktop. I initially started looking into TightVNC, planning to disable its pop-up dialogue and send automated keypresses to choose a user and type its password, but switched to Apple's Screen Sharing once I found that it could be supplied with a username and password directly (vnc://user:pass@host:port) and foxed into logging back into the local machine with an ssh tunnel (ssh -L 6900:localhost:5900 user@localhost).

I had an unexpectedly frustrating and drawn-out struggle to get the logging-in automated: despite having been told to log in as a particular user, Apple’s Screen Sharing insists on asking

[Image: Apple Screen Sharing's login confirmation prompt... ORLY!? YARLY!!]

There seems to be no way to forestall that question (such as adding ?login=1 to the VNC URL), so I ended up trying to run AppleScript from a shell script to respond, and then fighting with the UI Automation security system to authorise it - and losing. I ultimately rewrote the shell script entirely in AppleScript - an experience reminiscent of coding 1970s COBOL - and then fought with the UI Automation security system to authorise that instead. The result is misc/autostart_test_users, mentioned below.

Then I had to find out how to authorise that automatically on each testing host, and finally how to create all the test users automatically, without needing to confirm manually on the first login that - yes, these robots aren’t interested in iTunes accounts. You’ll find this last bit easily enough if you search the web for DidSeeSyncSetup, but you have to know to look for that! This is all coded into misc/setup_ios_host.

The final hurdle was that some of our tests do a little bit of screen-scraping with Sikuli to log in to Facebook. I discovered that we needed to use Java 1.7 (not 1.8) and set AWT_TOOLKIT=CToolkit to let a Java program invoked from an ssh context access the desktop.

So now we have a working system, and the maintainer of Parallel Calabash has incorporated the iOS solution into the main branch on GitHub.

Setting up

Each test machine has a main user, in our case the TeamCity agent user, and a set of test users. I’ve automated most of the set-up.

  1. On a Mac/Unix machine, run: git clone https://github.com/rajdeepv/parallel_calabash

  2. Now edit parallel_calabash/misc/example_ios_config if you need to - perhaps add a few more users to the USERS list (we're using 5 for sysctl hw.logicalcpu = 8). It's safe to run this script against a machine that has already been set up, if you want to change the settings.

  3. Make sure you can ssh to your main test user on the test machine (the test user must be an administrator account), then run: cd parallel_calabash/misc; ./setup_ios_host mainuser targethost example_ios_config. It should report success for each new QA user.

  4. Add misc/autostart_test_users.sh to your build script (or copy/paste it in).

  5. Then change your build’s invocation of calabash to use parallel_calabash as follows:

Parallel_calabash performs a dry run to get a list of tests that need to be run, and then runs the tests for real, so you will need to separate your calabash options into report-generation options (which aren’t used during the dry run) and the other options. For instance:

bundle exec parallel_calabash
  --app build/Applications/Badoo.app
  --ios_config /Users/qa/.parallel_calabash.iphonesimulator
  --simulator 'com.apple.CoreSimulator.SimDeviceType.iPhone-6 com.apple.CoreSimulator.SimRuntime.iOS-8-3'
  --cucumber_opts '-p ios_badoo IOS_SDK=iphonesimulator'
  --cucumber_reports '--format pretty -o build/reports -p parallel_cucumber_reports'
  --group-by-scenarios
  --device_target 'iPhone 6 (8.3)'
  --device_endpoint 'http://localhost:37265'
  features/

Briefly:

  • --app puts parallel_calabash in iOS mode, and tells it where to find the app build.
  • --ios_config says which config to use - we select between simulator or device configs.
  • --simulator says which simulator to clone for parallel users
  • --cucumber_opts gives the options for use with dry_run to get a list of tests
  • --cucumber_reports gives the options for use with the actual test-runs
  • --group-by-scenarios says to share tests equally between the parallel users
  • --device_target and --device_endpoint give the default simulator to use if the configuration doesn't specify test users (in which case the main user will run all the tests by itself, on its own desktop, as if you were using calabash-ios directly)
  • features/ is the standard general test specification.

Next

There are a few developments I’m planning to implement:

  • Allowing parallel_calabash to control users on additional machines, for even more parallelism.
  • Feeding past test timings to parallel_calabash so it can try to arrange for all users to finish at the same time - less idle time means faster test results. (Ideally, all this parallelism would happen within Cucumber, with Cucumber's threads requesting a test from a central list as each one completes its current test, but that's complex.)
  • Using TightVNC instead of Apple's Screen Sharing to enable video capture of each test: if a secondary user tells QuickTime to record, it actually records a garbled version of the primary desktop, so I would arrange for TightVNC to encode its own video stream. Currently, we enable video capture only during a re-run of fewer than 40 failures, which is forced into non-parallel mode by using --device_filter=something-matching-no-device.
  • While I’m messing about with VNC, I might also investigate a Sikuli-like screen scraping scripting mechanism - since I’ll need to arrange something like that to log in anyway.

Badoo Coding Challenge


Welcome to our first edition of the Badoo Coding Challenge! Below, you will find all the relevant information about the challenge. We hope participants will enjoy solving the problems as much as we enjoyed designing them.


Winners and prizes

  • Finalists - There will be 25 finalists, and each one will receive a goodie bag.
  • Winners - Out of the 25 finalists, 5 winners will be chosen. Each winner will get £1,000 and the chance of a one-day visit to Badoo HQ in London.
  • Overall winner - Out of the 5 winners, one will be declared the overall winner. On top of the "Winners" prize, they will get a handset of their choice worth up to £700.

How does it work?

Challenge problems


For each problem you will have to submit a solution, which can be written in any language you choose. Given a certain input that we will provide, your solution must generate the correct output. Each problem has a description and an example of an input and output.

Getting to work: each problem has two phases, the test phase and the final submission:

  1. The test phase will help you check whether your code works correctly with a simple input. You will be able to download a test input, run it through your solution and upload your output. We will let you know whether this output is correct or not; if it's not, you should change your code until it is. You can repeat this step as many times as you want, and uploading a correct output is a mandatory step to progress to the next phase. The output file should be a plain text file.

  2. In the final submission you will download a more complex test input. You should then run it through your solution and submit both the output and the solution. Note that, unlike in the test phase, you will not be notified whether the output is correct.

You will be able to proceed to the next problem only if you have successfully completed the two phases of the current problem.


Let’s practice…

The Euclidean algorithm

The Euclidean algorithm, or Euclid’s algorithm, is an efficient method for computing the greatest common divisor (GCD) of two numbers, the largest number that divides both of them without leaving a remainder. It is named after the ancient Greek mathematician Euclid, who first described it in Euclid’s Elements (c. 300 BC). It is an example of an algorithm, a step-by-step procedure for performing a calculation according to well-defined rules, and is one of the oldest numerical algorithms in common use. It can be used to reduce fractions to their simplest form, and is a part of many other number-theoretic and cryptographic calculations. [source: Wikipedia]

Can you write a program that implements it?

Input

The first line will contain the number of test cases, T. T blocks will follow, each consisting of one line containing three integers:

  • A, the first integer.
  • B, the second integer.
  • G, a tentative greatest common divisor (GCD) of A and B.

Output

The program should output one line per test case: OK if the provided GCD is right, or the real GCD if it's different.

Limits

T <= 10

A, B, G <= 100

A, B, G > 0

Sample input

3
54 32 2
33 55 5
42 7 7

Sample output

OK
11
OK
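For reference, here is one possible solution sketch in Python (remember, any language is allowed); it reads the input format described above and reproduces the sample output:

# Reads T test cases of "A B G" and prints OK or the real GCD.
import sys
from math import gcd

def main():
    data = list(map(int, sys.stdin.read().split()))
    t = data[0]
    for i in range(t):
        a, b, g = data[1 + 3 * i:4 + 3 * i]
        real = gcd(a, b)
        print('OK' if real == g else real)

if __name__ == '__main__':
    main()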

There’s still time for you to register and create your profile. Don’t miss your chance - we hope to see you during the challenge!

Badoo Tech Team

How we made Chatto


Our chat code was old, having evolved over the years into a massive view controller full of weird fixes that nobody could understand. It was difficult to add new types of messages, and new bugs were easily introduced. So we decided to rewrite it from scratch in Swift and make it open source.

We started with two goals in mind:

  • Scalable architecture: new types of messages should be easy to add and shouldn’t affect existing code.
  • Good performance: we wanted super smooth loading of messages and scrolling.

I will mainly focus on implementation details, what approaches we took and how we ended up with the final result. There’s a fairly good overview of the architecture on our GitHub page.

UICollectionView vs UITableView

Our old chat used UITableView. There's nothing wrong with that, but UICollectionView offers a richer API with more possibilities for customisation (animations, UIDynamics, …) and optimisation (custom layouts, invalidation contexts).

Not only that, but when we researched some existing chat applications, all of them were using UICollectionView. So going with UICollectionView was a no-brainer.

Text messages

No chat exists without text bubbles. In fact, they’re the most challenging messages to implement in terms of performance, because rendering and sizing text is slow. We wanted to have link detection with native actions, like iMessage does.

UITextView offered all of this out of the box, without the need to write a single line of code to handle link interaction. That's why we chose it - a decision which turned out to be a painful one, as you'll see.

Auto Layout & Self-Sizing cells

Layout and size calculation have always been a source of problems: it's really easy to end up with duplicated code that hinders maintainability and causes bugs, so we wanted to avoid this. Since we were supporting iOS 8 onwards, we decided to try Auto Layout and self-sizing cells. Here's a branch with a rough implementation of this approach. We faced two big issues:

  • Jumps when scrolling up: the flow layout calls preferredLayoutAttributesFittingAttributes(_:) on the UICollectionViewCell in order to get the real size. As this doesn't match the estimated size, it then adjusts the position of the already-displayed cells, making them jump down. We could have worked around this by inverting both the collection view and the cells with a 180-degree transform, but there was another issue, which was…
  • Poor scrolling performance: we didn't get 60 fps when scrolling down, even with the real sizes already calculated. The bottleneck was the Auto Layout engine and the UITextView sizing. We weren't really surprised by this, as we knew that Apple doesn't use Auto Layout in iMessage cells. I should point out here that I'm not saying Auto Layout shouldn't be used; in fact, we use it extensively at Badoo. However, it does come with a performance hit that typically affects collection/table views.

Manual layout

So we went with the traditional layout approach instead: the classical method of using a dummy cell to calculate the sizes, reusing as much code as possible between sizing and layout. This performed much better, but still not well enough for an iPhone 4s. Profiling showed that too much work was being done in layoutSubviews.

Basically, we were doing the same work twice: once to calculate the size, and again right before moving a cell onto the screen, in layoutSubviews. Why not cache those UITextView sizeThatFits(_:) results that were so costly to compute? We went a bit further than that: we created a layout model for the cell, in which the size and all the frames of the subviews were calculated, and cached it. As a result, not only was scrolling performance noticeably improved, but we also achieved perfect code reuse between sizeThatFits(_:) and layoutSubviews.

Apart from that, we noticed another method in the heaviest stack trace: updateViews. This was a central (but small) method responsible for updating the views according to the specified style and the data in question. Having such a central method was good for reasoning about the flow and for maintainability, but it was being triggered by almost every cell property setter. We came up with two optimisations to alleviate this problem:

  • Two different contexts: .Normal and .Sizing. We used .Sizing for our dummy sizing cell so we could skip unnecessary updates, like updating the bubble image, or disable link detection in the UITextView.
  • Batch updates: we implemented performBatchUpdates(_:animated:completion:) for cells. This allowed us to update all the setters in the view but trigger only one updateViews call.

More performance

We already had good scrolling performance, but loading more messages (in batches of 50) was blocking the main thread for too long, causing scrolling to halt for a fraction of a second. The bottleneck was, of course, UITextView.sizeThatFits(_:) again. We managed to make it considerably faster by disabling link detection and selection, and allowing non-contiguous layout, in our dummy sizing cell:

textView.layoutManager.allowsNonContiguousLayout = true
textView.dataDetectorTypes = .None
textView.selectable = false

Having done this, adding 50 messages at once was no longer an issue, provided there weren't too many messages already. But we thought we could take it a step further.

Given all the abstractions we had built with the layout model being cached and reused for both sizing and layout, we had everything in place to try and do the calculation in the background. Everything… but UIKit.

As you know, UIKit isn't thread safe, and our first strategy (which was simply to ignore this fact) caused some unsurprising crashes in UITextView. We knew we could use NSString.boundingRectWithSize(_:options:attributes:context:) in the background, but its results didn't match the values from UITextView.sizeThatFits(_:). It took us a while, but we found a solution:

textView.textContainerInset = UIEdgeInsetsZero
textView.textContainer.lineFragmentPadding = 0

and round NSString.boundingRectWithSize(_:options:attributes:context) to screen pixels with

extension CGSize {
    func bma_round() -> CGSize {
        return CGSize(width: ceil(self.width * scale) * (1.0 / scale),
                      height: ceil(self.height * scale) * (1.0 / scale))
    }
}

This way, we were able to warm the cache in the background and then retrieve all the sizes very fast on the main thread… provided that our layout didn’t have to deal with 5,000 messages.

In this case, the iPhone 4s was struggling a bit in our UICollectionViewLayout.prepareLayout(). The main bottlenecks were creating the UICollectionViewLayoutAttributes objects and retrieving the 5,000 sizes from NSCache. How did we improve this? We did the same as with the cells: we created a plain layout model object backing our UICollectionViewLayout and moved its calculation to the background as well. In the main thread we then just replaced the old model with the new one. Everything was amazingly smooth, except for…

Rotation and split-view

This wasn't really a problem for us, as we don't support rotation, but we already knew that we wanted to open-source Chatto, and thought it would be a big plus if we could support rotation and split-view nicely. We already had background layout calculation with smooth scrolling and smooth addition of new messages, but that didn't help much when our layout had to deal with 10,000 messages. Calculating so many text sizes was taking 10-20 seconds on an iPhone 4s, depending on the size of the messages, and we obviously couldn't make the user wait that long. Two solutions occurred to us:

  • Calculate sizes twice, once for the current width and again for the width as if the device was already rotated.
  • Avoid dealing with 10,000 messages.

The first solution is more of a hack than a proper solution - it doesn’t help much in split-view, and it doesn’t scale. So we went for the second solution:

Sliding data source

After some testing on the iPhone 4s, we concluded that supporting fast rotation meant handling a maximum of 500 messages, so we implemented a sliding data source with this (configurable) parameter. Opening a conversation initially loads 50 messages, and we then add another 50 as the user scrolls up to retrieve older ones. Once the user has scrolled up far enough, we start forgetting the first ones, so we have pagination in both directions. This wasn't too difficult to implement, but there was a problem when the data source was "full" and a new message was inserted.

When we already had 500 messages and a new message was received, we had to remove the first one, shift all the others one position up and insert the newly-received message. Again, this wasn't difficult to implement, but UICollectionView.performBatchUpdates(_:completion:) didn't like it. There were two main issues, which you can reproduce here:

  • Sluggish scrolling and jumps when receiving many messages.
  • Broken animation on message insertion due to changes in the content offset origin.

In order to solve these glitches we decided to relax the constraint of a strict maximum number of messages. We allow insertions to break the limit rule so that updating the collection stays smooth. Once the insertion has completed, and no more changes are pending in the update queue, we issue a "too many messages" warning to the data source. The adjustment is then handled separately with a reloadData instead of performBatchUpdates. As we don't have much control over when this happens, and given that the user could have scrolled to any position, we need to tell the data source where the user has scrolled to, so that it doesn't get rid of the messages the user is looking at:

public protocol ChatDataSourceProtocol: class {
    ...
    func adjustNumberOfMessages(preferredMaxCount preferredMaxCount: Int?, focusPosition: Double, completion: (didAdjust: Bool) -> Void)
}

UITextView hacks

So far I've mentioned Auto Layout performance issues, sizing performance issues, and the obstacles to getting matching sizes for background calculation using NSString.boundingRectWithSize(_:options:attributes:context:).

To benefit from link detection and native action handlers, we had to enable the selectable property of the UITextView. This came with some unwanted side effects in our bubbles, like free-range text selection and the magnifying glass. To support those features, UITextView also adds a handful of gesture recognisers that were interfering with selection and long presses in the text bubbles. I won't cover in detail the hacks we made to work around these issues, but you can check ChatMessageTextView and BaseMessagePresenter.

Interactive keyboard

Not only did UITextView cause the aforementioned problems, it also affected the keyboard. Interactive dismissal of the keyboard should be a fairly easy thing to achieve nowadays: you just need to override inputAccessoryView and canBecomeFirstResponder in your view controller, as explained here. However, this wasn't working well with the UIActionSheets presented by UITextView when the user long-pressed on a link.

Basically the action sheet was appearing underneath the keyboard and wasn’t visible at all. There’s another branch where you can play with this issue (rdar://23753306).

We had to place the input component in the regular view hierarchy of the view controller, listen to the keyboard notifications and change the collection view insets manually, as has always been done. However, no notification is received while the user is interactively dragging the keyboard, so the input bar stayed in the middle of the screen, leaving a gap between it and the keyboard as the user dragged it down. The solution to this is something of a hack, and consists of placing a dummy input accessory view and observing it via KVO. You can find more details here.

TL;DR

  • We tried Auto Layout, but we had to move to manual layout as performance wasn’t good enough.
  • We evolved to a layout-model idea that allowed us to reuse code between layoutSubviews and sizeThatFits(_:), and enabled layout calculation in the background. It turns out we converged on some of the ideas in AsyncDisplayKit.
  • We implemented performBatchUpdates(_:animated:completion:) and two different contexts for cells to minimise view updates.
  • We implemented a sliding data source with message count containment to achieve fast rotation and split-view size changes.
  • UITextView was really painful to adopt, and is still a bottleneck in scrolling performance on older devices (iPhone 4s) due to link detection. We stuck with it because we wanted native actions when interacting with the links.
  • Because of UITextView, we had to manually implement interactive dismissal of the keyboard by observing a dummy view via KVO.

– Badoo iOS Team –

Badoo Coding Challenge - Finalists and Winners


Thank you to everyone who took part in the Badoo Coding Challenge. We are very happy to announce the names of our finalists and winners below. We will be in contact with all finalists and winners via email very soon.

The Coding Challenge problems are open to everyone to solve for fun! Take a look at them here: https://challenge.badoo.com/archive

List of winners (in alphabetical order)

  • Alfredo B.
  • Alvaro G.
  • Daniel R.
  • Jorge L.
  • Jose Maria P.

List of finalists (in alphabetical order)

  • Aaj J.
  • Adrian S.
  • Alex Car.
  • Alex J.
  • Alexander V.
  • Ivan N.
  • José P.
  • Marian D.
  • Michał S.
  • Miguel R.
  • Mihail K.
  • Mijaíl V.
  • Nick A.
  • Roma S.
  • Rosa G.
  • Thijs T.
  • Vasiliy St.
  • Victor G.
  • Yulia K.
  • Zoltan E.

Thanks,

Badoo Tech Team

Mobile App Testing - Tips and Tricks


Our new article is in fact a list of tips and tricks. These tips will help beginners to progress faster while more experienced users will be able to streamline what they know. The article will also be useful for developers, product and project managers, and for anyone who would like to improve both product quality and inter-departmental relations.

You will learn:

  • How to make the process of mobile app testing easier in general;
  • About particular features of working with the network, with internal and external services, and with the iOS and Android platforms;
  • Which process solutions and changes will let you develop faster and introduce a testing culture to your development department;
  • About useful tools and solutions for testing, debugging, monitoring, and user migration.

How can you improve the testing process?

  1. Apply heuristics and mnemonics, since they help you memorise all aspects that need to be considered when testing a feature or an application.
  2. Screenshots, logs and video are a tester’s best proof-points.

    Unfortunately, server communication logs are not as easy to handle as client logs. They are usually added more for the developer's convenience when debugging communication with the server than for the tester's benefit.
    • Ask the client and server developers to export all server requests and responses into a convenient, no-nonsense log-viewing interface. It will become easier to analyse server requests and responses, pinpoint duplicates, and find more convenient ways of updating data.
    • For example, a developer may have to re-request the entire profile in order to update only a part of it, instead of using a more lightweight request. In situations where the location of a problem is unclear, a combination of server and client logs will in most cases help tackle the problem faster.
  3. You can use test “monkeys” to pinpoint crashes and hang-ups while you yourself are busy with more intelligent functionality tests. The most efficient test method is to combine test monkeys with telemetry tools in order to speed up troubleshooting (with TestFairy, for instance). TestFairy has recently started supporting iOS as well, but its functionality so far is limited.
  4. If you want to feel more confident before a release, you can use a beta version as a back-up network of testers. Having two or three specialists will certainly not be enough to cover all the combinations of cases and devices (particularly for Android), and in this case you can get help from beta users worldwide, which lessens the workload on the test team. I highly recommend using a TestFairy wrapper for the beta version.
    • The Android beta programme is better than its iOS equivalent in this respect: you can invite users via Google+ or accept invitations from them, and the number of beta users is not limited.
    • iOS with TestFlight (bought by Apple) has some artificial limitations: a maximum of 2,000 users and a mandatory review of the first beta version. Distribution services can be used for beta software as well.
  5. It is a good idea to have a debugging menu with functions that make life easier for developers and testers (especially the automation team). Functions can include simulating responses from the server, opening certain users, setting particular flags, clearing and closing sessions, and clearing caches. Our mobile applications feature a multifunctional debugging menu, and I quickly reached a point where I couldn't imagine doing manual or automated testing without it.
  6. The developer menu really is your best friend, and you should switch it on on both iOS and Android. In iOS it can do the following:
    • Enable Network Link Conditioner;
    • Enable traffic and power usage logs;
    • Test iAd ads in a more convenient way.

      Android options are even better, featuring multiple settings for any demand, from showing CPU and RAM utilisation to changing interface animation speed.
  7. If an application supports both portrait and landscape format, you should pay close attention to screen orientation changes. These can result in crashes, memory leaks and returns to the previous state.
  8. Switch between screens many times.
    • For iOS, you should check that memory operations work correctly (to prevent access to wrong memory areas and to prevent updates to screens that are already hidden) and that there are no memory leaks.
    • Memory leaks are possible in Android, as something may have locked up the previous activity.

    It’s a good idea to switch screens while the application is interacting with the network:

    • Incomplete requests should be cancelled;
    • Server response to an invisible (i.e. deleted from the memory) screen should not crash the application.
  9. Don't overlook testing on emulators and simulators - they are really convenient and make some test scenarios easier. In iOS, for example, this makes testing location changes, background location updates, memory-warning simulation via a hotkey, and slowed-down animations much easier. In Android, you can configure exotic hardware settings: screen resolution, pixel density, RAM size, heap size, and internal and external memory size.
  10. Fill up the device RAM before launching the app. Firstly, it will let you run a stress test and check operation speed. Secondly, it will let you check that the app saves and resumes its state correctly (i.e. where do we return to when the app comes back after being minimised? Will all the required services be running?).
  11. Run the app with debugger connected. Why?
    • Chances are that you will achieve enlightenment :)
    • It enables slow stepping through the app, which can sometimes reveal bugs.
    • If a crash or exception happens in an app, it will stop on a breakpoint, and you will be able to ping the developer and debug the issue right away.

Working with networks

  1. An app should operate in a stable manner under the following conditions:
    • When the connection is not stable;
    • When the connection is down;
    • When the connection speed is exceptionally low (1-2 Kb/s);
    • When there is no response from the server;
    • When the response from the server is wrong (i.e. it features errors or rubbish);
    • When the connection type changes on the fly (e.g. Wi-Fi → 3G → 4G → Wi-Fi).

    Make full use of "problems with the network" test-case chains, using:

    • Customised router firmware. The Tomato firmware for the Linksys WRT54G used to help me a lot. The router was dirt cheap, and this firmware let you set the required Wi-Fi speed on the go, without losing the connection with the devices;
    • Proxy;
    • WANEm;
    • Network Link Conditioner can be easily installed on your Mac, and is built into iOS from version 6.0 onwards. It can shape traffic and distribute it via an access point, both on iOS devices and on Macs;
    • With Android, you can use preset connection speeds in an emulator, or tweak settings manually using netspeed.
  2. If you need a proxy server, the easiest solution is CharlesProxy (it features manuals for devices and emulators on iOS and Android, supports binary protocols, rewriting and traffic throttling and, in short, is worth every penny you pay for it) or Fiddler (which is free).

Working with application data, internal and external services

  1. If there is a third-party service, it will definitely fail at some point. A recent FB outage affected the performance of some applications and websites. You should anticipate as many such issues as possible in advance and think up ways of handling them: process unexpected responses (errors, rubbish, lack of response, null responses) from third-party services and give the user feedback about the issue, add time-outs to the necessary requests, and so on (a small defensive-parsing sketch follows this list).
  2. If you have third-party libraries, they will definitely cause problems. The Twitter, PayPal and Facebook SDKs do have bugs. One version of the Twitter SDK, for example, used to crash when it received error 503 from its own back-end; the library would just crash and bring the application down with it. It's not uncommon for the Facebook SDK to crash on Android (you can sometimes spot the com.facebook.katana process in crash alerts).
  3. URI and data parsers should account for every possible unexpected situation; there have been cases when an automatic file and/or URI validity check stopped working on the server side, and the applications had to collect information the hard way. Examples:
    • A 404 error is returned in HTML format;
    • A null response is returned;
    • The server responds to an API request with a standard web-server stub (“It works!” in the case of nginx);
    • A null data structure is returned (JSON, XML, PLIST);
    • A wrong data structure is returned (HTML instead of XML);
    • An invalid URI, or a URI pointing to the wrong destination, is returned.

    In all these cases, an app may fail to parse an unexpected response and crash.

    Apart from working on the application stability, in the aforementioned cases it’s important to give the user some visual feedback: an alert, a toast notification, placeholders instead of data, etc.

  4. If your app updates data via static or easily-formed URLs, you can use Dropbox or Google Drive in cases where the server logic is not ready or is still being tuned. Uploading and updating files directly on the device is not very pleasant, so here's what we did:
    • We made all URLs configurable and set them to a separate entity, so that a team of developers or testers could easily reassemble apps with particular URLs for updatable data;
    • In addition, we changed all the necessary files and substituted them for the existing ones (manually or with the simplest of scripts). It's possible to write another script to roll back to the previous version by restoring reference files (you can also use the file versioning provided by Dropbox).
  5. Don't forget to check data and cache migration when an app is updated. It's important to bear in mind that users can skip versions, so we should check updates from earlier versions as well. For example, in June 2015 the LinkedIn app crashed at startup: some users were unable to use the app until the new version was released (fortunately, it was released the same day).
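As mentioned above, here is a small, purely illustrative Python sketch of this kind of defensive handling, using the requests library and a made-up URL: a time-out on the request plus graceful handling of error codes and non-JSON responses.

# Illustrative sketch: call a third-party service defensively (hypothetical URL).
import requests

def fetch_profile(user_id):
    try:
        resp = requests.get('https://api.example.com/profile/%s' % user_id,
                            timeout=5)      # never wait forever for a response
    except requests.RequestException:
        return None                         # network problem: fall back gracefully
    if resp.status_code != 200:
        return None                         # 404 page, web-server stub, etc.
    try:
        return resp.json()                  # body may be HTML, rubbish or empty
    except ValueError:
        return None

profile = fetch_profile(42)
if profile is None:
    print('Could not load the profile - show a placeholder and a retry button')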

Android

  1. Set custom screen resolutions on an emulator: it will help you identify layout problems if you lack devices or just want to check whether the layout has been written correctly. In addition, you can change the screen resolution and pixel density via ADB on a physical device too - on a Nexus 10, for instance.
  2. If the keyboard is overridden (i.e. a custom one is used), pay close attention to this. There are both non-bypassable keyboard errors and logic/graphical errors.
  3. Staged rollout will help you pinpoint problems that can easily be overlooked while testing a release version: you can do a 5-10% release, monitor graphs and crashes and, if necessary, perform a rollback or resubmit a fixed version.
  4. Use the "Don't keep activities" developer option when testing, and make sure the application is ready for activities being killed unexpectedly, which can otherwise lead to crashes or data loss.

iOS

  1. Check whether standard gestures have been overridden. For example, enabling "Universal Access" adds extra gestures that can conflict with those in your app (three- and four-finger gestures, for instance).
  2. Also pay attention to third-party keyboards. For example, iOS 9 has a bug that crashes the app if you type text with a third-party keyboard in a modal WebView window.
  3. Show the rollout.io service to your developers. It lets you patch some crashes in production, redefine parameters, show apologetic alerts and disable certain buttons. It has saved our lives time and again.
  4. To perform interactive layout testing, or to check that all screens have been removed from the hierarchy, you can use the standard Xcode tools, Spark Inspector or Reveal.
  5. Ask your colleagues to integrate Memory Warning calls into the debugging menu. They are usually assigned to a particular gesture (tapping with several fingers, pressing the status or navigation bar) or to the volume control buttons. You need to check that the app behaves appropriately after a Memory Warning: does it clean up used resources, and if so, is it done properly?
    • For example, we had a nasty bug where our Image service would offload a picture from RAM after a Memory Warning, so the user got to see a placeholder instead.

Debugging processes

These tips will help you make faster progress with mobile app testing and teach you how to avoid the hidden pitfalls in communication with developers.

  1. Introduce a Pre-QA culture. Prior to sending a ticket to be reviewed, take a seat next to the developer, at their computer, and test the feature with debugger connected for 5-10 minutes. The majority of the silliest mistakes will show up immediately. This will also teach developers the basic testing skills: at worst, they will carry on doing what you showed them to do; and at best, they will dig deeper and start testing more responsibly. No one wants to make silly mistakes in public.
  2. Take at least a quick glance at the diffs in every branch/feature, and ask the developers as many questions as possible. In this way you increase your authority as a tester: it shows you are trying to understand the code and the areas relevant to the feature. To this day, developers sometimes still see mobile app testers as monkeys who just poke at phone screens and juggle devices to make the app crash. You can also act as a reviewer if no other developers are available; sometimes a developer, while explaining how a feature works, finds bugs or cases they failed to consider. In addition, you will gradually learn the programming language and get a better understanding of what's going on under the hood of the application.
  3. Study the lifecycle of entities in the app (Activity for Android 1, 2, 3; ViewController for iOS 1, 2, 3) in order to understand the states into which an app screen and the app itself can transition. The better you know the application/ecosystem from the inside, the better you will be able to test it.
  4. If you have apps for iOS and Android, it is important to keep the right resource balance during testing.

Bugs in applications can result in re-submission, which has predictable consequences. Also be aware that the cost of errors is often lower on the Android platform.

  • Android features staged rollout. You can re-release an Android application the very same day, or even roll back a staged rollout (a rollout of up to 50% can be completely rolled back to the previous version). But you shouldn't re-release too often, since users will start complaining and might give you bad reviews.
  • For iOS, the best way of resubmitting is via expedited review (you should definitely not abuse this). The application will be re-released:
  • on the same day at the earliest (it usually goes to “in review” on the same day, but won’t be “available for review” until the next day);
  • at worst (if expedited review is not allowed) it can take 5–10 days.

On the other hand, iOS applications are faster to test, since the iOS ecosystem is not as fragmented as Android's.

Miscellaneous

  1. What if the worst happens when you least expect it, and a non-stable version of the app gets into production? We use a system of update screens to speed up user migration. Such a system can be useful in the following instances:
    • In the case of a critical bug that gets overlooked during development and testing;
    • when we need to update an application to the required version quickly in order to:
      • launch the feature on all platforms at the same time (it’s also helpful in cases where changes break backward compatibility);
      • to get faster, more consistent A/B testing;
      • to take pressure off server teams who have to support outdated API versions because a number of users keep using (very) old app versions.

    Our update system operates in two modes:

    • Soft update, where the screen features “Update” and “Skip” buttons. The screen can be hidden for 24 hours. Also, in this mode you can ask users to enable automatic iOS and Android app updates in system settings, since some users disable automatic updates.
    • Hard update, where the screen shows only the “Update” button, leading you directly to the app’s page in a store.

    Not all users are physically able to update their apps, so this mechanism is intentionally disabled for some versions or in some cases. iPhone 4 users, for instance, cannot upgrade to iOS 8, and we are planning to drop iOS 7 support in the app.

  2. We need to monitor the following key application metrics on production:
    • Daily/monthly/… active users graphs in order to respond to emergencies faster;
    • Systems to collect and analyse crash logs: Crashlytics (now part of Twitter Fabric), HockeyApp, Crittercism, BugSense (now part of Splunk);
    • User feedback systems via an app (built-in feedback forms or email submission) with a way of attaching device descriptions and screenshots;
    • Application usage statistics (GoogleAnalytics, Flurry, Splunk, Heatmaps.io, MixPanel);
    • Digests of downloads, feedback grouping, and finding out whether the app has been featured anywhere (App Annie, which acquired Distimo).
  3. Sometimes only users with particular devices, or in particular countries, experience errors. Vodafone UK, for instance, had issues with WebP-images. You can use cloud-based device rental solutions to check cases like these: DeviceAnywhere (paid service), PerfectoMobile (paid service), Samsung Device Lab (free service, but features a system of credits that can be replenished over time).
  4. In addition, you should bear the user's time zone and location in mind. It may well be that, although your app is not intended to work in some countries, it has been released there by mistake, or the user has moved to another country. In iOS, the location can be faked in the simulator settings (Debug > Locations); there are also Android applications that let you do the same thing. If an application works with data and there are several data centres in different time zones, you should make sure that everything works properly and that there are no collisions when users switch data centres while travelling.
  5. You should learn to update and downgrade firmware, since the platforms are fragmented (Android and BlackBerry in particular). Cloud-based services are good, but they cost money, and not every company can use them because of budgetary constraints or security policies.
  6. So you've detected a bug after releasing a feature and need to re-release a new version? Being able to enable, disable and modify features on the go will help you. Many features in our apps can be disabled directly from the server via a dedicated interface, whenever the company decides to.

Conclusion

Such a list of tips, approaches and tools can be useful both for beginners and for advanced testers of mobile apps. I hope that developers and managers will find something useful here as well.

When we were compiling this list, we were guided by our own experience, and we’d love to hear your opinion, which you can send us via comments below.

Winium.Desktop - Selenium for Windows-based Desktop Applications

$
0
0

Hi there. My name is Gleb, and I do test automation in 2GIS. Over a year ago I published an article about one of our tools (Cruciatus). We use it to test user interface for Windows-based desktop applications.

Cruciatus perfectly solves the problem of access to controls, but the tests have to be written in C#. This interferes with sharing knowledge and experience among testers for various platforms: mobile, web and desktop.

We have found our solution in Selenium, which is probably the best known tool for automated testing. In this article I will tell you about our experience in crossbreeding Cruciatus and Selenium, and about testing Windows-based desktop applications using well-known Selenium bindings.


Why was Cruciatus not enough

Almost all teams that were dealing with internal 2GIS products used Cruciatus, and each of them suggested improvements for the tool. So, in order to please everyone, we reworked the Cruciatus logic completely, breaking its backward compatibility in the process. It was painful, but useful.

Besides, we abandoned the Mouse and Keyboard classes from CodedUI in order to eliminate the dependency on libraries that ship only with Visual Studio. This means we can now build the project on public CI servers like AppVeyor.

As a result, we have created a convenient, self-contained tool that solves all our problems with access to Windows-based desktop applications. However, Cruciatus still has one crucial limitation, namely the C# dictatorship.


Coming to Selenium

Selenium is a set of tools and libraries to automate testing of apps in browsers. The core of the Selenium project is the Json Wire Protocol (JSWP), a single REST protocol for interaction between tests and the app being tested.

Benefits of the single protocol:

  • Tests can run on all platforms and in all browsers.
  • Engineers can code them in any language. There are Selenium bindings for Python, C#, Java, JavaScript, Ruby, PHP, Perl. One can develop bindings in other languages on their own.
  • Same commands work for different types of applications. For a test, a click in a mobile interface is the same as a click in a web interface.

We decided to use these advantages to automate testing desktop applications in the same manner as we use them for web.


What is Winium.Desktop?

In order to escape the C# dictatorship, we made a Selenium-compatible wrapper for Cruciatus. At the same time our company was busy creating Selenium-compatible tools for automated tests of mobile Windows applications. We united our efforts under the name Winium and called our tool Winium.Desktop.

Technically, Winium.Desktop is an HTTP server. It implements the JSWP protocol and uses Cruciatus to work with UI elements. Essentially, this is an implementation of WebDriver for Windows-based desktop applications.

We use regular Selenium bindings with Winium.Desktop in order to test Windows-based desktop applications.


Working with Winium.Desktop

In order to start working with Winium.Desktop you should download the latest driver release from GitHub and run it as administrator. This is not strictly required, but otherwise you might run into an 'Access denied' message either from the OS or from the application itself.

It’s all ready to go. Pick your favourite language and favourite IDE now and code tests in the same manner as you would do it for a web application. If you are not experienced with Selenium yet, read any manual on it. We would recommend you to start with Selenium Python Bindings.

The only difference from testing web applications is that to find out elements' locators you should use tools like UISpy or UI Automation Verify. We'll discuss them in detail below.

While the tests are running, don't touch your mouse or keyboard: if you move the cursor or shift the focus, the automated magic won't work.


What a driver can do

In our Json Wire Protocol implementation we were guided by two drafts of a protocol that are used by WebDriver: JsonWireProtocol and a newer webdriver-spec. By now we have implemented the majority of the most popular commands.

The complete list of commands (and their corresponding queries) implemented so far:
  • NewSession: POST /session
  • FindElement: POST /session/:sessionId/element
  • FindChildElement: POST /session/:sessionId/element/:id/element
  • ClickElement: POST /session/:sessionId/element/:id/click
  • SendKeysToElement: POST /session/:sessionId/element/:id/value
  • GetElementText: GET /session/:sessionId/element/:id/text
  • GetElementAttribute: GET /session/:sessionId/element/:id/attribute/:name
  • Quit: DELETE /session/:sessionId
  • ClearElement: POST /session/:sessionId/element/:id/clear
  • Close: DELETE /session/:sessionId/window
  • ElementEquals: GET /session/:sessionId/element/:id/equals/:other
  • ExecuteScript: POST /session/:sessionId/execute
  • FindChildElements: POST /session/:sessionId/element/:id/elements
  • FindElements: POST /session/:sessionId/elements
  • GetActiveElement: POST /session/:sessionId/element/active
  • GetElementSize: GET /session/:sessionId/element/:id/size
  • ImplicitlyWait: POST /session/:sessionId/timeouts/implicit_wait
  • IsElementDisplayed: GET /session/:sessionId/element/:id/displayed
  • IsElementEnabled: GET /session/:sessionId/element/:id/enabled
  • IsElementSelected: GET /session/:sessionId/element/:id/selected
  • MouseClick: POST /session/:sessionId/click
  • MouseDoubleClick: POST /session/:sessionId/doubleclick
  • MouseMoveTo: POST /session/:sessionId/moveto
  • Screenshot: GET /session/:sessionId/screenshot
  • SendKeysToActiveElement: POST /session/:sessionId/keys
  • Status: GET /status
  • SubmitElement: POST /session/:sessionId/element/:id/submit

Here’s an example of some simple commands’ usage (Python):

  1. When a driver is being created, we launch the app with a NewSession command:
    driver = webdriver.Remote(command_executor='http://localhost:9999', desired_capabilities={"app": r"C:/windows/system32/calc.exe"})
  2. We then find the window of the application to be tested using a FindElement command:
    window = driver.find_element_by_class_name('CalcFrame')
  3. After that we find the element in the window with a FindChildElement command:
    result_field = window.find_element_by_id('150')
  4. Then we get an element property with a GetElementAttribute command:
    result_field.get_attribute('Name')
  5. Finally, we close the application with a Quit command:
    driver.quit()

Same in C#:

var dc = new DesiredCapabilities();
dc.SetCapability("app", @"C:/windows/system32/calc.exe");
var driver = new RemoteWebDriver(new Uri("http://localhost:9999"), dc);
var window = driver.FindElementByClassName("CalcFrame");
var resultField = window.FindElement(By.Id("150"));
resultField.GetAttribute("Name");
driver.Quit();

You can get more detailed information about the supported commands on our wiki, which is located in the project repository.


Working with elements

In order to control elements in tests one should be able to find them first. The elements are searched by locators (properties that are unique to elements).

You can get the elements' locators using UISpy, Inspect (a newer version of UISpy) or UIAVerify. The latter two are installed together with Visual Studio and are located in the "%PROGRAMFILES(X86)%\Windows Kits\8.1\bin\" directory (the path can differ for different versions of the Windows Kits).

It is recommended to launch any of these tools as administrator. We suggest using UIAVerify: in our opinion it is the fastest and most convenient.

Cruciatus can search for elements by any property from the AutomationElementIdentifiers class, but Winium.Desktop supports only three search strategies (locator types):

  • AutomationProperties.AutomationId;
  • Name;
  • ClassName.

Desktop is the root element of the search. It is recommended to look for the window of the application under test first (FindElement), and only then search for its child elements (FindChildElement).

If these strategies are not enough for your case, email us or do not hesitate to create a new issue.

Example. A code that can code.

from selenium import webdriver
from selenium.webdriver import ActionChains
import time

driver = webdriver.Remote(
    command_executor='http://localhost:9999',
    desired_capabilities={'app': r'C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\devenv.exe'})

window = driver.find_element_by_id('VisualStudioMainWindow')
menu_bar = window.find_element_by_id('MenuBar')
menu_bar.click()
menu_bar.find_element_by_name('File').click()
menu_bar.find_element_by_name('New').click()
menu_bar.find_element_by_name('Project...').click()

project_name = 'SpecialForHabrahabr-' + str(time.time())

new_project_win = window.find_element_by_name('New Project')
new_project_win.find_element_by_id('Windows Desktop').click()
new_project_win.find_element_by_name('Console Application').click()
new_project_win.find_element_by_id('txt_Name').send_keys(project_name)
new_project_win.find_element_by_id('btn_OK').click()

text_view = window.find_element_by_id('WpfTextView')
text_view.send_keys('using System;{ENTER}{ENTER}')

actions = ActionChains(driver)
actions.send_keys('namespace Habrahabr{ENTER}')
actions.send_keys('{{}{ENTER}')
actions.send_keys('class Program{ENTER}')
actions.send_keys('{{}{ENTER}')
actions.send_keys('static void Main{(}string{[}{]} args{)}{ENTER}')
actions.send_keys('{{}{ENTER}')
actions.send_keys('Console.WriteLine{(}"Hello Habrahabr"{)};')
actions.send_keys('^{F5}')
actions.perform()


Continuous Integration for Winium.Desktop tests

Tests that are run by the Winium.Desktop driver are included in a CI project in the standard manner. A real or virtual machine is necessary for that. When setting up a machine for CI you should observe a few formalities.

First, the system needs a so-called active desktop. It is there when you are logged in at the computer or connected via RDP. Note that you must not minimize the RDP connection window. You can use Autologon to create an active desktop automatically.

Second, you must keep the desktop active. To do that, adjust the machine's power settings (for the user configured in Autologon): disable display power-off and hibernation. If you use an RDP connection, reboot the machine after the connection is closed; this restores the active desktop. To keep an eye on the tests you can use System Center App Controller or VNC.

Third, your build server agent must run as a process, not as a service. This limitation is due to the fact that Windows services are not allowed to run user interface applications (i.e. services cannot access the desktop).

To sum it up: set Autologon up, keep the desktop active and run the build server agent as a process.


Conclusion

Winium.Desktop project enabled us to blur the distinction between automated UI testing of web and desktop applications.

Testers are now free to share their Selenium experiences and best practices. The automated tests that have been written for totally different platforms can be executed in one cloud infrastructure based on Selenium-Grid.

And again: here is a link to the repository and other open source 2GIS products.

Gleb Golovin, 2GIS.

Create a plugin for Google Protocol Buffer

$
0
0

Google’s Protocol Buffer is a library to encode and decode messages in a binary format optimised to be compact and portable between different platforms.

At the moment the core library can generate code for C/C++, Java and Python, but additional languages can be supported by writing a plugin for the Protobuf compiler.

There is already a list of plugins that support third-party languages; however, they directly translate the .proto files into the target language code, which then makes it possible to add business logic to the generated code.

In our case we wanted to have more control over what we generate and to include some logic as well, so we decided to write our own code generation plugin.

This post is a simple example of a plugin written in Python, which can be used as starting point for any other Google Protocol Buffer plugin.

What we’re going to build

In this post we are going to build and understand step by step:

  • an interface between our code and the Protobuf compiler
  • a parser for .proto data structure
  • the output of our generated code

Environment setup

Before we start writing the plugin, we need to install the Protocol Buffer compiler to be able to compile .proto files:

apt-get install protobuf

and then the Python Protobuf package to implement our plugin:

pip install protobuf

Writing the plugin

The interface between the protoc compiler and a plugin is pretty simple: the compiler passes a CodeGeneratorRequest message on stdin and your plugin outputs the generated code in a CodeGeneratorResponse on stdout. So the first step is to write code which reads the request and writes back an empty response:

#!/usr/bin/env python
import sys

from google.protobuf.compiler import plugin_pb2 as plugin


def generate_code(request, response):
    pass


if __name__ == '__main__':
    # Read request message from stdin
    data = sys.stdin.read()

    # Parse request
    request = plugin.CodeGeneratorRequest()
    request.ParseFromString(data)

    # Create response
    response = plugin.CodeGeneratorResponse()

    # Generate code
    generate_code(request, response)

    # Serialise response message
    output = response.SerializeToString()

    # Write to stdout
    sys.stdout.write(output)

The protoc compiler follows a naming convention for plugins: as stated in the Protobuf plugin documentation, you can save the code above in a file called protoc-gen-custom somewhere in your PATH, or save it under any name you prefer (like my-plugin.py) and pass the plugin's name and path to the --plugin command line option.

We are choosing the second option - passing the full path of our plugin to the --plugin command line option - because it will be much easier to pass a full path to our plugin instead of putting it into the PATH and it will make the entire compiler invocation more explicit.

So we'll save our plugin as my-plugin.py, and the compiler's invocation will look like this (assuming that the build directory already exists):

protoc --plugin=protoc-gen-custom=my-plugin.py --custom_out=./build hello.proto

The content of hello.proto file is simply this:

enum Greeting {
    NONE = 0;
    MR = 1;
    MRS = 2;
    MISS = 3;
}

message Hello {
    required Greeting greeting = 1;
    required string name = 2;
}

The command above will not generate any output because our plugin does nothing. Now it’s time to write some meaningful output.

Generating code

Let's modify the generate_code() function to generate a JSON representation of the .proto file. First we need a function to traverse the AST (the Abstract Syntax Tree of the input .proto file) and return all the enumerators, messages and nested types (https://developers.google.com/protocol-buffers/docs/proto#nested):

def traverse(proto_file):

    def _traverse(package, items):
        for item in items:
            yield item, package

            if isinstance(item, DescriptorProto):
                for enum in item.enum_type:
                    yield enum, package

                for nested in item.nested_type:
                    nested_package = package + item.name

                    for nested_item, nested_item_package in _traverse(nested_package, [nested]):
                        yield nested_item, nested_item_package

    return itertools.chain(
        _traverse(proto_file.package, proto_file.enum_type),
        _traverse(proto_file.package, proto_file.message_type),
    )

And now the new generate_code() function:

import itertools
import json

from google.protobuf.descriptor_pb2 import DescriptorProto, EnumDescriptorProto


def generate_code(request, response):
    for proto_file in request.proto_file:
        output = []

        # Parse request
        for item, package in traverse(proto_file):
            data = {
                'package': proto_file.package or '<root>',
                'filename': proto_file.name,
                'name': item.name,
            }

            if isinstance(item, DescriptorProto):
                data.update({
                    'type': 'Message',
                    'properties': [{'name': f.name, 'type': int(f.type)}
                                   for f in item.field]
                })

            elif isinstance(item, EnumDescriptorProto):
                data.update({
                    'type': 'Enum',
                    'values': [{'name': v.name, 'value': v.number}
                               for v in item.value]
                })

            output.append(data)

        # Fill response
        f = response.file.add()
        f.name = proto_file.name + '.json'
        f.content = json.dumps(output, indent=2)

For every .proto file in the request we iterate over all the items (enumerators, messages and nested types). We store the metadata about any messages and enumerators we encounter during the AST traversal into a dictionary-like data structure which will be used later for generating the output.

We then add a new file to the response and set its filename, which in this case is the original filename plus a .json extension, and its content, which is the JSON representation of the dictionary.

If you run the protobuf compiler again, it will output a file named hello.proto.json in the build directory with this content:

[{"type":"Enum","filename":"hello.proto","values":[{"name":"NONE","value":0},{"name":"MR","value":1},{"name":"MRS","value":2},{"name":"MISS","value":3}],"name":"Greeting","package":"&lt;root&gt;"},{"properties":[{"type":14,"name":"greeting"},{"type":9,"name":"name"}],"filename":"hello.proto","type":"Message","name":"Hello","package":"&lt;root&gt;"}]

Conclusion

In this post we walked through the creation of a Protocol Buffer plugin that compiles a .proto file into a simplified JSON representation. The core of the plugin is the interface code that reads a request from stdin, traverses the AST and writes the response to stdout.

The most challenging part was figuring out how the information about the Protobuf data is passed to the plugin and back to the compiler. I was expecting some kind of common data format like JSON or XML, but instead a custom binary data structure is used. This is where I spent most of the time building the first plugin prototype, but thanks to the list of plugin examples I was able to understand the plugin/compiler communication.

You are not limited to transforming the input into another format: you can also use the request to output code in any language. You can parse a .proto file and output code for a RESTful API in Node.js, convert the message and enum definitions into an XML file, or even generate another .proto file, e.g. one without the deprecated fields.


Winium - Now for Windows Phone

$
0
0

We’re delighted to announce that we are publishing guest writer, Nick Abalov’s first article on our blog. Nick has been kind enough to share his work with us.

There are no convenient open source automated testing tools for Windows Phone and Windows. The existing tools are proprietary, have limitations, and impose their own approaches that differ from conventional standards like Selenium WebDriver. In this post we will present Selenium-compatible open source tools for Windows Phone automation.

A colleague of mine, skyline-gleb, has recently published a post on Badoo Tech about our own Selenium-like tool for automated functional testing for Windows-based desktop applications. At the same time we were also developing a similar tool for Microsoft mobile platforms.

This article will tell you about the story behind this tool, advantages of a single automated testing platform for all mobile platforms, and about how to implement it within your projects.

Let me provide you with some background.

  • October 2010. Windows Phone 7 was released. A year later Expensify released WindowsPhoneTestFramework, an open source BDD tool to test native applications.
  • October 2012. Windows Phone 8 was released. Microsoft still did not release a tool for testing via the UI.
  • February-March 2014. We published the first prototype of WinphoneDriver, the first open source Selenium implementation for native Silverlight Windows Phone applications.
  • In April 2014 Microsoft released Windows Phone 8.1 and, almost 4 years later than expected, official tools to test Windows Phone applications via the UI: CodedUI. Unfortunately, this tool is not compatible with Selenium, and it is only available in the most expensive Visual Studio subscriptions.
  • In May 2014 Salesforce.com published windowsphonedriver (a Selenium implementation for testing web applications on Windows Phone) as open source. At almost the same time we updated our driver to support Windows Phone 8.1.
  • In February 2015 we published Winium.StoreApps as open source. It is an updated version of winphonedriver that implements a fair share of the protocol commands and supports native StoreApps applications for Windows Phone 8.1. This is the driver we use in our processes.

Right afterwards we presented our tools at Codefest 2015. There we had an informal talk with Sathish Gogineni from Badoo, which developed into the idea of Winium CodedUI: an implementation of a Selenium driver based on CodedUI that would support native and hybrid applications and, last but not least, direct tests on devices.

When the project started there was only one open tool: Expensify/WindowsPhoneTestFramework. It did not suit us because it was incompatible with Selenium, had a non-standard API and was built around a BDD approach. In the course of our project's development Microsoft released their own tool, CodedUI. Again, it had its own non-standard API, was geared towards writing tests in Visual Studio in C#, and was closed source and not free (which makes it hard to scale).

So, that was a bit of a retrospective journey; back to Winium. Since the tools we mentioned did not suit us, we decided to make a tool of our own. That is how the Winium project was born. It started as a tool for automated testing of Windows Phone Silverlight applications and soon turned into a comprehensive set of automated testing tools for the Windows platform:

We have already discussed Winium.Desktop and Winium.Cruciatus in posts on Habrahabr and the Badoo tech blog. Today we will discuss Winium for Store Apps (a descendant of Windows Phone Driver) and Winium.StoreApps.CodedUi.


Winium.StoreApps

Main features and the limitations of the driver.

Winium.StoreApps is the main driver implementation for mobile applications. We actively use and develop it in our day-to-day processes. The source code is open and available on GitHub.

Main features:

  • It implements the Selenium protocol to test native StoreApps applications for the Windows Phone platform
  • It works with the Json Wire Protocol; one can use Selenium or Appium bindings and write tests in the same manner as for iOS or Android
  • It supports installation and launch of the app being tested, as well as uploading files to the app's local storage
  • It supports single-touch gestures
  • It provides a basic inspector to view the UI tree of the tested application
  • It supports Selenium Grid, which lets us parallelise test runs.

Limitations:

  • Only emulators are supported (with some changes the driver can install the app on a device and work with it, but we still don't know how to provide fully functional simulation of gestures and input on devices)
  • One must embed an automation server into the app under test (i.e. add a NuGet package to the app and add one line of code that runs the server on a separate thread; it definitely breaks the first Appium rule, but it is the only option, besides CodedUI, that we were able to find)
  • Only one session is supported, but one can connect the driver to Grid to distribute the tests and run them in parallel.

Winium.StoreApps supports all the main Selenium commands and can be integrated into an existing (Selenium- or Appium-based) testing infrastructure. In short, it can be actively used as part of a continuous integration process, which is exactly what we are doing.


How it all works.

Technically, Winium.StoreApps.Driver.exe is an HTTP server that implements the JsonWire/WebDriver REST protocol. When necessary, the driver proxies incoming commands to InnerServer (a test automation server) that is embedded into the application being tested.

Interaction structure among tests, the driver and the tested app.
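To make the protocol side concrete, here is roughly what the raw JsonWire traffic to the driver looks like when a session is created. This is a hand-rolled sketch using the requests library rather than the Selenium bindings, and the capability values are placeholders:

import requests

# The NewSession command that the Selenium bindings send under the hood:
# the driver installs the app and then proxies further commands to the
# InnerServer embedded in it.
payload = {
    'desiredCapabilities': {
        'deviceName': 'Emulator 8.1 WVGA 4 inch 512MB',
        'app': r'C:\YourAppUnderTest.appx',  # placeholder path
    }
}
response = requests.post('http://localhost:9999/session', json=payload)
session_id = response.json()['sessionId']

# Subsequent commands reference that session, e.g. taking a screenshot:
screenshot = requests.get('http://localhost:9999/session/{}/screenshot'.format(session_id))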

How to prepare an application and how to code tests.

One should follow three simple steps to launch tests against our application:

  • Prepare the application
  • Code tests (ok, this one is not that easy)
  • Run tests

Preparing an application

This is the easy part: we add the Winium.StoreApps.InnerServer NuGet package and initialise the automation server on the main thread after the UI has been created, for example in the MainPage's loaded handler. The server is initialised on the main thread only so that it gets the correct dispatcher; it then operates on a separate thread, apart from direct access to the UI.

Then it would be really nice to enable testability of the app by specifying names and identifiers for the elements you are planning to use in test (this is not mandatory, however).

That’s it. The only thing you need to do now is to create an appx package with your application.


Coding tests

Tests are written with the Selenium or Appium bindings, in the same way as they are for web or mobile applications.

The first thing you need to do is to create a new session. During its creation one can specify various properties. Here’s a list of basic ones that are supported by the driver (The complete list can be found on wiki).

dc = {
    'deviceName': 'Emulator 8.1 WVGA 4 inch 512MB',
    'app': r'C:\YourAppUnderTest.appx',
    'files': {
        r'C:\AppFiles\file1.png': r'download\file1.png',
        r'C:\AppFiles\file2.png': r'download\file2.png',
    },
    'debugConnectToRunningApp': False
}
  • deviceName is a partial name of a device where we will run our tests. If left empty, the first emulator from the list will be chosen.
  • app: a complete path to the appx package with the tested app featuring a built-in automation server.
  • files: a dictionary of files that would be uploaded from a local disk to local storage of the application.
  • debugConnectToRunningApp: allows you to skip all the installation and file upload steps and connect to an already running application. This can be convenient if you started the application from Visual Studio, set all your breakpoints and now want to debug an error that occurs during one of the tests.

Ok, we've created a session and launched the app. Now we need to locate the elements we want to interact with. The driver supports the following locators (a short example follows the list):

  • id: AutomationProperties.AutomationId
  • name: AutomationProperties.Name
  • class name: full class name (Windows.UI.Xaml.Controls.TextBlock)
  • tag name: same as class name
  • xname: x:Name, a legacy locator. It is normally not supported by the default bindings and requires you to rework the bindings to use it. At the same time, it enables search by the name that is usually assigned to an element for access from code.
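For example, assuming hypothetical element names and identifiers in the app under test, the first three strategies map onto the usual Selenium binding calls:

# Hypothetical element names/ids, shown only to illustrate the locator types
driver.find_element_by_id('SignInButton')        # AutomationProperties.AutomationId
driver.find_element_by_name('Sign in')           # AutomationProperties.Name
driver.find_element_by_class_name('Windows.UI.Xaml.Controls.TextBlock')  # full class name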

To make it easier to find locators and understand the UI structure of the tested app in general, we have created an inspector that can connect to the application and display the current UI state as seen by the driver.

Inspector’s main window

There is not much the inspector can do yet, but it does provide the basics: a screenshot, the UI tree as seen by the driver with all known elements, their locators and their basic properties such as position, visibility and text. This should help you get started with writing tests.

Ok, we have found the element. Now we can fiddle with it in any way we please.

# you can request the text value of an element
element.text
# you can click (tap) an element
element.click()
# you can simulate user input into an element
element.send_keys('Hello!' + Keys.ENTER)
# you can get any public property value of an element
element.get_attribute('Width')
# you can request a nested property
element.get_attribute('DesiredSize.Width')
# you can even get a complex property value as JSON
element.get_attribute('DesiredSize')  # '{"Width":300.0,"Height":114.0,"IsEmpty":false}'

One click is not enough in the world of mobile phones; we need gestures. We support the good old JSWP API and plan to add support for the new Mobile WebDriver API soon. You can already perform flicks and scrolls.

TouchActions(driver).flick_element(element, 0, 500, 100).perform()
TouchActions(driver).scroll(200, 200).perform()

# you can even create custom gestures
ActionChains(driver) \
    .click_and_hold() \
    .move_by_offset(100, 100) \
    .release().perform()

Since the automation server is embedded into the app, you can do far more interesting tricks, like calling MS Accessibility API commands:

# direct use of Windows Phone automation APIs
app_bar_button = driver.find_element_by_id('GoAppBarButton')
driver.execute_script('automation: invoke', app_bar_button)

list_box = driver.find_element_by_id('MyListBox')
si = {'v': 'smallIncrement', 'count': 10}
driver.execute_script('automation: scroll', list_box, si)

This lets you scroll elements precisely instead of simulating gestures. Besides, you can assign a value to a public property, though we do not recommend doing that in tests:

text_box = driver.find_element_by_id('MyTextBox')
driver.execute_script('attribute: set', text_box, 'Width', 10)
driver.execute_script('attribute: set', text_box, 'Background.Opacity', 0.3)

However, there are situations when this is justified. In our tests, for instance, we use this API instead of moving a map by gestures in order to move it precisely to the required position. The commands shown here are far from the complete list of what the driver supports; a more detailed list and notes on the commands can be found in the wiki. Let's create a simple test based on all these commands. A code for a simple test:

# coding: utf-8
import os

from selenium.webdriver import Remote
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


class TestAutoSuggestBox(object):
    def setup_method(self, _):
        executor = "http://localhost:{}".format(os.environ.get('WINIUM_PORT', 9999))
        self.driver = Remote(command_executor=executor,
                             desired_capabilities={'app': 'aut.appx'})

    def test_select_suggest(self, waiter):
        self.driver.execute_script("mobile: OnScreenKeyboard.Disable")
        pivots = self.driver.find_elements_by_class_name("Windows.UI.Xaml.Controls.Primitives.PivotHeaderItem")
        pivots[1].click()

        autosuggestion_box = waiter.until(EC.presence_of_element_located((By.ID, 'MySuggestBox')))
        autosuggestion_input = autosuggestion_box.find_element_by_class_name('Windows.UI.Xaml.Controls.TextBox')
        autosuggestion_input.send_keys('A')

        suggestions_list = waiter.until(EC.presence_of_element_located((By.XNAME, 'SuggestionsList')))
        suggestions = suggestions_list.find_elements_by_class_name('Windows.UI.Xaml.Controls.TextBlock')

        expected_text = 'A2'
        for suggest in suggestions:
            if suggest.text == expected_text:
                suggest.click()
                break

        assert expected_text == autosuggestion_input.text

    def teardown_method(self, _):
        self.driver.quit()

We create a new session for this example. We hide the screen keyboard (both for a demonstration and to get it out of the way).

We switch to the second tab in the pivot element. We find an input field and start typing. Then we select one of the suggestions in the list and check if the value in the input field is the same as the prompt value. Then we close the session.

Starting the tests

Now comes the easiest part: run Winium.StoreApps.Driver.exe (you can download it from GitHub), run the tests with your favourite test runner and enjoy the magic.

Demo.

Winium CodedUI

Main features and limitations of the driver. The idea of creating a prototype for a Selenium driver that would wrap CodedUI came into our heads after Codefest 2015. We brought it to life, and the result is currently available on GitHub.

Main features:

  • It does not require modifying the tested app; you can even test pre-installed apps or apps downloaded from the store
  • It works both on emulators and on devices
  • It is compatible with Json Wire Protocol
  • It supports native applications
  • It already has limited support for hybrid apps.

Limitations:

  • It is an early prototype, so there may be some stability problems
  • A license for Visual Studio Premium or higher (2013) or Business (2015) is required
  • One session only.

How it all works

It operates on the same principles as StoreApps, but instead of embedding a server into the app we launch the server as a separate background process via vstest.console and CodedUI. This test server has direct access to the device and to the UI of running applications via the Accessibility API (to search for elements, for example) and via the CodedUI API (for gestures, etc.).

Interaction structure among tests, the driver and the tested app.

How to prepare an application and how to code tests

Since this approach does not require changes to the tested application, you can test both released app versions and pre-installed apps. That means this version of the driver is as close as it can be to the Appium philosophy. This is both an advantage and a disadvantage, because it restricts access to some of the app's internals.

No particular app preparation is required. The tests are coded and run in the same manner as they are done for Winium.StoreApps.

Here is a demo video where we automate the creation of an event in the standard pre-installed Calendar app. Example code:

from time import sleep

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.wait import WebDriverWait


def find_element(driver, by, value):
    """
    :rtype: selenium.webdriver.remote.webelement.WebElement
    """
    return WebDriverWait(driver, 5).until(
        expected_conditions.presence_of_element_located((by, value)))


winium_driver = webdriver.Remote(
    command_executor='http://localhost:9999',
    desired_capabilities={
        'deviceName': 'Emulator',
        'locale': 'en-US',
    })

# AutomationId for tiles can not be used to find a tile directly,
# but can be used to launch apps by switching to window
# Actual tile_id is very very very long
# {36F9FA1C-FDAD-4CF0-99EC-C03771ED741A}:x36f9fa1cyfdady4cf0y99ecyc03771ed741ax:Microsoft.MSCalendar_8wekyb3d8bbwe!x36f9fa1cyfdady4cf0y99ecyc03771ed741ax
# but all we care about is the part after the last colon
winium_driver.switch_to.window('_:_:Microsoft.MSCalendar_8wekyb3d8bbwe!x36f9fa1cyfdady4cf0y99ecyc03771ed741ax')

# accept permission alert if any
try:
    accept_btn = winium_driver.find_element_by_name("allow")
    accept_btn.click()
except NoSuchElementException:
    pass

# now we are in the calendar app
new_btn = find_element(winium_driver, By.NAME, "new")
new_btn.click()
sleep(1)  # it all happens fast, let's add sleeps

subject = find_element(winium_driver, By.ID, "EditCardSubjectFieldSimplified")
subject.send_keys(u'Winium Coded UI Demo')
sleep(1)

# we should have searched for the location field by name or something, but the Calendar app
# uses slightly different classes for the location field in 8.1 and 8.1 Update;
# searching by class works on both
location = winium_driver.find_elements_by_class_name('TextBox')[1]
location.send_keys(u'Your computer')
sleep(1)

save_btn = find_element(winium_driver, By.NAME, "save")
save_btn.click()
sleep(2)

winium_driver.close()
winium_driver.quit()

How does Winium help us?

Is there an advantage of creating this tool instead of using platform-specific ones?

All our mobile app teams are using unified tools now. This lets us share experience, build a single toolkit and infrastructure for launching our tests, and avoid spreading our attention across different platforms. Notably, one skilled engineer successfully switched from iOS automation to Windows Phone, shared her experience and instructed our test engineers, which significantly raised the level of the project.

In terms of infrastructure, it allowed us to focus on creating one tool (vmmaster) which provides a reproducible test environment on demand. But that is a good subject for another article.

Most importantly, it made it possible to start uniting our tests for various platforms into a single project (demo).

To sum it up, we now have:

  • Fewer wheels to reinvent
  • Less code duplication
  • Easier support
  • Better code quality
  • Sharing knowledge
  • Systematic approach
  • Faster development.

All of this is open source, of course, so feel free to use these tools to automate tests for Windows Phone apps. By the way, Winium.StoreApps is absolutely free: to use it you just need to download the latest release and install the emulators or Visual Studio Community with the mobile SDK. To use Winium.CodedUi, however, you will need a paid version of Visual Studio Premium or higher. And again: here is a link to the repository and other open source 2GIS products.

Nick Abalov, 2GIS.

Migrating to PHP 5.5 and Unit Tests

$
0
0

Four years have passed since our migration from PHP 4.4 to PHP 5.3 at Badoo. It is high time to upgrade to a newer PHP, PHP 5.5.

We had many reasons to upgrade: apart from a number of new features, PHP 5.5 greatly improves performance.

In this article, we will tell you about our migration to PHP 5.5, about traps we fell in, and about our new system to run unit tests based on PHPUnit.

Fig. 1. General architecture

Problems during migration from PHP 5.3 to PHP 5.5

Last time we were migrating from the fourth PHP version to the fifth one. Notably our PHP 5.3 version featured patches to make the “old” PHP syntax work, e.g. $a = &new ClassName();, and to make our codebase work simultaneously with PHP4 and PHP5. This time we did not have limitations like that, so during migration we just found all deprecated language constructs and replaced them with newer ones. Then we were done with rewriting the code.

The main problems we faced were the following:

  • Part of deprecated features of the language were removed
  • Mysql extension became deprecated
  • Low performance of runkit extension that we use for writing unit tests

After migrating to PHP 5.5, the execution of our unit tests became significantly longer, several times longer in fact, so we decided to improve our test launcher once again in order to resolve this issue.

Runkit and PHP 5.4+

Facebook's xhprof extension quickly helped us find that our tests were running slowly due to the significantly reduced performance of the runkit extension, so we started digging for the cause of the problem. The likely culprit was a mysterious runtime cache added in PHP 5.4 which has to be reset after every "runkit_*_redefine" function call.

To reset this cache, the runkit extension runs through all loaded classes, methods and functions. We were naive enough to try switching the reset off, but that ended up crashing PHP, so we had to look for a different solution.

Microsuite concept

Prior to migrating to PHP 5.5 we already had a launcher for unit tests, a phpunit add-in that would split one big suite of unit tests into several smaller ones. At that point we were already running tests in 11 threads.

We carried out several simple benchmarks and found that the tests could be executed several times faster if the suite was split into 128 or more parts (for a fixed number of processor cores) instead of the previous 11. Each resulting suite contained around 10-15 files, so we called this the microsuite concept. We ended up with around 150 microsuites, and each of them maps naturally onto a queue task (a task contains the list of files for the corresponding suite and launches a phpunit instance with the corresponding parameters).
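To illustrate the idea (this is a simplified sketch, not our actual launcher code), splitting a flat list of test files into microsuites of at most 10 files is trivial:

def split_into_microsuites(test_files, max_files_per_suite=10):
    """Split a flat list of test files into 'microsuites' of limited size.

    Each chunk becomes one queue task that launches its own phpunit
    process with just those files.
    """
    return [
        test_files[i:i + max_files_per_suite]
        for i in range(0, len(test_files), max_files_per_suite)
    ]

# Example: ~1500 test files turn into ~150 microsuites of 10 files each.
suites = split_into_microsuites(['Test{}.php'.format(n) for n in range(1500)])
print(len(suites))  # 150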

Cloud tests

As it happens, the author of this article is not part of QA at all; he was one of the main developers of a new script framework, a sort of "cloud" for scripts that supports the concept of tasks (we have given talks about our cloud at several conferences and will definitely cover it in detail on Habrahabr). Since we have tasks (file lists) for every phpunit suite, we can put them into the cloud as well, which is exactly what we've done. The idea is very simple: since we have many small tasks, they can run independently on several servers, which accelerates the completion of the tests even more.

General architecture

We run tests from several different sources:

  1. Automatic test runs using our automated deployment tool called AIDA:
    • By git branch of the task;
    • By build (the code that would go to production)
    • By master branch
  2. Manual test runs, initiated by developers or QA engineers from the dev-server.

All these test runs have one thing in common: one first has to fetch a branch from some source and then run the tests on it. This defined the architecture of our new cloud-based test launcher (fig. 1 at the beginning of the article):

First, one task is created for a master process, which does the following:

  • Chooses an available directory in a database (fig. 2)
  • Downloads a git branch from a required spot (a central repository or a dev-server)
  • Runs git merge master (optional)
  • Creates a new commit with all the local changes (optional)

Fig. 2. List of available directories stored in MySQL

Then the master process analyses the original phpunit suite and splits it into as many parts as required (no more than 10 files per microsuite). The resulting tasks ("thread" processes) are then added to the cloud and run on available servers.

The first task to run on a new server prepares the selected directory for the test run and fetches the required commit from the server where the master process is active. To prevent other tasks on the same server from repeating the work that the first task has already done, file locks are used (fig. 3).
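A minimal sketch of such a per-directory lock (using flock in Python; the real launcher is written in PHP and is more involved):

import fcntl
import os

def prepare_directory_once(directory):
    """Let only the first task on a server prepare the working copy; the rest wait."""
    lock_path = os.path.join(directory, '.prepare.lock')
    with open(lock_path, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # other tasks block here
        try:
            marker = os.path.join(directory, '.prepared')
            if not os.path.exists(marker):
                # fetch the required commit, check it out, warm caches, etc.
                open(marker, 'w').close()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)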

In order to use the resources of our cluster more fully, several test runs can be active at the same time: the tests themselves run quickly, and executing the code takes much less time than preparing the source code.

Fig. 3. Locks for directory preparation

Some tests run significantly slower than others. We keep timing statistics for every test, so we use them to run the longest tests first. This strategy gives a more uniform load on the servers during testing and reduces the total test time (fig. 4; a sketch of the idea follows the figure).

Fig. 4. Time tracking for tests’ completion
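A sketch of this longest-first ordering (the timing data and suite names below are made up):

def order_by_duration(microsuites, last_run_seconds):
    """Run the historically slowest microsuites first so they don't become
    the long tail at the end of the run."""
    return sorted(microsuites,
                  key=lambda suite: last_run_seconds.get(suite, 0.0),
                  reverse=True)

queue = order_by_duration(
    ['suite_a', 'suite_b', 'suite_c'],
    {'suite_a': 12.0, 'suite_b': 48.5, 'suite_c': 3.2},
)
print(queue)  # ['suite_b', 'suite_a', 'suite_c']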

If all goes well, our suite (consisting of 28,000 unit tests) completes in 1 minute. The tests that take longer become a bottleneck, so the system puts their authors on a hall of shame that is printed at every test run. Apart from that, when only a few tests are left running, their list is also shown (fig. 5).

Fig. 5. Hall of shame: tests that are run longer than one minute

The unit test launcher became the first script to be moved onto the cloud. It helped us troubleshoot multiple bugs and faults in the cloud itself, and it made the unit tests complete much faster.


Summary

Migration to PHP 5.5 has allowed us to use new features of the language, has greatly reduced the CPU load on our servers (a 25% reduction on average) and let us move our unit test launcher to the cloud. The latter reduced the total test time from 5-6 minutes (and from dozens of minutes on PHP 5.5) to one minute, shifting the load from the shared dev-server to the cloud.

Yury Nasretdinov, Badoo developer


How Badoo saved one million dollars switching to PHP7

$
0
0

Introduction

We did it! Hundreds of our application servers are now running on PHP7 and doing just fine. By all accounts, ours is only the second project of this scale (after Etsy) to switch to PHP7. During the process of switching over we found a couple bugs in the PHP7 bytecode cache system, but thankfully it’s all fixed now. Now we’re excited to share our good news with the whole PHP community: PHP7 is completely ready for production, stable, significantly reduces memory consumption, and improves performance dramatically.

In this article, we’ll discuss the process of switching over to PHP7 in detail, explaining what difficulties we encountered, how we dealt with them, and what the final results were. But first let’s step back a bit and look at some of the broader issues:

The idea that databases are the bottleneck in web projects is an all-too-common misconception. A well-designed system is balanced: when the input load increases, all parts of the system take the hit; likewise, when a certain threshold is reached, all components are hit, not just the database with its hard disks, but the CPU and the network as well. Given this reality, the processing power of the application cluster is arguably the most important factor. In many projects, this cluster is made up of hundreds or even thousands of servers, which is why taking the time to tune the application cluster's processing load more than justifies itself economically (by a million dollars in our case).

In PHP web apps, the processor consumes as much as in any dynamic high-level language: a lot. But PHP developers have faced a particular obstacle (one that has made them the victims of vicious trolling from various communities): the absence of a JIT or, at the very least, a generator of compilable source code in languages like C/C++. The inability of the PHP community to deliver such a solution within the core project fostered an unfortunate tendency: the main players started to slap together their own solutions. This is how HHVM was born at Facebook, KPHP at VKontakte, and maybe some other similar hacks. Thankfully, in 2015, PHP started to "grow up" with the release of PHP7. Though there is still no JIT, it's hard to overestimate how significant these changes to the "engine" are. Now, even without JIT, PHP7 holds its own against HHVM (e.g. benchmarks from the LightSpeed blog or the PHP devs' benchmarks). The new PHP7 architecture will even simplify the addition of JIT in the future.

Our “platform” developers at Badoo have paid careful attention to every hack to come out in recent years, including the HHVM pilot project, but we decided to wait for PHP7’s arrival given how promising it was. Now we’ve launched Badoo on PHP7! With over three million lines of PHP code and 60,000 tests, this project took on epic proportions. Keep reading to find out how we handled these challenges, came up with a new PHP app testing framework (which, by the way, is already open source), and saved a million bucks along the way.

Experimenting with HHVM

Before switching over to PHP7, we spent some time looking for other ways to optimize our backend. The first step was, of course, to play around with HHVM.

Having spent a few weeks experimenting, we got quite respectable results: after warming up JIT on our framework, we saw triple digit gains in speed and CPU use.

On the other hand, HHVM proved to have some serious drawbacks:

  • Deploying is difficult and slow. During deploy you have to warm up the JIT cache, and while the machine is warming up it shouldn't be loaded with production traffic, because everything runs slowly. The HHVM team also doesn't recommend warming it up with parallel requests. And the warm-up phase for a big cluster isn't quick either. Additionally, for big clusters consisting of a few hundred machines, you have to learn how to deploy in batches. Thus the changes to the deploy architecture and procedure would be substantial, and it's hard to estimate in advance how much time they would take. For us it's important for deploy to be as simple and fast as possible: our developer culture prides itself on putting out two planned releases a day and being able to roll out many hot fixes.
  • Inconvenient testing. We rely heavily on the runkit extension, which wasn't available in HHVM. A bit later we'll go into more detail about runkit, but suffice it to say it's an extension that lets you change the behaviour of variables, classes, methods and functions, practically whatever you want, on the fly. This is accomplished via an integration that gets to the very "guts" of PHP. The HHVM engine bears only a faint resemblance to PHP's, however, so their respective "guts" are quite different. Due to the extension's particular features, implementing runkit independently on top of HHVM is insanely difficult, and we would have had to rewrite tens of thousands of tests in order to be sure that HHVM was working correctly with our code. This just didn't seem worthwhile. To be fair, we would later encounter this same problem with all the other options at our disposal, and we still had to redo a lot of things, including getting rid of runkit during the switch to PHP7. But more about that later.
  • Compatibility. The main issues are incomplete compatibility with PHP5.5 (see: https://github.com/facebook/hhvm/blob/master/hphp/doc/inconsistencies, https://github.com/facebook/hhvm/issues?labels=php5+incompatibility&state=open) and incompatibility with existing extensions (of which we have dozens). Both of these incompatibilities are a result of an obvious drawback of the project: HHVM is not developed by the larger community, but rather within a division of Facebook. In situations like this, it’s easier for companies to change their internal rules and standards without referencing the community and volumes of code contained therein. In other words, they cut themselves off and solve the problem using their own resources. Therefore, in order to handle tasks of similar volume, a company needs to have Facebook-like resources to devote to both the initial implementation as well as continuing support. This proposition is both risky and potentially expensive, so we decided against it.
  • Potential. Even though Facebook is a huge company with numerous top-notch programmers, we doubted that their HHVM developers would prove more powerful than the entire PHP-community. We reckoned that as soon as something similar to HHVM appeared for PHP, the former would start to slowly fade out of use.

So we patiently awaited PHP7.

The switch to the new version of the interpreter was both an important and difficult process, and we prepared for it by putting together a precise plan. This plan consisted of three stages:

  • Changing the PHP build/deploy infrastructure and adapting the mass of extensions we'd already written
  • Changing the infrastructure and testing environments
  • Changing the PHP app code

We’ll get into the details of all these stages later.

Changes to the engine and extensions

At Badoo, we have our own actively supported and updated PHP branch, and we started switching over to PHP7 even before its official release, so we had to regularly rebase our tree onto upstream PHP7 in order to keep up with every release candidate. All the patches and customisations that we use in our everyday work also had to be ported between versions and work correctly.

We automated the process of downloading and building dependencies, extensions and the PHP tree for versions 5.5 and 7.0. This not only simplified our current work, but also bodes well for the future: when version 7.1 comes out, everything will be in place.

As mentioned, we also had to turn our attention to extensions. We support over 40 of them, more than half of which are open-source extensions carrying our own modifications.

In order to switch them over as quickly as possible, we decided to run two processes in parallel. The first involved rewriting the most critical extensions one by one: the blitz template engine, the data cache in shared memory/APCu, the pinba statistics collector, and a few other custom extensions for internal services (in total, we reworked about 20 extensions ourselves).

The second involved actively ridding ourselves of all extensions that are only used in non-critical parts of the infrastructure in order to unclutter things as much as possible. We were easily able to get rid of 11 extensions, which is not an insignificant figure!

Additionally, we started to actively discuss PHP7 compatibility with those who maintain the main open extensions (special thanks to xdebug developer Derick Rethans).

We’ll go into more detail regarding the technical details of porting extensions to PHP7 a bit later.

Developers made a lot of changes to internal APIs in PHP7, which meant we had to alter a lot of extension code.

Here are the most important changes:

  • zval * -> zval. In earlier versions, a zval structure was heap-allocated for every new variable; now zvals are allocated in place (for example, on the stack).
  • char * -> zend_string. Version 7 caches strings aggressively in the PHP engine. For this reason, the new engine switches completely from plain C strings to the zend_string structure, which stores a string along with its length.
  • Changes in the array API. zend_string is now used as the key, and the array implementation replaces the doubly linked list with an ordinary array allocated as one block of memory instead of many small ones.

All this makes it possible to radically reduce the number of small memory allocations and, as a result, speed up the PHP engine by double digit percentage points.

We should note that all these changes made it necessary to at least touch every extension, if not rewrite it completely. While we could rely on the authors of the bundled extensions to make the necessary changes, we were of course responsible for updating our own, and the amount of work was substantial. Due to the changes in the internal APIs, it was easier just to rewrite some sections of code.

Unfortunately, the introduction of new garbage-collected structures, along with the faster code execution, made the engine itself more complex, and it became harder to locate problems. One such problem concerned OpCache. During a cache flush, a cached file’s bytecode could be destroyed at the very moment it was being used by another process, and the whole thing fell apart. Here’s how it looked from the outside: a zend_string used in a function name or as a constant would suddenly be corrupted, with garbage appearing in its place.

Given that we use a lot of in-house extensions, many of which deal specifically with strings, we suspected that the problem lay in how those extensions handled strings. We wrote a lot of tests and conducted plenty of experiments, but didn’t get the results we expected. Finally, we asked the main PHP engine developer, Dmitri Stogov, for help.

One of his first questions was “Did you clear the cache?” We explained that we had, in fact, cleared the cache every time. At that point it became clear that the problem was not on our end, but in OpCache. We quickly put together a reproducible case, which helped get the problem replayed and fixed within a few days. Without this fix, which came out in version 7.0.4, it wouldn’t have been possible to put PHP7 into stable production.

Changes to testing infrastructure

We take special pride in the testing we do at Badoo. We deploy server PHP code to production two times a day, and every deploy contains 20-50 tasks (we use feature branches in git and automated builds with tight JIRA integrations). Given this schedule and task volume, there’s no way we could go without autotests. Currently, we have around 60 thousand unit tests with about 50% coverage, which run for an average of 2-3 minutes in the cloud (see our article for more). In addition to unit tests, we use higher-level autotests, integration and system tests, selenium tests for web pages, and calabash tests for mobile apps. Taken as a whole, this allows us to quickly reach conclusions about the quality of each concrete version of code and apply the appropriate solution.

Switching to the new version of interpreter was a major change fraught with potential problems, so it was especially important that all tests worked. In order to clarify exactly what we did and how we managed to do it, let’s take a look at how test development has evolved over the years at Badoo.

Often, people who start thinking about implementing product testing (or, in some cases, who have already started implementing it) discover during their experiments that their code is “not ready for testing”. That is why it’s important for the developer to keep in mind, while writing the code, that it should be testable. The architecture should allow unit tests to replace calls and external dependency objects in order to isolate the code being tested from external conditions. Of course, this is a much-hated requirement, and many programmers take a stand against writing “testable” code out of principle: they feel that these restrictions fly in the face of “good code” and often don’t pay off. And you can imagine the sheer volume of code that isn’t written “by the rules”, which results in testing being delayed “until better times”, or in attempts to make do with small tests that only cover what can easily be covered (which basically means the tests don’t yield the expected results).

I’m not trying to say that our company is an exception; we also didn’t implement testing right from the start of the project. There was already a lot of code that worked fine in production and brought in cash, so it would have been stupid to rewrite it just to make it testable (as recommended in the literature). That would take too long and be too expensive.

Fortunately, we already had an excellent tool that allowed us to solve the majority of our problems with “untestable code”: runkit. While a script is running, this PHP extension lets you change, delete and add methods, classes, and functions used in the program. It has many other capabilities as well, but we didn’t use them. The tool was developed and supported for many years, from 2005 to 2008, by Sara Golemon (who now works at Facebook and, interestingly enough, on HHVM). Since 2008 and through to the present, it has been maintained by Dmitri Zenovich (who headed the testing division at Begun and Mail.ru). We’ve also done our bit to contribute to the project.

On its own, runkit is a very dangerous extension. It lets you change constants, functions, and classes while the script that uses them is running; in essence, it’s a tool that lets you rebuild a plane in mid-flight. Runkit reaches right into the “guts” of PHP on the fly, and one mistake or deficiency makes everything go up in flames: either PHP crashes, or you spend a lot of time hunting memory leaks and doing other low-level debugging. Nonetheless, this tool was essential for our testing: introducing tests into a project without major rewrites can only be done by changing the code on the fly.

But runkit turned out to be a big problem during the switch to PHP7 because it didn’t support the new version. We could have sponsored the development of a new version, but, looking at the long-term perspective, this didn’t seem like the most reliable path to pursue. So we looked at a few other options.

One of the most promising solutions was to shift from runkit to uopz. The latter is also a PHP extension with similar functionality, launched in 2014. Our colleagues at Wamba suggested uopz, pointing to its impressive speed. The maintainer of the uopz project, by the way, is Joe Watkins (First Beat Media, UK). Unfortunately, switching all our tests to uopz didn’t work out: in some places we got fatal errors, in others segfaults. We filed a few reports, but unfortunately there was no movement on them (e.g. https://github.com/krakjoe/uopz/issues/18). Trying to deal with the situation by rewriting our tests would have been very expensive, and more issues could well have emerged even if we did.

Given that we had to rewrite a lot of code no matter what, and were dependent on external projects like runkit or uopz regardless of how problematic they were, we came to the obvious conclusion that we should rewrite our code to be as independent as possible. We also pledged to do everything we could to avoid similar problems in the future, even if we ended up switching to HHVM or any similar product. This is how we arrived at our own framework.

The system got the name “SoftMocks”, with “soft” highlighting the fact that it works in pure PHP without any extensions. The project is open source and available as an add-on library. SoftMocks is not tied to the particulars of the PHP engine implementation; it works by rewriting code “on the fly”, analogously to the Go! AOP framework.

Tests in our code primarily use the following:

  1. Implementation override of one of the class methods
  2. Function execution result override
  3. Changing the value of global constants or class constants
  4. Adding a method to a class

All of these cases are handled successfully by runkit. Code rewriting makes the same things possible, with some reservations; a simplified sketch of a runkit-based override is shown below.
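For illustration, a simplified sketch of the kind of runkit-based override our tests relied on might look like this (the class, method and constant names below are invented for the example):

<?php
// Invented for illustration: a production class used by the code under test
define('PAYMENTS_ENABLED', true);

class BillingGateway
{
    public function charge($userId, $amount)
    {
        // talks to a real payment provider in production
    }
}

// Replace the method body on the fly so the test never touches a real provider
runkit_method_redefine(
    'BillingGateway',
    'charge',
    '$userId, $amount',
    'return true; // pretend the charge always succeeds'
);

// Constants and even built-in functions can be swapped the same way
// (overriding built-ins requires runkit.internal_override=1 in php.ini)
runkit_constant_redefine('PAYMENTS_ENABLED', false);
runkit_function_redefine('time', '', 'return 1400000000; // freeze time for the test');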

Though we don’t have space to go into much detail about SoftMocks in this article, we plan on devoting a separate article to this topic in the future. Here we’ll hit some of the main points:

  • User code is included through a rewriting wrapper function, and all include operators inside it are automatically rewritten to go through the same wrapper.
  • A check for existing overrides is added inside every user-defined method; if an override exists, the corresponding code is executed instead. Direct function calls are replaced by calls through the wrapper, which lets us intercept both built-in and user-defined functions (see the sketch after this list).
  • Access to constants in the code is rewritten into calls through the wrapper, so constants can be overridden dynamically as well.
  • SoftMocks uses Nikita Popov’s PHP-Parser. The library isn’t very fast (parsing is about 15 times slower than token_get_all), but its interface lets you traverse the parse tree and offers a convenient API for handling syntactic constructions of arbitrary complexity.
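To make the idea more concrete, here is a heavily simplified sketch of what such rewriting boils down to. The wrapper class and its methods are invented for illustration and are not the actual SoftMocks API:

<?php
// Invented for illustration: a toy override registry standing in for the
// real SoftMocks wrapper. The original call
//     $len = strlen($name);
// is rewritten into
//     $len = \MockRegistry::call('strlen', [$name]);
// so that a test can substitute the result.

class MockRegistry
{
    private static $functionOverrides = [];

    public static function redefine($function, callable $replacement)
    {
        self::$functionOverrides[$function] = $replacement;
    }

    public static function call($function, array $args)
    {
        if (isset(self::$functionOverrides[$function])) {
            return call_user_func_array(self::$functionOverrides[$function], $args);
        }
        return call_user_func_array($function, $args); // fall through to the real function
    }
}

// In a test:
MockRegistry::redefine('strlen', function ($s) {
    return 42;
});
var_dump(MockRegistry::call('strlen', ['abc'])); // int(42)

SoftMocks itself, of course, covers methods, constants and many more cases than this toy registry.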

Now to get back to the main point of this article: the switch to PHP7. After switching the project over to SoftMocks, we still had about 1000 tests that we had to fix manually. You could say that this wasn’t a bad result, given that we started with 60,000 tests. By comparison with runkit, test run speeds didn’t decrease, so there are no performance issues with SoftMocks. To be fair, we should note that uopz is supposed to work significantly faster.

Utilities and app code

Though PHP7 contains many new developments, there are also a few issues with backward compatibility. The first thing we did to tackle these problems was read the official migration guide. From this it quickly became clear that without fixing the existing code, we were risking both getting fatal errors in production and encountering changes in code behavior that wouldn’t show up in logs, but would nonetheless cause the app to misbehave.

Badoo has several PHP code repositories, the biggest of which contains more than 2 million lines of code. Furthermore, we’ve implemented many different things in PHP, from our web business logic and backend for mobile apps, to testing utilities and code deploys. Our situation was further complicated by the fact that Badoo is a project with a long history; we’ve been around for ten years now and, unfortunately, the legacy of PHP4 is still very present. At Badoo, we don’t use the “just stare at it long enough” method of error detection. The so-called Brazilian System - whereby code is deployed in production as is and you have to wait and see where it breaks - carries too high a risk that the business logic will break down for a large percent of users and is thus also an unworkable option. For these reasons, we started looking for ways to automate the search for places of incompatibility.

Initially, we tried to use the IDEs that are popular among our developers but, unfortunately, they either didn’t support the syntax and features of PHP7 yet, or they didn’t function well enough to find all the obviously dangerous places in the code. After conducting a bit of research (i.e. googling), we decided to try the php7mar utility, a static code analyzer written in PHP. It is very simple to use, works fairly quickly, and gives you the results in a text file. Of course, it’s not a panacea; there are both false positives and failures to find particularly well-hidden problem spots. Despite this, the utility helped us root out about 90% of the problems, which dramatically sped up and simplified the process of getting the code ready for PHP7.

The most commonly encountered and potentially dangerous problems for us were the following:

  • Changes in the behavior of func_get_arg() and func_get_args(). In PHP5, these functions return argument values as they were at the moment of being passed in; in PHP7 they return the current values at the moment func_get_args() is called. In other words, if an argument variable is modified inside the function before func_get_args() is called, the code may behave differently than it did in version 5. Worst of all, when the app’s business logic breaks this way, nothing shows up in the logs (a minimal illustration follows this list).
  • Indirect access to object variables, properties, and methods. And once again, the danger lies in the fact that the behavior can change “silently”. For those looking for more information, the differences between versions are described in detail here.
  • Use of reserved class names. In PHP7, you can no longer use bool, int, float, string, null, true and false as class names. And yeah, we had a Null class. Getting rid of it actually made things easier, because it had often led to errors.
  • We found many potentially problematic foreach constructions that use a reference. However, since we had previously tried not to modify the iterated array inside foreach or rely on its internal pointer, practically all of them behaved the same in versions 5 and 7.
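To illustrate the first point about func_get_args(), a snippet like the following behaves differently under the two versions:

<?php
function report($status)
{
    $status = 'modified';
    // PHP5 prints the value as it was passed in ("original"),
    // PHP7 prints the current value at the moment of the call ("modified")
    var_dump(func_get_args());
}

report('original');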

The remaining instances of incompatibility were either rarely encountered (like the ‘e’ modifier in regular expressions) or fixed with a simple replacement (for example, all constructors must now be named __construct(); using the class name is no longer permitted).
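The fixes for those two cases looked roughly like this (the surrounding code is invented for illustration):

<?php
$in = 'price: 21';

// The /e modifier is gone in PHP7; preg_replace_callback() replaces it.
// Before (PHP5 only): $out = preg_replace('/(\d+)/e', '$1 * 2', $in);
$out = preg_replace_callback('/(\d+)/', function ($m) {
    return $m[1] * 2;
}, $in); // "price: 42"

// PHP4-style constructors are renamed:
class LegacyWidget
{
    public function __construct()  // was: public function LegacyWidget()
    {
    }
}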

But before we even started fixing the code, we were worried that as some developers were making the necessary compatibility changes, others would continue to write code that was incompatible with PHP7. To solve this issue, we put a pre-receive hook in every git-repository that executes php7 -l on changed files (in other words, that makes sure the syntax matches PHP7). This doesn’t guarantee that there won’t be any compatibility issues, but it does clear up a host of problems. In other cases, the developers just had to be more attentive. Besides that, we started to run the whole set of tests on PHP7 and compare them with the results on PHP5.
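As a rough idea of what such a hook can look like, here is a simplified sketch written in PHP; our actual hook differs and handles more edge cases (new branches, deleted files and so on):

#!/usr/bin/env php
<?php
// Simplified sketch of a pre-receive hook that runs the PHP7 linter on
// every changed .php file pushed to the repository.
$failed = false;
while ($line = fgets(STDIN)) {
    list($old, $new, $ref) = explode(' ', trim($line));
    $diff = shell_exec(sprintf('git diff --name-only --diff-filter=ACM %s %s',
        escapeshellarg($old), escapeshellarg($new)));
    foreach (array_filter(explode("\n", (string)$diff)) as $file) {
        if (substr($file, -4) !== '.php') {
            continue;
        }
        // Take the pushed version of the file and feed it to "php7 -l"
        $output = [];
        exec('git show ' . escapeshellarg("$new:$file") . ' | php7 -l 2>&1', $output, $code);
        if ($code !== 0) {
            fwrite(STDERR, "PHP7 syntax check failed for $file:\n" . implode("\n", $output) . "\n");
            $failed = true;
        }
    }
}
exit($failed ? 1 : 0);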

Additionally, developers were forbidden from using any new feature of PHP7, i.e. we didn’t disable the old pre-receive hook php5 -l. This allowed us to get code compatible with versions 5 and 7 of the interpreter. So why is this important? Because in addition to problems with the PHP code, there are potential issues with PHP7 and its extensions themselves (we can personally attest to them, as evidenced above). And unfortunately, not all problems were reproduced in the test environment; some we only saw under heavy load in production.

Launch into Battle and the Results

Clearly, we needed a way to change PHP versions simply and quickly on any number and type of server. To enable this, all paths to the CLI interpreter in all the code were replaced with /local/php, which in turn was a symlink to either /local/php5 or /local/php7. This way, to change the PHP version on a server, we had to switch the link (atomicity of that operation matters for CLI scripts), stop php5-fpm and launch php7-fpm. In nginx, we could instead have had two upstreams for php-fpm and run php5-fpm and php7-fpm on different ports, but we didn’t like the more complicated nginx configuration.
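As an illustration, an atomic link switch of this kind can be sketched in PHP roughly as follows (our real deploy tooling is different, and the function name here is made up):

<?php
// Atomically repoint /local/php at the requested interpreter: create a
// temporary link and rename() it over the old one (rename(2) replaces the
// destination atomically on Linux), so CLI scripts never see a missing link.
function switch_php($version) // 'php5' or 'php7'
{
    $link = '/local/php';
    $tmp  = $link . '.tmp';

    if (!symlink('/local/' . $version, $tmp)) {
        throw new RuntimeException('failed to create temporary symlink');
    }
    if (!rename($tmp, $link)) {
        unlink($tmp);
        throw new RuntimeException('failed to replace ' . $link);
    }
}

switch_php('php7'); // then stop php5-fpm and start php7-fpm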

After executing everything listed above, we switched to running Selenium tests in the pre-production environment, which turned up several problems that hadn’t been noticed earlier. These concerned both the PHP code (for example, we could no longer use the outdated global variable $HTTP_RAW_POST_DATA and had to switch to file_get_contents("php://input")), and the extensions (where there were various kinds of segmentation faults).

After fixing the problems discovered at that stage and rewriting the unit tests (during which we also found a few bugs in the interpreter like this), we finally went into what we call “quarantine” production. This is when we launch a new version of PHP on a limited number of servers. We started with one server in each major cluster (web backend, mobile apps backend, cloud) and increased the quantity little by little as long as no errors occurred. The first large cluster to switch completely to PHP7 was the cloud, because that cluster has no php-fpm requirements. The fpm clusters had to wait for us to find (and then for Dmitri Stogov to fix) the OpCache problem. After that was taken care of, we switched over the fpm clusters as well.

Now for the results. In brief, they are really quite impressive. Below you can see the request time distribution, rusage (CPU time), memory consumption and processor load graphs for the largest of our clusters (consisting of 263 servers), the mobile apps backend in the Prague data center.

Request time distribution:

RUsage (CPU time):

Memory usage:

CPU load (%) on the mobile backend cluster:

With all this in place, processing time was cut in half, which improved overall response time by about 40%, since a certain share of request processing time is spent communicating with the database and daemons, and we obviously didn’t expect that part to speed up from switching to PHP7. Besides this, the overall load on the cluster fell below 50 percent, which adds to the effect thanks to Hyper-Threading: roughly speaking, once load rises above 50 percent, the hyper-threaded logical cores kick in, and they aren’t as effective as physical cores. But that’s already a topic for another article. Additionally, memory use, which had never been a bottleneck for us, was reduced by roughly eight times! And finally, we saved on the number of machines: the same number of servers can now withstand a much greater load, which lowers the costs of acquiring and servicing equipment. Results on the remaining clusters were similar, except that the gains on the cloud were a bit more modest (around 40% CPU) due to the absence of OpCache there.

So how much money did we save? Let’s tally it up! An app server cluster at Badoo consists of a bit more than 600 servers. By cutting CPU usage in half, we free up around 300 servers. If you consider the initial price of this hardware is ~$4K and factor in depreciation, then it works out to about a million dollars in savings plus $100,000 a year in hosting! And this doesn’t even take the cloud into consideration, which also saw a performance boost. Thus we are very pleased with the results.

Have you also made the switch to PHP7? We’d like to hear your opinion and will be happy to answer questions left in the comments.

Badoo Team

iOS Architecture Patterns

$
0
0

FYI: Slides from my presentation at NSLondon are available here.

Feeling weird while doing MVC in iOS? Have doubts about switching to MVVM? Heard about VIPER, but not sure if it worth it?

Keep reading to find the answers to the questions above; if you don’t find them, feel free to complain in the comments.

You are about to structure your knowledge about architectural patterns in an iOS environment. We’ll briefly review some of the popular patterns and compare them in theory and practice, going over a few examples. Follow the links throughout the article if you would like to read in more detail about each pattern.

Mastering design patterns might be addictive, so beware: you might end up asking yourself more questions now than before reading this article, like these:

Who is supposed to own networking requests: a Model or a Controller?

How do I pass a Model into a View Model of a new View?

Who creates a new VIPER module: a Router or a Presenter?


Why care about choosing the architecture?

Choosing the right architecture is important, especially when it comes to debugging. It is hard to keep a big class in mind as a whole entity, so you’ll always be missing some important detail. If you are already in this situation with your application, it is very likely that:

  • This class is a UIViewController subclass.
  • Your data is stored directly in the UIViewController.
  • Your UIViews do almost nothing
  • The Model is a dumb data structure
  • Your Unit Tests cover nothing

And this can happen despite the fact that you are following Apple’s guidelines and implementing Apple’s MVC pattern, so don’t feel bad. There is something wrong with Apple’s MVC, but we’ll get back to that later.

Let’s define the features of a good architecture:

  1. Balanced distribution of responsibilities among entities with strict roles.
  2. Testability usually comes from the first feature (and don’t worry: it is easy with appropriate architecture).
  3. Ease of use and a low maintenance cost.

Why Distribution?

Having a balanced distribution doesn’t overload the brain, while trying to figure out how things work. If you think the more you develop the better your brain will adapt to understanding complexity, then you are right. But this ability doesn’t scale linearly and reaches the cap very quickly. So the easiest way to defeat complexity is to divide responsibilities among multiple entities following the single responsibility principle.

Why Testability?

This is usually not a question for those who have already felt grateful to unit tests that failed after they added new features or refactored some intricacies of a class. This means the tests saved those developers from finding issues at runtime, which might happen when the app is already on a user’s device and the fix takes a week to reach the user.

Why Ease of use?

This does not require an answer but it is worth mentioning that the best code is the code that has never been written. Therefore the less code you have, the less bugs you have. This means that the desire to write less code should never be explained solely by laziness of a developer, and you should not favour a smarter solution closing your eyes to its maintenance cost.


MV(X) essentials

Nowadays we have many options when it comes to architecture design patterns:

  • MVC
  • MVP
  • MVVM
  • VIPER

The first three put the entities of the app into one of 3 categories:

  • Models  - responsible for the domain data or a data access layer which manipulates the data, think of ‘Person’ or ‘PersonDataProvider’ classes.
  • Views  -  responsible for the presentation layer (GUI), for iOS environment think of everything starting with ‘UI’ prefix.
  • Controller/Presenter/ViewModel -  the glue or the mediator between the Model and the View, in general responsible for altering the Model by reacting to the user’s actions performed on the View and updating the View with changes from the Model.

Having entities divided allows us to:

  • Understand them better (as we already know)
  • Reuse them (mostly applicable to the View and the Model)
  • Test them independently

Let’s start with MV(X) patterns and get back to VIPER later.

MVC

How it used to be

Before discussing Apple’s vision of MVC let’s have a look at the traditional one.

Traditional MVC

In this case, the View is stateless. It is simply rendered by the Controller once the Model changes. Think of a web page that completely reloads once you press a link to navigate somewhere else. Although it is possible to implement traditional MVC in an iOS application, it doesn’t make much sense due to an architectural problem: all three entities are tightly coupled, and each entity knows about the other two. This dramatically reduces the reusability of each of them, which is not what you want in your application. For this reason, we won’t even try to write a canonical MVC example.

Traditional MVC doesn’t seem to be applicable to modern iOS development.


Apple’s MVC

Expectation

Cocoa MVC

The Controller is a mediator between the View and the Model so that they don’t know about each other. The least reusable is the Controller and this is usually fine for us, since we must have a place for all that tricky business logic that doesn’t fit into the Model.

In theory, it looks very straightforward, but you feel that something is wrong, right? You’ve even heard people unabbreviating MVC as the Massive View Controller. Moreover, view controller offloading became an important topic for the iOS developers. Why does this happen if Apple just took the traditional MVC and improved it a bit?


Apple’s MVC

Reality

Realistic Cocoa MVC

Cocoa MVC encourages you to write Massive View Controllers, because they are so involved in the View’s life cycle that it’s hard to say they are separate. Although you still have the ability to offload some of the business logic and data transformation to the Model, you don’t have much choice when it comes to offloading work to the View: most of the time, the View’s only responsibility is to send actions to the Controller. The view controller ends up being the delegate and data source of everything, and is usually responsible for dispatching and cancelling network requests and… you name it.

How many times have you seen code like this:

var userCell = tableView.dequeueReusableCellWithIdentifier("identifier") as UserCell
userCell.configureWithUser(user)

The cell, which is the View, is configured directly with the Model, so the MVC guidelines are violated. This happens all the time, and usually people don’t feel it is wrong. If you strictly follow the MVC, then you are supposed to configure the cell from the controller without passing the Model into the View, but this increases the size of your Controller even more.

Cocoa MVC is reasonably unabbreviated as the Massive View Controller.

The problem might not be evident until it comes to the Unit Testing (hopefully, it does in your project). Since your view controller is tightly coupled with the view, it becomes difficult to test because you have to be very creative in mocking views and their life cycle. At the same time the view controller’s code has to be written in such a way that your business logic is separated as much as possible from the view layout code.

Let’s have a look at a simple playground example:

import UIKit

struct Person { // Model
    let firstName: String
    let lastName: String
}

class GreetingViewController: UIViewController { // View + Controller
    var person: Person!
    let showGreetingButton = UIButton()
    let greetingLabel = UILabel()

    override func viewDidLoad() {
        super.viewDidLoad()
        self.showGreetingButton.addTarget(self, action: "didTapButton:", forControlEvents: .TouchUpInside)
    }

    func didTapButton(button: UIButton) {
        let greeting = "Hello" + " " + self.person.firstName + " " + self.person.lastName
        self.greetingLabel.text = greeting
    }
    // layout code goes here
}

// Assembling of MVC
let model = Person(firstName: "David", lastName: "Blaine")
let view = GreetingViewController()
view.person = model

MVC example

MVC assembling can be performed in the presenting view controller.

This doesn’t seem very testable, right? We can move the generation of the greeting into a new GreetingModel class and test it separately, but we can’t test any presentation logic (although there is not much of it in the example above) inside the GreetingViewController without calling UIView-related methods directly (viewDidLoad, didTapButton), which might cause all the views to be loaded, and this is bad for unit testing.

In fact, loading and testing UIViews on one simulator (e.g. iPhone 4S) doesn’t guarantee that they would work on other devices (e.g. iPad), so I’d recommend removing “Host Application” from your Unit Test target configuration and running your tests without your application running on the simulator.

The interactions between the View and the Controller aren’t really testable with Unit Tests

With all that said, it might seem that Cocoa MVC is a pretty bad pattern to choose. But let’s assess it in terms of the features defined at the beginning of the article:

  • Distribution — the View and the Model are in fact separated, but the View and the Controller are tightly coupled.
  • Testability — due to the bad distribution you’ll probably only test your Model.
  • Ease of use — the least amount of code among the patterns considered. In addition, everyone is familiar with it, so it’s easily maintained even by inexperienced developers.

Cocoa MVC is the pattern of your choice if you are not ready to invest more time in your architecture, and you feel that something with higher maintenance cost is an overkill for your tiny pet project.

Cocoa MVC is the best architectural pattern in terms of the speed of the development.

MVP

Cocoa MVC’s promises delivered

Passive View variant of MVP

Doesn’t it look exactly like Apple’s MVC? Yes, it does, but its name is MVP (the Passive View variant). But wait a minute… does this mean that Apple’s MVC is in fact MVP? No, it’s not, because if you recall, in Apple’s MVC the View is tightly coupled with the Controller, while the MVP’s mediator, the Presenter, has nothing to do with the life cycle of the view controller. The View can be mocked easily, and there is no layout code in the Presenter at all; its responsibility is to update the View with data and state.

What if I told you that the UIViewController is the View?

In terms of the MVP, the UIViewController subclasses are in fact the Views and not the Presenters. This distinction provides superb testability, which comes at the cost of development speed, because you have to do manual data and event binding, as you can see from the example:

import UIKit

struct Person { // Model
    let firstName: String
    let lastName: String
}

protocol GreetingView: class {
    func setGreeting(greeting: String)
}

protocol GreetingViewPresenter {
    init(view: GreetingView, person: Person)
    func showGreeting()
}

class GreetingPresenter: GreetingViewPresenter {
    unowned let view: GreetingView
    let person: Person

    required init(view: GreetingView, person: Person) {
        self.view = view
        self.person = person
    }

    func showGreeting() {
        let greeting = "Hello" + " " + self.person.firstName + " " + self.person.lastName
        self.view.setGreeting(greeting)
    }
}

class GreetingViewController: UIViewController, GreetingView {
    var presenter: GreetingViewPresenter!
    let showGreetingButton = UIButton()
    let greetingLabel = UILabel()

    override func viewDidLoad() {
        super.viewDidLoad()
        self.showGreetingButton.addTarget(self, action: "didTapButton:", forControlEvents: .TouchUpInside)
    }

    func didTapButton(button: UIButton) {
        self.presenter.showGreeting()
    }

    func setGreeting(greeting: String) {
        self.greetingLabel.text = greeting
    }
    // layout code goes here
}

// Assembling of MVP
let model = Person(firstName: "David", lastName: "Blaine")
let view = GreetingViewController()
let presenter = GreetingPresenter(view: view, person: model)
view.presenter = presenter

MVP example

Important note regarding assembly

The MVP is the first pattern that reveals the assembly problem, which arises from having three actually separate layers. Since we don’t want the View to know about the Model, it is not right to perform assembly in the presenting view controller (which is the View), so we have to do it somewhere else. For example, we can make an app-wide Router service which is responsible for performing the assembly and the View-to-View presentation. This assembly issue arises and has to be addressed not only in the MVP but also in all the following patterns.


Let’s look at the features of the MVP:

  • Distribution — we have most of the responsibilities divided between the Presenter and the Model, with a pretty dumb View (in the example above the Model is dumb as well).
  • Testability — is excellent, we can test most of the business logic due to the dumb View.
  • Ease of use — in our unrealistically simple example, the amount of code is doubled compared to the MVC, but at the same time the idea of the MVP is very clear.

MVP in iOS means superb testability and a lot of code.

MVP

With Bindings and Hooters

There is another flavour of the MVP — the Supervising Controller MVP. This variant includes direct binding of the View and the Model, while the Presenter (the Supervising Controller) still handles actions from the View and is capable of changing the View.

Supervising Presenter variant of the MVP

But as we have already learned before, vague responsibility separation is bad, as well as tight coupling of the View and the Model. That is similar to how things work in Cocoa desktop development.

This is the same as with the traditional MVC. I don’t see a point in writing an example for the flawed architecture.

MVVM

The latest and the greatest of the MV(X) kind

The MVVM is the newest of the MV(X) kind, so let’s hope it emerged taking into account the problems the earlier MV(X) patterns were facing.

In theory, the Model-View-ViewModel looks very good. The View and the Model are already familiar to us, but the mediator is now represented by the View Model.

MVVM

It is pretty similar to the MVP:

  • The MVVM treats the view controller as the View
  • There is no tight coupling between the View and the Model

In addition, it does bindings like the Supervising version of the MVP; however, this time not between the View and the Model, but between the View and the View Model.

So what is the View Model in iOS reality? It is basically a UIKit-independent representation of your View and its state. The View Model invokes changes in the Model and updates itself with the updated Model, and since we have a binding between the View and the View Model, the former is updated accordingly.

Bindings

I briefly mentioned them in the MVP part, but let’s discuss them a bit here. Bindings come out of the box for OS X development, but we don’t have them in the iOS toolbox. Of course we have KVO and notifications, but they aren’t as convenient as bindings.

So, provided we don’t want to write them ourselves, we have two options:

  • one of the KVO-based binding libraries
  • one of the fully-fledged functional reactive programming frameworks, like ReactiveCocoa

In fact, nowadays, if you hear “MVVM”, you think ReactiveCocoa, and vice versa. Although it is possible to build the MVVM with simple bindings, ReactiveCocoa (or its siblings) will allow you to get the most out of the MVVM.

There is one bitter truth about reactive frameworks: “with great power comes great responsibility”. It’s really easy to mess things up when you go reactive. In other words, if you do something wrong, you might spend a lot of time debugging the app, so just take a look at this call stack.

Reactive Debugging

In our simple example, an FRP framework or even KVO would be overkill; instead, we’ll explicitly ask the View Model to update via the showGreeting method and use a simple property for the greetingDidChange callback function.

import UIKit

struct Person { // Model
    let firstName: String
    let lastName: String
}

protocol GreetingViewModelProtocol: class {
    var greeting: String? { get }
    var greetingDidChange: ((GreetingViewModelProtocol) -> ())? { get set } // function to call when greeting did change
    init(person: Person)
    func showGreeting()
}

class GreetingViewModel: GreetingViewModelProtocol {
    let person: Person
    var greeting: String? {
        didSet {
            self.greetingDidChange?(self)
        }
    }
    var greetingDidChange: ((GreetingViewModelProtocol) -> ())?

    required init(person: Person) {
        self.person = person
    }

    func showGreeting() {
        self.greeting = "Hello" + " " + self.person.firstName + " " + self.person.lastName
    }
}

class GreetingViewController: UIViewController {
    var viewModel: GreetingViewModelProtocol! {
        didSet {
            self.viewModel.greetingDidChange = { [unowned self] viewModel in
                self.greetingLabel.text = viewModel.greeting
            }
        }
    }
    let showGreetingButton = UIButton()
    let greetingLabel = UILabel()

    override func viewDidLoad() {
        super.viewDidLoad()
        self.showGreetingButton.addTarget(self.viewModel, action: "showGreeting", forControlEvents: .TouchUpInside)
    }
    // layout code goes here
}

// Assembling of MVVM
let model = Person(firstName: "David", lastName: "Blaine")
let viewModel = GreetingViewModel(person: model)
let view = GreetingViewController()
view.viewModel = viewModel

MVVM example

And again back to our feature assessment:

  • Distribution — it is not clear in our tiny example, but in fact the MVVM’s View has more responsibilities than the MVP’s View, because the former updates its state from the View Model by setting up bindings, while the latter just forwards all events to the Presenter and doesn’t update itself.
  • Testability — the View Model knows nothing about the View, which allows us to test it easily. The View might also be tested, but since it is UIKit-dependent you might want to skip that.
  • Ease of use — it has about the same amount of code as the MVP in our example, but in a real app, where you’d otherwise have to forward all events from the View to the Presenter and update the View manually, the MVVM would be much skinnier thanks to the bindings.

The MVVM is very attractive, since it combines benefits of the aforementioned approaches, and, in addition, it doesn’t require extra code for the View updates due to the bindings on the View side. Nevertheless, testability is still on a good level.

VIPER

LEGO building experience transferred into the iOS app design

VIPER is our last candidate, which is particularly interesting because it doesn’t come from the MV(X) category.

By now, you must agree that the granularity in responsibilities is very good. VIPER makes another iteration on the idea of separating responsibilities, and this time we have five layers.

VIPER

  • Interactor — contains business logic related to the data (Entities) or networking, like creating new instances of entities or fetching them from the server. For those purposes you’ll use some Services and Managers which are not considered as a part of VIPER module but rather an external dependency.
  • Presenter — contains the UI related (but UIKit independent) business logic, invokes methods on the Interactor.
  • Entities — your plain data objects, not the data access layer, because that is a responsibility of the Interactor.
  • Router — responsible for the segues between the VIPER modules.

Basically, a VIPER module can be one screen or a whole user story of your application: think of authentication, which can be one screen or several related ones. How small are your “LEGO” blocks supposed to be? That’s up to you.

If we compare it with the MV(X) kind, we’ll see a few differences in the distribution of responsibilities:

  • Model (data interaction) logic shifted into the Interactor with the Entities as dumb data structures.
  • Only the UI representation duties of the Controller/Presenter/ViewModel moved into the Presenter, but not the data altering capabilities.
  • VIPER is the first pattern which explicitly addresses navigation responsibility, which is supposed to be resolved by the Router.

Doing routing properly is a challenge for iOS applications; the MV(X) patterns simply don’t address this issue.

The example doesn’t cover routing or interaction between modules, as those topics are not covered by the MV(X) patterns at all.

import UIKit

struct Person { // Entity (usually more complex e.g. NSManagedObject)
    let firstName: String
    let lastName: String
}

struct GreetingData { // Transport data structure (not Entity)
    let greeting: String
    let subject: String
}

protocol GreetingProvider {
    func provideGreetingData()
}

protocol GreetingOutput: class {
    func receiveGreetingData(greetingData: GreetingData)
}

class GreetingInteractor: GreetingProvider {
    weak var output: GreetingOutput!

    func provideGreetingData() {
        let person = Person(firstName: "David", lastName: "Blaine") // usually comes from data access layer
        let subject = person.firstName + " " + person.lastName
        let greeting = GreetingData(greeting: "Hello", subject: subject)
        self.output.receiveGreetingData(greeting)
    }
}

protocol GreetingViewEventHandler {
    func didTapShowGreetingButton()
}

protocol GreetingView: class {
    func setGreeting(greeting: String)
}

class GreetingPresenter: GreetingOutput, GreetingViewEventHandler {
    weak var view: GreetingView!
    var greetingProvider: GreetingProvider!

    func didTapShowGreetingButton() {
        self.greetingProvider.provideGreetingData()
    }

    func receiveGreetingData(greetingData: GreetingData) {
        let greeting = greetingData.greeting + " " + greetingData.subject
        self.view.setGreeting(greeting)
    }
}

class GreetingViewController: UIViewController, GreetingView {
    var eventHandler: GreetingViewEventHandler!
    let showGreetingButton = UIButton()
    let greetingLabel = UILabel()

    override func viewDidLoad() {
        super.viewDidLoad()
        self.showGreetingButton.addTarget(self, action: "didTapButton:", forControlEvents: .TouchUpInside)
    }

    func didTapButton(button: UIButton) {
        self.eventHandler.didTapShowGreetingButton()
    }

    func setGreeting(greeting: String) {
        self.greetingLabel.text = greeting
    }
    // layout code goes here
}

// Assembling of VIPER module, without Router
let view = GreetingViewController()
let presenter = GreetingPresenter()
let interactor = GreetingInteractor()
view.eventHandler = presenter
presenter.view = view
presenter.greetingProvider = interactor
interactor.output = presenter

Yet again, back to the features:

  • Distribution — undoubtedly, VIPER is the champion in the distribution of responsibilities.
  • Testability — no surprises here: better distribution means better testability.
  • Ease of use — as you have already guessed, the two features above come at the cost of maintainability. You have to write a huge amount of interface code for classes with very small responsibilities.

So what about LEGO?

If, while using VIPER, you feel like you are building the Empire State Building from LEGO blocks, that is a signal that you have a problem. Maybe it’s too early to adopt VIPER for your application and you should consider something simpler. Some people ignore this and continue to make more work for themselves; I assume they believe that their apps will benefit from VIPER at least in the future, even if the maintenance cost is unreasonably high right now. If you believe the same, then I’d recommend you try Generamba, a tool for generating VIPER skeletons. Although for me personally it still feels like overkill.

Conclusion

We went through several architectural patterns, and I hope you have found some answers to your questions. I have no doubt that you realised there is no silver bullet, so choosing an architecture pattern is a matter of weighing tradeoffs in your particular situation.

Therefore, it is natural to have a mix of architectures in the same app. For example: you start with MVC, then realise that one particular screen has become too hard to maintain efficiently with MVC, and switch to MVVM, but only for that screen. There is no need to refactor the other screens for which MVC still works fine, because the two architectures are easily compatible and can coexist in one application.

Make everything as simple as possible, but not simpler — Albert Einstein

Thank you for reading! If you liked this article, please hit ‘Recommend’ (the ❤ button) or leave a comment :)

Our Bounty Program at Badoo

$
0
0

Many IT companies now have their own bounty programs (i.e. programs that root out vulnerabilities), and Badoo is no exception.

In this article, we’ll discuss how we launched and support our bounty program without a dedicated information security division. We’ll cover some of the problems we’ve encountered, how we solved them, and how we ended up with the program we use today. And, of course, we’ll recall some of the more interesting bugs that program participants have reported to us.

The bounty program has been running for three years, and participants from all over the world continue to report bugs to us.

We’d like to draw even more interest from foreign investigators, so we’ve opened our own page on the largest hacker portal, hackerone.com, and increased the rewards for finding vulnerabilities! Depending on the category, rewards range from £100 to £1000, and the grand prize is now £2000. The cap on rewards can go even higher if someone finds a vulnerability that presents a real threat to our users.

Badoo is one of the foremost social networking services in the world. Through Badoo, you can meet your better half, start up new friendships, or just find someone nearby to chat with.

Currently, Badoo has over 300 million registered users that speak more than 50 languages and live around the world. Our work is supported by around 3000 servers located in four data centers (in America, Europe, Asia, and Russia). We offer an assortment of apps for all popular mobile platforms, the mobile web, and web versions for desktop browsers. We are a highload project; at peak traffic we handle 70-80 thousand requests per second.

Given the scale of our project, our time and resources are understandably spread in many different directions, only one of which being information security.

It might strike some as strange then that we have no information security division or any specialists that handle our security issues exclusively.

Nevertheless, our users are very dear to us and we care about the security of their data. Although we don’t offer the features a bank does, such as an online payment system or a way to withdraw funds (though we do accept payments for our internal “Badoo credits” currency), we still have a lot of data that needs to be protected.

To further complicate the matter, we adhere to the standard business wisdom that any security efforts shouldn’t impinge on the customer’s ease of use. For example, we have a feature that allows users to access their account directly from an email link, without having to re-enter their username and password. From a data security point of view, this is risky, however we take all necessary precautions and employ mechanisms to protect user data in such cases. But, of course, no system is perfect.

It’s difficult to answer the question “Why don’t you have a dedicated security division?”, but we’d ask that you keep in mind that our project grew from a small start-up. And any start-up is initially focused on developing quickly and reaching certain business goals, rather than focusing primarily on things like quality, security, and other areas that are the realm of mature projects. For this reason, the majority of startups don’t get off the ground thinking about their testing division, and put off thoughts of an information security division till even later. This is to be expected.

At our company, we still employ certain methods and approaches that are more characteristic of a startup. Though we’ve dealt with information security from the very beginning (and our guys are consummate pros), we haven’t had a systematic approach to the issue. The day will almost certainly come when we will either employ dedicated specialists or create an entire information security division.

But for now, these are the security measures we take in addition to supporting the bounty program:

  • System administrators handle infrastructure, network, and related security issues. They started dealing with these issues earlier than all the other teams.
  • Developers check their code for security issues and potential errors (including during code review).
  • Testers also manually test for the security of individual features (for example, unauthorized access or code injection) during autotests and task validation. We regularly run checks using automated tools, primarily Skipfish and Acunetix.
  • We hold the highest level of PCI DSS certification. A whole set of information security measures is required by this standard: complete isolation of the development and code deploy processes from the payment-processing tools, regular penetration testing, regular infrastructure audits, and much more. Carrying out these measures has not only allowed us to achieve the highest category of the security standard, but also to maintain our certification for the past several years.
  • We’ve tried to design our code development system, framework, template engine, runtime environment, etc. to minimize both the potential for errors and the negative consequences if they do occur. A simple example: we patched our php-fpm so that it executes PHP code only from certain directories (with no write access, of course). This way, even if some third-party code manages to reach our servers, the risk of it being run is minimized.

Third-party security specialists have also helped us a lot. Some of them we reached out to, others reported bugs to us on their own volition and were rewarded for it. But the process itself hasn’t proceeded in an organized fashion, so we’ve decided to systematize it.

We first started to seriously consider having a permanent bounty program a few years ago, motivated on one hand by the challenge and its potential to propel us to a new level and, on the other hand, by the fact that this could help us figure out where we stand as far as security goes.

Preparation

It goes without saying that we didn’t start by immediately trying to launch the program, but rather by running a few internal checks to ensure us that we were ready for the next steps.

  • We made sure that access to databases and other resources was in order. We checked the startup rights, write rights, and other actions for system users on the servers. It goes without saying that this amounted to quite a bit of work and a lot of things had to be redone in order to satisfy the new requirements. We changed the API access in a few spots so that it’d be easier to monitor.
  • We strengthened our defense against XSS attacks in our templating system. Now everything is escaped by default, rather than only when a programmer explicitly asks for it.
  • We also went through a few stages of internal audits on our entire system, code and environment.
  • We put together tools for processing user reports. Launching a big program is not something we wanted to do right away. To start, we decided to experiment by launching it for a month. Given the time frame, we also didn’t want to spend much time messing around with tools. First, we organized a system to process reports in Google Groups (we use Google Mail at Badoo). The interface is very customizable, so it was easy to put together categories for things like whether the report was received yet, what category it was assigned, whether it had been processed yet, and if money had been paid out.
  • We decided to start with the main website in order to see how effective the program would be.
  • Also, we decided to stick with the Russian internet at first for a few reasons: we were already acquainted with a few top-notch Russian hackers who had asked to participate, and we wanted to avoid an international fail in the event that the program proved unsuccessful.

In the corporate site’s interface, we put together a page listing reports along with their statuses and categories. Descriptions of vulnerabilities were shown only after they’d been fixed. New reports could also be sent in through this page.

This format was very convenient: participants got an email from us with their report number and could track it in the list. They could also share the link to the page with others, raising their own karma.

For additional motivation, we created a participant rating system which was based not only on the number of reports from a given person, but also how critical they were. At the end of the month, we rewarded the leaders with special prizes.

This page also helped us point out duplicate entries. If a participant submitted a duplicate report, they got a notification containing the number of the report that had already been accepted. As a result, they could then track the status of that report.

As far as rating the bugs found by participants, we decided not to use any of the widely-used vulnerability assessment systems like OWASP’s. Instead, we assigned categories based on the potential harm that a given vulnerability might inflict on our users. In total, we created five categories, and standard rewards ranged from £50 to £500.

At first glance, this may seem like a strange approach, but some of the less critical vulnerabilities according to OWASP [www.owasp.org] could inflict the most harm on our users, so they needed to be ranked higher. To this effect, we even offered super-rewards of £2000 or more for discovering especially critical vulnerabilities. We decided to also encourage our investigators to think of the most effective vector of attack (rather than just pointing out the error’s existence). Further on in this article, we’ll give an example of a simple XSS-vulnerability that we paid a super-reward for because of the unusual and particularly interesting method of attack the participant indicated.

Contest

Thus, D-Day was underway. We launched the program, announced it via several sources, and sat back waiting for the results. Keep in mind that this program was only in effect for a month. It was a stressful, but very productive time for us. Here are the results of “security month” at Badoo:

  • We received over 500 reports of potential risks
  • About 50 of them turned out to be duplicates
  • Around 150 reports were just bugs or improvement suggestions that had nothing to do with security
  • A little over 50 of them concerned actual vulnerabilities
  • About half of the vulnerabilities were types of CSRF

The majority of bugs reported over the first several days were CSRF-related. The pages where users fill in information about themselves, upload photos and so on were the most affected. It’s not as if we had no defense against CSRF attacks at the time: many of our pages were protected using session tokens. Nevertheless, it turned out that not all user actions on the site were protected from CSRF threats.

We responded quickly and launched a major project to defend user data against CSRF in the course of a few days. All pages and web services were redone to check CSRF tokens by default. This way, we got rid of practically all CSRF vulnerabilities on our site.
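For readers unfamiliar with the mechanism, checking a CSRF token by default boils down to something like the following simplified sketch; this is not our actual framework code:

<?php
session_start();

// Issue a per-session token once and embed it in every form and web service call
if (empty($_SESSION['csrf_token'])) {
    // random_bytes() is PHP7; on PHP5, openssl_random_pseudo_bytes() can be used instead
    $_SESSION['csrf_token'] = bin2hex(random_bytes(32));
}

// Reject any state-changing request whose token doesn't match the session's
function verify_csrf_token()
{
    $sent = isset($_POST['csrf_token']) ? (string)$_POST['csrf_token'] : '';
    if (!hash_equals($_SESSION['csrf_token'], $sent)) {
        http_response_code(403);
        exit('Invalid CSRF token');
    }
}

if ($_SERVER['REQUEST_METHOD'] !== 'GET') {
    verify_csrf_token();
}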

Interesting bugs

The top three most embarrassing bugs that we discovered during Security Month were the following:

  • Our app contains a “credits” system: users exchange real money for our internal currency, which can then be used to buy certain services. The error that was found was a very stupid one and had been in our system for quite some time, yet we wouldn’t have found out about it if we hadn’t launched the bounty program. The main problem was that, despite our precautions, a mistake in the code caused the number of credits the user should receive after payment to be taken directly from the form. So by changing these values in the page’s HTML code, a hacker could pay a small amount and then credit their account with a much higher figure than they were due. We analyzed the transactions and paid a respectable sum to the person who discovered this vulnerability (despite the fact that no one else had taken advantage of it).
  • The second bug was almost as simple. It concerned a handler that was not validating a parameter value correctly. This parameter identifies users when they make changes to their data. As a result, when user_id was changed in a request, it was possible to change certain information belonging to other users. As you can see, both of these errors are simple, but they were given the highest rating.
  • And the third bug? Users can link their external social network accounts to their Badoo account, and then log in from these other accounts. Someone found an error in the linking mechanism that allowed one to link their external social network account to the Badoo account of a different user, thus gaining access to someone else’s profile.

What’s next?

During the contest, we learned to respond quickly to participant reports (within a few hours), fix the bugs, and create a separate project dedicated to defense against CSRF-attacks. It was clear that this program was important and essential. The time and resources spent preparing for and implementing it were more than justified. So we decided to take things further and began preparing for a permanent security vulnerability search program. First we just needed to change the format a bit.

During the contest month, we learned how a convenient report processing flow should look and decided to switch it over to JIRA (all our company’s tasks are handled using this system).

The simple flow in Google Groups looked like this:

Participant reports came to us from one of two paths: either via the form on the corporate site or via the special email address. The contest jury carefully evaluated reports and, in the case of real threats, sent them to specific teams to be dealt with. Our developer relations manager helped us handle communications with participants at this stage, although many of the emails could have been generated automatically.

The flow in Jira, which we put together for the permanent program, looks a bit different and allows for automated answers to be sent to participants.

We also decided to add mobile apps to the permanent program.

Around this time we ended up on the Bugcrowd bounty list, which upped our popularity and helped draw the attention of investigators from all over the world. Having made our big appearance, we proceeded to launch the program.

Results of the permanent program

Unfortunately, results of the permanent program haven’t proved to be quite as impressive as the contest month’s, as we receive more irrelevant vulnerability reports than we did before. Nevertheless, we were tipped off to several interesting bugs during the first year of the “big” program.

  • We were sent about 870 reports
  • About 50 of these concerned actual security vulnerabilities
  • More than 30 of the reports sent during the first year turned out to be duplicates
  • More than 20 of the bugs concerned mobile apps

As we expected, our program generated a lot of buzz, and the number of useful reports we received was not insignificant, so all in all, we are satisfied with the results.


Interesting bugs

Many of the interesting bugs found during the first year concerned our mobile apps. We were, of course, grateful to find out about them, as this area is new territory for developers and hackers alike.

Here are the top three bugs found during the first year of the program:

  • A special prize was awarded for the discovery of a vulnerability with a very atypical attack vector. We use a comet server to send messages of various types over already open connections in many places. This technology is also used to show “pencils” in messengers, i.e. the standard indication that someone is currently typing to you. This type of message contained an unused field left there from past troubleshooting, and it was possible to enter arbitrary data in it. That data was transmitted “as is” to the user receiving your message and was processed there as HTML. This opened the door to atypical XSS, content spoofing, DDoS and a number of other attacks. To top it off, the victim didn’t have to do anything other than open their messenger. Given that this vulnerability could have been exploited on a mass scale, we decided to award a special prize for its discovery. The fix was quick and simple: we just deleted the unused field. Then we checked all analogous areas.
  • The second interesting error was found in Badoo’s mobile app for Android. Our developers had introduced an intriguing hack at one point: to speed up rendering, they registered their own handler via addJavascriptInterface in the Android API on screens using WebView, even though it did nothing other than get instantiated. In the case of a MitM attack (when you can’t completely trust the data arriving at the client), an attacker could inject JavaScript that reached this interface, and arbitrary code could be executed on client devices.
  • The third bug (sent to us by the same person as the previous one) concerned the cache loader in our Android app not adequately validating the path to the cache itself. As a result, it could be used to read app files (after all, the loader runs with the same permissions as the app), including authorization keys to external systems (for example, keys used to log in through Facebook). A sketch of the kind of path check that closes this hole follows below.
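
By way of illustration, the fix for this class of bug boils down to canonicalising the requested path and refusing anything that escapes the cache directory. The sketch below is hypothetical and is not our actual loader code.

import java.io.File;
import java.io.IOException;

final class CachePathValidator {
    private CachePathValidator() { }

    static File resolveInsideCache(File cacheDir, String requestedPath) throws IOException {
        File resolved = new File(cacheDir, requestedPath).getCanonicalFile();
        // getCanonicalFile() collapses ".." segments and symlinks, so a prefix
        // check is enough to reject traversal outside the cache directory.
        if (!resolved.getPath().startsWith(cacheDir.getCanonicalPath() + File.separator)) {
            throw new IOException("Path escapes cache directory: " + requestedPath);
        }
        return resolved;
    }
}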

All in all, we feel that the program has proved very rewarding and effective. We get expert feedback on our apps and sites every day from hackers all over the world and work to improve our services based on the latest findings in the field of information security. We value the trust of our users and aim to do everything we can to protect their information.

We’d love for you to participate in our program and apply your skills to seeking out bugs on the sites and apps of other companies. The process itself is often very entertaining and you will not only come away from it with newly-acquired knowledge, but also some extra cash. We pay in pounds sterling and recently increased the prize amount substantially.

The best way to send us bug or vulnerability reports is through the Hackerone platform. Happy bug hunting!

Ilya Ageev,
Head of QA, Badoo.

PS: While we were writing this article, Hackerone let us know about more interesting vulnerabilities, which we’ll let you know about in a later post.

Pardon My French - Introducing Chateau


Being a social networking platform, providing a great chat experience is at the core of what we do at Badoo. However, the meaning of “a great chat experience” is constantly evolving, and the major chat applications keep adding new features to stay competitive and enhance user experience.

Chateau Example Screenshot

It’s the same for us at Badoo, and as we’ve added more and more features to our chat, our existing code base and architecture have struggled to keep up with the demands. What was once clean and well-tested code has grown in unexpected ways while accumulating technical debt. With this problem in front of us we were faced with a choice that is familiar to all developers: rewrite or refactor?

In the end we opted to rewrite, and our decision was based on several good reasons.

  1. The great success of Chatto (Badoo’s chat framework for iOS) gave us confidence and a good idea of what we could achieve.
  2. During the years since our existing chat codebase was written, several architectural concepts have gained popularity for Android, some of which could greatly help us simplify the code.
  3. Our commitment to open source. We are always looking for open source projects to contribute to as well as create, and this was a good opportunity to fill a mostly empty niche for Android.

With that decision out of the way there were still many remaining questions. What should our architecture look like? How, when and what should we open source? What exactly did we want to achieve with this project?

Setting the goals

From the start we already had a pretty good idea of what our internal needs and requirements were. However, since our goal from the start was to make Chateau an open source project, we also needed to keep flexibility and extendability in mind whenever we made a design decision. This was reflected in our design goals:

  • Easy to extend: It must be easy to add new features (e.g. GIFs, stickers, voice messages) without affecting other chat functionality.
  • Easy to integrate: It should be easy to integrate the framework into your application, independent of the type of backend and architecture used.
  • Easy to understand: It should be easy to use and work with the code base for someone who is not part of our development team.
  • Easy to test: Chat has many complex user interactions as well as error cases. Having easy to test code (both using unit and integration tests) is critical to support adding features as well as refactorings.
  • High performance: The framework must not introduce abstractions or patterns that add a noticeable performance overhead.

To make this a reality though we needed both the right tools and a great architecture…

An architecture fit for a Chateau

Even though Chateau was written from scratch, its architecture encompasses years of experience of writing chat components and features for Badoo.

Chateau Architecture

Since we’ve already adopted the Model-View-Presenter (MVP) pattern in our other applications and it had allowed us to create good testable code, it was a natural choice to use it for Chateau as well. Of course, not all MVP implementations are alike, and we still needed to pick a flavor of MVP (there are quite a few out there in the wild). For more information on what we chose to use take a look at our documentation.

The other piece of the puzzle was Clean Architecture, a concept put forward by Robert C. Martin (Uncle Bob). By applying this principle we divide the application into several discrete layers, where all communication between the layers must adhere to the Dependency Rule: dependencies in the code must always point downwards (here the upper layer is defined to be the UI, and as we work our way down we pass through Views, Presenters, Use cases and Repositories/Data sources).

For us, the main benefit of adopting Clean Architecture was similar to what we gained from using MVP in the upper layers of the app. It allowed us to create discrete components that could be tested individually (using fast running unit tests instead of Android tests). As an additional benefit it also made Chateau very pluggable, in the sense that you can easily replace the data storage, UI or networking with an alternative implementation.
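
To make the layering concrete, here is a minimal sketch of the Dependency Rule; the names are hypothetical and are not Chateau’s actual API. The presenter depends on a use case, the use case depends only on a repository interface, and the concrete data source can be swapped out without touching the layers above it.

import java.util.List;

interface MessageRepository {                       // data layer contract: the lowest layer
    List<String> loadMessages(String conversationId);
}

class LoadMessagesUseCase {                         // domain layer: pure, easily unit-tested logic
    private final MessageRepository repository;

    LoadMessagesUseCase(MessageRepository repository) {
        this.repository = repository;
    }

    List<String> execute(String conversationId) {
        return repository.loadMessages(conversationId);
    }
}

class ConversationPresenter {                       // presentation layer: depends only on the layer below
    private final LoadMessagesUseCase loadMessages;

    ConversationPresenter(LoadMessagesUseCase loadMessages) {
        this.loadMessages = loadMessages;
    }

    void onConversationOpened(String conversationId) {
        List<String> messages = loadMessages.execute(conversationId);
        // hand the messages to a View interface; the View never calls downwards itself
    }
}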

Lessons learned

Developing Chateau has been a wild ride so far (and we’re not done yet!) but also a lot of fun. Along the way we’ve laughed, cried and learned a lot of valuable lessons. Here are some of them.

  • Striking the balance between providing a complete chat framework (with UI, caching and networking code) and providing something that you can use as a basis for your own custom chat is hard. While it’s nice to have something that is ready “out of the box”, you still need to be able to completely swap out parts of the stack (like UI or networking). For Chateau we chose to use the Example app as our way of providing a full chat experience.
  • Learning RxJava at the same time as you are developing a framework that makes use of it can be a challenge and will lead to plenty of refactorings along the way (another great reason to have good code coverage).
  • It pays off to plan ahead when considering how your library should be built and used, to make sure that the process fits both internal and external needs. We wanted to publish the libraries on jCenter for external distribution while still including them as a regular gradle project dependency when building our apps (to make cross-module refactorings easier). This also involved using git subrepo to allow the Chateau libraries to be just another folder in our main git repository.

What’s next?

At the moment there is still some functionality (i.e. UI/View implementations and support for specific backends) that is not included in Chateau itself and needs to be added or implemented in the application where it’s being used. In the future we are aiming to reduce the amount of code needed to fully integrate Chateau. Ideally we would like Chateau to be an almost drop-in component, given that your chat backend is supported.

We are also working on creating a better testing (or mock) backend that can be used to get up and running with the Example app. Keep an eye on our tech blog as well as the project GitHub page for updates.



Crazy Agile API


There are lots of articles and books on how to properly design APIs, but only a few of them cover the case of a constantly changing API. In fast-moving companies, releasing many features per week or even per day makes changing the API a necessity. This article will explain how we’ve handled this at Badoo, some mistakes we’ve made along the way and lessons we’ve learned.

Firstly, let me provide a general description of how things work at Badoo, who the API consumers are and why the API changes so frequently.

API and Consumers

As Wikipedia states, “…API is a set of routines, protocols, and tools for building software applications. The API specifies how software components should interact…”. Our Badoo API is a set of data structures (messages) and values (enum values) that the client and the server send to each other. It is written as Google protobuf definitions and stored in a separate git repository.

We have 6 consumer platforms for our API - the server and 5 clients: Android, iOS, Windows Phone, Mobile Web and Desktop Web. We also have multiple apps which all use the same API. To make them work together we have quite a big API in place; here are some numbers:

  • 450 messages, 2665 fields
  • 135 enums, 2096 values
  • 125 feature flags that can be controlled from the server
  • 165 functionality flags, which we call supported features

This feature should’ve been done yesterday!

At Badoo, we value fast delivery of features. The logic behind it is simple: the faster something is on production the sooner the users will get value from it. We also run loads of A/B tests in parallel but that’s out of the scope of this article.

Going from idea to production can take as little as a week; this includes writing the spec, getting the designs, developing the API, writing tech documentation, implementing it on the various clients, testing, and releasing. On average it takes about a month though. This doesn’t mean we release one feature per month, as we work on loads of features in parallel.

How to do the impossible?

Let’s say that the product owner has a new idea and comes to the API team asking us to update the protocol so that all clients can start implementing it.

First of all, there are usually a lot of people working on a feature:

  • Product owner
  • Designers
  • Server side developers
  • Client side developers for different platforms
  • QA
  • Data analysts

How can we ensure that we all understand each other and speak the same language? We need requirements. Proper requirements. To address this, product owners (POs) create a Product Requirements Document (PRD). Basically they create a wiki page with different requirements, use cases, flow descriptions, design sketches, etc.

Then, based on the PRD we can start implementing the required API changes.


Protocol design approach

There are endless ways to split the responsibilities between the client and the server. They range between “Server implements most of the logic” and “Client implements most of the logic”. Let’s examine pros and cons of each approach:

  1. “Server implements most of the logic” - The clients act more like a View from MVC pattern.

       + Implement functionality only once for all platforms - on the server.

       + Can update server-side logic and lexemes without changing and releasing client app (Big plus when it comes to native apps).

       - More complex protocol - usually more steps in the flow and more data passed.

       - If the behaviour is different on different clients, the server still has to maintain a separate implementation of the feature for every client and its supported versions.

       - Can affect user experience on slow/unstable networks.

       - Keeping business logic on the server side makes it hard or even impossible for some features to have some offline behaviour.


   2. “Client implements most of the logic” - Client has all the logic and uses server as data source (like many public data-oriented APIs).

       + Better user experience due to less waiting on server replies.

       + Works much better offline and on poor connections.

       + Caching is much simpler.

       + Easier to implement different behaviour on different platforms and versions of the clients if required.

       + Less complex flows - teams can be more autonomous.

       - Slower - every client has to make its own complete implementation of all the logic, rather than it being implemented only once on the server.

       - To make even minor changes, all clients need to be updated separately.

       - More bug-prone - every client has its own complete implementation of all the logic.

On the one hand, using the first approach means you only need to implement certain business logic once, on the server, and it is then used by all the clients. On the other hand, specific platforms often have their own peculiarities: specialised lexeme structures, different functionality and features implemented at different points in time. This makes it easier to keep the protocol more data-oriented, so that clients have some freedom to do things their own way.

It turns out that there is no silver bullet, as usual, and at Badoo we balance the two approaches to maximise the advantages.


Technical documentation

Several years ago when we were smaller we only had the server and two clients (Android, iOS) using the protocol. It was easy to communicate verbally on how to use it, so the documentation only included some generic logic in the comments for proto definitions. Here is an example of what it looked like:

Later three more client platforms joined - Windows Phone, Mobile Web and Desktop Web. Communicating everything verbally was very costly so we started writing better documentation. This documentation was more than just comments about fields but it included an overview of the feature, flow diagrams, screenshots and message examples. Here is an example of what it looks like:

All six platforms and QA use this documentation as a technical reference alongside the PRD when starting to implement features.

The documentation is not only used for new feature implementation but also for redesigns or refactorings when historical information is required. We can now easily tell developers to RTFM, which saves the company a great deal of time. Without it, the clients would have to check how things are implemented on the server to understand what is or isn’t safe to do. It’s especially risky when it comes to edge cases. It’s also a great way for new joiners to understand how things are supposed to work.


Tools we use to make our documentation sexy

We write technical documentation in reStructuredText format and keep it in a git repo together with the protocol definitions. Using Sphinx we compile it to HTML pages that all developers can access on our company’s internal network.

The documentation is split into several major sections that cover different aspects of the protocol and other stuff:

  • Protocol - contains documentation generated from comments in the proto definitions
  • Product features - contains technical documentation on how features should be implemented: flow diagrams, etc.
  • Common - contains docs about protocol and flows not specific for particular product features.
  • App specifics - Badoo has more than one product, this section highlights the differences between the products. As mentioned above the protocol is shared.
  • Statistics - general description on how analytics and stats should be processed and collected in our apps.
  • Push - documentation regarding native push notifications
  • Architecture and infrastructure - top-level structure of the protocol, binary protocol formats, ab-testing framework, etc.


OK, we did the API changes. Are we good to go?

At this point we have designed the protocol and written the technical documentation based on the product specifications. The PRD and the protocol get reviewed. In the first stage at least one API team member reviews the changes. Then every platform that will be implementing the feature has to review the documentation and protocol and approve them.

This stage helps us to get feedback on:

  • How the changes correspond to the existing code on each platform
  • Whether the proposed protocol is optimal for each of the platforms. We have had cases in the past where we’ve tried to reuse a message, as it was the cleanest solution, but it affected server performance.
  • Whether the change is backwards compatible.

After the review is complete, clients and server can start implementing the feature itself.

But is our job done? No!

As we work in an agile environment, situations often occur where a product owner will want to change the workings of a feature that is being implemented, but at the same time they want to release what we have now. Or even better, sometimes they decide to change this feature on one platform and leave it as is on the others.


A feature is like a baby - It keeps evolving.

Features evolve. We run A/B tests and/or learn from the data we get after releasing new features. Looking at the data we understand that the feature needs some tweaking. So the product team changes the PRD. This creates a “problem”. The newly revised PRD no longer matches the protocol definition and documentation. Moreover, some platforms may have already implemented the previous PRD, where some have yet to start. To overcome this problem we decided to version the PRDs. Let’s say the feature is released on one platform as version R3 of the PRD. After some time the product owner decides to tweak the feature and updates the PRD to version R5. At this point we also need to update the protocol and technical documentation to match the updated PRD.

For tracking PRD updates we use versioning in Confluence (Atlassian’s wiki-like product). In the protocol technical documentation we add links to a specific revision of the PRD by simply adding ?pageVersion=3 to the wiki page address, or we obtain a diff link through the page history. This way every developer knows which version of the PRD a given part of the protocol was designed for.

PRD diffs are treated like a new feature. Product owners accumulate changes (R1, R2, …) until they decide they should go into development. They create an API task, where the changes to the protocol are implemented, and the changes then go to the platforms as a single unit of work. When the product team adds the next set of changes, another API ticket is created and the change is implemented on the platforms in the same way:

After we’ve got the link to PRD diff, we start the flow again from the protocol changes and so on. It is a bit more complicated as we still need to support previous functionality and not break already released client apps that have versions R3. Basically at our disposal we have several levels of protocol changes management.


Protocol changes management

In the previous section we looked at PRD versioning. In order to implement those changes in the API we need to examine our options for protocol versioning. There are three main options (levels) here, each of which has its own pros and cons.


Protocol level

This approach is widely used for slow-changing public APIs. When a new version of the protocol is released, all the clients are supposed to start using it instead of the old one. We can’t use it because different client platforms have different sets of features implemented. Let’s say we have a set of protocol versions:

V1. Supports features A, B, C
V2. Supports features B’, C and D, where B’ is an updated feature B (which has a different flow)

So if the client needs to implement feature D, it will also have to upgrade feature B to B’, which might be not needed at the moment.

At Badoo we never used this versioning approach. For our situation, the two options described below are a better fit.


Versioning by message

Another approach would be to create a new message (data structure in protobuf) with an updated set of fields when the feature has changed. This works quite well if the requirements change significantly.

Code sample:

At Badoo every user has albums. In the past users could create their own albums and put photos in them:

message AddPhotoToAlbumV1 {
    required string album_id = 1;
    required string photo_id = 2;
}

Later on our product team decided to have only 3 predefined album types: my photos, other photos and private photos. For the clients to be able to distinguish between those types, we prefer using an enum so the next version of the message may look like this:

message AddPhotoToAlbumV2 {
    required AlbumType album_type = 1;
    required string photo_id = 2;
}

This approach sometimes works well, but be careful! If the change is not implemented on all platforms in a short time, you will end up supporting (adding more changes to) both old and new versions, which will create even more mess.

Field/value level

If it is possible, we reuse the same message/enum, perhaps deprecating some fields/values in it or adding new ones. This is probably the most common approach in our protocol.
For example:

message AddPhotoToAlbum {
    optional string album_id = 1 [deprecated = true];
    optional string photo_id = 2;
    optional AlbumType album_type = 3;
}

In this case clients can keep using the same message, but new client versions can switch to album_type instead of album_id.
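
For illustration, here is a hedged sketch of what that switch can look like using protobuf-java style builders and accessors. The AlbumType.MY_PHOTOS value and the surrounding helper class are assumptions, not part of the real schema.

// Hypothetical helpers around the generated AddPhotoToAlbum message.
final class AlbumRequests {

    // New clients fill in album_type and simply leave the deprecated album_id unset.
    static AddPhotoToAlbum buildRequest(String photoId) {
        return AddPhotoToAlbum.newBuilder()
                .setAlbumType(AlbumType.MY_PHOTOS)
                .setPhotoId(photoId)
                .build();
    }

    // Server side: prefer the new field and fall back to the deprecated one for old clients.
    static String resolveAlbumKey(AddPhotoToAlbum request) {
        return request.hasAlbumType() ? request.getAlbumType().name()
                                      : request.getAlbumId();
    }
}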

On a side note, we always use optional fields. This gives us the flexibility to deprecate fields. (Google reached the same conclusion.)


Supporting protocol changes

Our protocol is shared between our server and our 5 client platforms. As our clients release a new version each week (resulting in ~20 app versions per month, all of which can behave differently and use different parts of the protocol), we can’t just create a different protocol version for every app release. Such versioning would require the server to support thousands of combinations of app behaviours, which is far from ideal.

A better option, and the one we decided to implement, is for each client to declare at startup which parts of the protocol it supports. This allows the server to be client agnostic when it comes to feature support and simply rely on the list of supported features provided by the client.

E.g. a while ago we implemented the “What’s New” feature, which allows us to inform the user about new features in the app. Clients that support it send the server a SUPPORTS_WHATS_NEW flag. The server then knows that it can send What’s New messages to the client and that they will be displayed correctly.
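
The sketch below illustrates the idea rather than the real protocol: apart from SUPPORTS_WHATS_NEW, which is mentioned above, the names are made up. The point is that the server stays client agnostic by consulting the feature set declared by each client at startup.

import java.util.EnumSet;

// Hypothetical names; SUPPORTS_GIFTS is invented for the example.
enum SupportedFeature { SUPPORTS_WHATS_NEW, SUPPORTS_GIFTS }

final class ClientCapabilities {
    private final EnumSet<SupportedFeature> features;

    ClientCapabilities(EnumSet<SupportedFeature> features) {
        this.features = features;
    }

    boolean supports(SupportedFeature feature) {
        return features.contains(feature);
    }

    // Server side: only push optional content if this client version declared support for it.
    void maybeSendWhatsNew() {
        if (supports(SupportedFeature.SUPPORTS_WHATS_NEW)) {
            // send the What's New message over the open connection
        }
    }
}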


How to keep the protocol clean?

For a public API the deadline is usually fixed and the old part of the API stops working on that date. At Badoo this is not always possible, as tasks to implement new features often have a much higher priority than removing old ones. Thus we have a 3-stage process for this.

In the first stage, as soon as it is clear that a part of the protocol is to be removed, it is marked as “deprecated” and tickets are created for all client platforms to remove the code that uses it.

During the second stage, all the clients remove usage of the deprecated protocol from their code. At this point the server can’t remove its code, as some older versions of the apps can still be in production.

During the last stage, when all clients have removed their code and no production versions that use the deprecated protocol are left, it can be removed from the server code and from the protocol itself.

Patience and people!

Above we presented several technical and organizational approaches that we have adopted or invented here at Badoo. However we haven’t talked at all about communication. Communication is 80% of our work. It is very important that you have people on your side to move things fast. Luckily many of our developers support us with what we need as they remember well the pain associated with non-standardized solutions across platforms.

We’ve realised that a well-documented API also helps non-developers understand it and its development workflow. QA uses it to improve testing and our product team uses it as a reference to understand what can be done with minimal protocol changes.

Conclusion

When designing a protocol and the processes around it, you need patience and pragmatism. The protocol and process have to cater for all combinations of teams, versions and platforms, dealing with legacy clients and more. Nevertheless it is a very interesting and challenging task. With little literature on how to design fast-changing APIs, we hope you find this article interesting and that it has given you some useful insights on how to make this task a little easier.

Thank you for reading and any comments are more than welcome.

Ivan Biryukov - Mobile Architect
Orene Gauthier - Head of Mobile Engineering

Collection and Analysis of Daemon Logs at Badoo


We have a few dozen homebrew daemons at Badoo. Most of them are written in C, one in C++ and five or six in Go. They run on about a hundred servers in four data centres.

At Badoo the monitoring division is responsible for keeping track of and sorting out problems with daemons. Our staff use Zabbix and scripts to check that the service has launched and is responding to requests. Additionally, the department examines statistics for daemons and the scripts that work with them, looking for anomalies, sudden spikes etc.

Until recently, we had been missing an important part: collection and analysis of the log files written locally by each daemon. This information can help either to catch a problem at a very early stage, or to understand the reasons for a failure after the fact.

So we designed a system to handle this task and are excited to share the details with you. I’m certain that some of you reading this will face a similar task, and we hope that this article will help you avoid some of the errors that we made.

Choice of tools

From the outset we decided against using a cloud system because we have a policy about keeping data inside the company whenever possible. After analysing some popular tools, we concluded that the three following systems best suited our needs:

Splunk

We tried out Splunk first. Splunk is a turnkey system, a proprietary solution whose cost depends on how much traffic the system deals with. We already use this system in the billing department and our colleagues are very pleased with it.

We piggy-backed on their installation, but soon realised that our traffic exceeds the limit we are paying for.

One thing we took into account was that some of our colleagues complained about the complexity and non-user-friendliness of the UI when testing. Though our billing colleagues were already versed in Splunk and had no problems with it, we still took the UI complaints seriously since we want our system to be used actively.

From the technical POV Splunk completely suited our needs. But its cost and awkward interface forced us to look further.

ELK: Elastic Search + Logstash + Kibana

ELK was next on our list. ELK is probably the most popular system to date for collecting and analysing logs. And this is not really surprising since it is free, easy to use, flexible and powerful.

ELK consists of three components:

  • Elastic Search: A data storage and retrieval system based on the Lucene engine
  • Logstash: A “Pipe” with a bunch of features that send data (possibly processed) to Elastic Search
  • Kibana: A web interface for searching and visualising data from Elastic Search

Getting started with ELK is very simple: you just have to download three archives from the official site, unzip them and run a few binaries. The system’s simplicity allowed us to test it out over a few days and realise how well it suited us.

It really did fit like a glove. Technically we can implement everything we need, and, when necessary, write our own solutions and build them into the general infrastructure.

Despite the fact that we were completely satisfied with ELK, we wanted to give the third contender a fair shot.

Graylog 2

Overall, Graylog 2 is very similar to ELK: the code is open source, it’s easy to install, and it also uses Elastic Search and gives you the option to use Logstash. The main difference is that Graylog 2 is ready to use “out of the box” and designed specifically to collect logs. Its end-user readiness is reminiscent of Splunk. It has an easy-to-use graphical interface with the ability to customise line parsing directly in your web browser, as well as access restrictions and notifications.

Nevertheless we concluded that ELK is a much more flexible system that we could customise to suit our needs and whose components could be changed out easily. You don’t want to pay for Watcher - it’s fine. Make your own. Whereas with ELK all the components can be easily removed and replaced, with Graylog 2 it felt like removing some parts involved ripping out the very roots of the system, and other components could just not be incorporated.

So we made our decision and stuck with ELK.

Log delivery

At a very early stage we made it a requirement that logs have to both end up in our system and remain on the disk. Log collection and analysis systems are great, but any system experiences delays or malfunctions. In these cases, nothing surpasses the features that standard Unix utilities like grep, AWK, sort etc. offer. A programmer should always be able to log on to the server and see what is happening there with their own eyes.

There are a few different ways to deliver logs to Logstash:

  • Use utilities available from the ELK set (Logstash-forwarder, or (as of recently) Beats). These are separate daemons that monitor files on disk and pipe their contents into Logstash.
  • Use our own utility called LSD, which we use to deliver PHP logs. This is also a separate daemon that monitors log files in the file system and pipes them in where they need to go. On one hand, all the problems that could occur in LSD were taken into account and addressed while sending huge quantities of logs from a vast number of servers, but, on the other hand, the system is too focused on PHP-scripts, which meant that we would have to modify it.
  • Send logs to the syslog (which is the standard in the Unix world) alongside recording to disk.

Despite the shortcomings of this last approach, it was so simple that we decided to try it.

Architecture

Server and rsyslogd

We sketched out a plan that seemed reasonable to us with the help of the system administrators. This involved putting one rsyslogd daemon on each server, one main rsyslogd daemon per platform and one Logstash per platform, plus one Elastic Search cluster located closer to us, i.e. in the Prague data centre, which is closer to Moscow.

Each server looks like this:

Because we use Docker in some places at Badoo, we planned to mount the /dev/log socket inside the container using built-in features. This is how the final architecture looked:

This architecture seemed to work well at preventing data loss: in case of problems, each rsyslogd daemon will cache messages on disk and send them later. The only data loss that could occur would be if the very first rsyslogd daemon didn’t work. We knew the risks and decided that, for logs, it was not worth spending our time on these small niggles at this point in time.

The format of the log line and Logstash

Logstash is a pipeline where lines of text are sent. They are parsed internally and then go into Elastic Search in a form that allows fields and tags to be indexed.

Almost all our services are built using our own libangel library, which means that they all have the same log format that looks like this:

Mar 04 04:00:14.609331 [NOTICE] <shard6><16367> storage_file.c:1212
storage___update_dump_data(): starting dump (threaded, updating)

The format consists of a common part, which is the same everywhere, and parts that the programmer puts in when calling one of the logging functions.

The general section includes the date, time down to microseconds, log level, tag, PID, the file name and line number in the sources, and the name of the function, i.e. the most basic things.

Syslog then adds its own information: the time, the PID, the server hostname, and the so-called ident. Usually this is just the program name, but really anything can go there.

We standardised “ident” as the daemon’s name, secondary name and version. For example, meetmaker-ru.mlan-1.0.0. Thus we can distinguish logs from various daemons, as well as from different types of single daemon (for example, a country or replica) and have information about the daemon version that’s running.

Parsing this type of message is fairly straightforward. I won’t show examples of config files in this article, but it basically works by biting off small chunks and parsing parts of strings using regular expressions.
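
In production this parsing lives in Logstash filter configs, which I’m not showing here; the Java sketch below merely illustrates the “bite off small chunks with regular expressions” idea on the common part of a libangel line (the field names are my own).

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LibangelLineParser {
    private static final Pattern COMMON_PART = Pattern.compile(
            "^(\\w{3} \\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{6})"   // date and time with microseconds
            + " \\[(\\w+)\\]"                                   // log level
            + " <([^>]*)><(\\d+)>"                              // tag and PID
            + " (\\S+):(\\d+)\\s+"                              // source file and line number
            + "(\\w+)\\(\\): (.*)$");                           // function name and free-form message

    static void parse(String line) {
        Matcher m = COMMON_PART.matcher(line);
        if (!m.matches()) {
            // mirror the real pipeline: tag unparsable lines so they can be counted and searched
            System.out.println("_parse_failed: " + line);
            return;
        }
        System.out.printf("time=%s level=%s tag=%s pid=%s file=%s:%s func=%s msg=%s%n",
                m.group(1), m.group(2), m.group(3), m.group(4),
                m.group(5), m.group(6), m.group(7), m.group(8));
    }
}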

If any stage of parsing fails, we add a special tag to the message, which allows you to search for such messages and monitor their number.

A note about time parsing: we tried to take different options into account, and by default the final time is the time from libangel (basically the time when the message was generated). If for some reason this time can’t be found, we take the time from syslog (i.e. the time when the message reached the first local syslog daemon). If, for some reason, this time is also not available, then the message time will be the time the message was received by Logstash.

The resulting fields go in Elastic Search for indexing.

ElasticSearch

Elastic Search supports cluster mode where multiple nodes are combined into a single entity and work together. Due to the fact that each index can replicate to another node, the cluster remains operable even if some nodes fail.

The minimum number of nodes in a fail-proof cluster is three - the first odd number greater than one. This is because a majority of the nodes needs to remain reachable when a network split occurs in order for the internal algorithms to work. An even number of nodes will not work for this.

We have three dedicated servers for the Elastic Search cluster and configured it so that each index has a single replica, as shown in the diagram.

With this architecture if a given node fails, it’s not a fatal error, and the cluster itself remains available.

Besides dealing well with malfunctions, this design also makes it easy to update Elastic Search: just stop one of the nodes, update it, launch it, rinse and repeat.

The fact that we store logs in Elastic Search makes it easy to use daily indexes. This has several benefits:

  • If the servers run out of disk space, it’s very easy to delete old data. This is a quick operation, and there is a tool called Curator that is designed for this task.
  • When searching over a time-span of more than one day, the search can be conducted on several daily indexes in parallel. Furthermore, the search can be run simultaneously on multiple nodes.

As mentioned earlier, we set up Curator in order to automatically delete old indexes when space is running out.

The Elastic Search settings include a lot of details associated with both Java and Lucene. But the official documentation and numerous articles go into a lot of depth about them, so I won’t repeat that information here. I’ll only briefly mention that Elastic Search uses both the Java heap and the OS file system cache (for Lucene). Also, do not forget to set “mappings” tailored to your index fields to speed up searches and reduce disk space consumption.

Kibana

There isn’t much to say here :-) We just set it up and it works. Fortunately, the developers made it possible to change the timezone setting in the latest version. Earlier, the user’s local time zone was used by default, which was very inconvenient because our servers are always set to UTC, and we are used to communicating by that standard.

Notification system

A notification system was one of our main requirements for a log collection system. We wanted a system that, based on rules or filters, would send out triggered alerts with a link to the page where you can see details.

In the ELK world there were two similar finished products:

Watcher is a proprietary product of the Elastic company that requires an active subscription. Elastalert is an open-source product written in Python. We shelved Watcher almost immediately for the same reasons as before: it’s not open source and is difficult to extend and adapt to our needs. During testing, Elastalert proved very promising, despite a few minuses (but these weren’t very critical):

  • It is written in Python. We like Python as a language for quickly writing “supporting” scripts, but don’t like to see it used in production as an end product.
  • It only lets you put together very rudimentary emails for the system to send out in response to events. For us, the look and convenience of emails is important, since we want others to use the system.

After playing around with Elastalert and examining its source code, we decided to write a PHP product with the help of our Platform Division. As a result, Denis Karasik Battlecat wrote a product designed to meet our requirements: it is integrated into our back office and only has the functionality we need.

For each rule, the system automatically generates the basic dashboard in Kibana and a link to it is included in the email. When you click on the link you will see the message and the graph for the time period stated in the notification.

Issues

At this stage, the first system release was ready for use. But issues cropped up more-or-less immediately.

Problem 1 (syslog + Docker)

The syslog daemon and the program usually communicate via the /dev/log Unix socket. As mentioned earlier, we put it into the container using standard features of Docker. This solution worked fine until we needed to reboot the syslog daemon.

Apparently, if you pass a particular file rather than a directory, when you delete or recreate a file on the host system, it will no longer be available inside the container. It turns out that any restart of the syslog daemon causes logs from Docker-containers to stop piping in.

If the entire directory is passed into the container, however, then the Unix socket can be inside it, and restarting the daemon will not break anything. But then the whole setup is complicated, because libc expects the socket to be at /dev/log.

The second option that we considered was to use UDP or TCP to send out logs. But then the same problem would occur: libc is only able to write in /dev/log. We would have had to write our own syslog client and at this point we didn’t want to do that.

In the end we decided to launch a syslog daemon in each container and continue to write in /dev/log using the standard libc functions openlog()/syslog().

This wasn’t a big problem, because our system administrators use an init system in each container anyway.

Problem 2 (blocking syslog)

In the devel cluster, we noticed that one of the daemons was freezing intermittently. When we enabled the daemon’s internal watchdog, we got a backtrace showing that the daemon was freezing in syslog() -> write().

==== WATCHDOG ====
tag: IPC_SNAPSHOT_SYNC_STATE
start: 3991952 sec 50629335 nsec
now: 3991953 sec 50661797 nsec
Backtrace:
/lib64/libc.so.6(__send+0x79)[0x7f3163516069]
/lib64/libc.so.6(__vsyslog_chk+0x3ba)[0x7f3163510b8a]
/lib64/libc.so.6(syslog+0x8f)[0x7f3163510d8f]
/local/meetmaker/bin/meetmaker-3.1.0_2782 | shard1: running(zlog1+0x225)[0x519bc5]
/local/meetmaker/bin/meetmaker-3.1.0_2782 | shard1: running[0x47bf7f]
/local/meetmaker/bin/meetmaker-3.1.0_2782 | shard1: running(storage_save_sync_done+0x68)[0x47dce8]
/local/meetmaker/bin/meetmaker-3.1.0_2782 | shard1: running(ipc_game_loop+0x7f9)[0x4ee159]
/local/meetmaker/bin/meetmaker-3.1.0_2782 | shard1: running(game+0x25b)[0x4efeab]
/local/meetmaker/bin/meetmaker-3.1.0_2782 | shard1: running(service_late_init+0x193)[0x48f8f3]
/local/meetmaker/bin/meetmaker-3.1.0_2782 | shard1: running(main+0x40a)[0x4743ea]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f3163451b05]
/local/meetmaker/bin/meetmaker-3.1.0_2782 | shard1: running[0x4751e1]
==== WATCHDOG ====

Having downloaded the libc sources and looked at the implementation of the syslog client, we realised that the syslog() function was blocking and any delays on the rsyslog side affect the daemons.

Something had to be done, and the sooner the better. But we didn’t have the time …

A few days later we stumbled upon the most unpleasant surprise in modern architecture — a cascade failure.

Rsyslog is configured by default to “throttle” senders if its internal queue fills up for some reason.

So we had a situation where one programmer didn’t notice that a test server had started sending a huge quantity of messages to the log. Logstash couldn’t cope with this influx: the main rsyslog queue overflowed and Logstash started reading messages from other rsyslogs very slowly. Because of this, other rsyslogs also overflowed and started to read daemon messages very slowly.

And daemons, as I said above, write in /dev/log synchronously and without timeout. The result was predictable: all the daemons that were writing in syslog at any significant frequency started to stall.

Another mistake was not telling the system administrators about the potential problem, meaning that it took more than an hour to root out the cause and disable rsyslog.

It turned out that we weren’t the only ones to come across these issues, and not just with rsyslog. Synchronous calls in a daemon’s event loop are an unaffordable luxury.

We had a few options.

  • Leave syslog and go back to one of the other options that had one daemon writing to the disk, and a completely different daemon reading from the disk.
  • Continue to write in syslog synchronously, but in a separate thread.
  • Write our own syslog client and send data to the syslog using UDP.

The best option seemed to be the first. But we didn’t want to spend time on it and quickly got to work on option three.
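
Our daemons are written in C, so the sketch below is only a Java-flavoured illustration of the idea behind option three (the class and method names are hypothetical): format a minimal RFC 3164-style message and fire it over UDP, so that a slow rsyslog can no longer stall the event loop the way a synchronous write to /dev/log did.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

final class UdpSyslogClient {
    private static final int FACILITY_LOCAL0 = 16;
    private final DatagramSocket socket;
    private final InetAddress syslogHost;

    UdpSyslogClient(String host) throws Exception {
        this.socket = new DatagramSocket();
        this.syslogHost = InetAddress.getByName(host);
    }

    void log(int severity, String ident, String message) throws Exception {
        int priority = FACILITY_LOCAL0 * 8 + severity;            // RFC 3164 PRI field
        String line = "<" + priority + ">" + ident + ": " + message;
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        // fire-and-forget: the datagram is handed to the kernel and we return immediately
        socket.send(new DatagramPacket(payload, payload.length, syslogHost, 514));
    }
}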

As far as Logstash was concerned, two startup options solved all the issues: increasing the number of worker threads and the number of lines processed in one batch (-w 24 -b 1250).

Plans for the future

In the near future we have plans to make a dashboard for our daemons. This dashboard will combine existing features with some new ones:

  • Display of daemon performance (so-called “traffic lights”) and basic statistics.
  • Graphs showing ERROR and WARNING lines in the logs.
  • Triggered alerts from the message system.
  • SLA monitoring, displaying the problematic services or requests.
  • Daemon stages. For example, what stage of loading it is in, the loading time, the duration of some periodic process etc.

This type of dashboard suits everyone, in my opinion: managers, programmers, administrators and those responsible for monitoring.

Conclusion

We built a simple system that collects the logs from all our daemons, letting us conveniently search through them and build graphs and visualizations. It also emails us about any problems.

The fact that we were able to promptly discover a lot of problems that previously would not have been found at all, or only after some time, speaks to the success of the system, as does the fact that other teams have started using the infrastructure.

Concerning load: currently in the course of a day we get anywhere from 600 to 2000 lines per second, with periodic bursts of up to 10,000 lines. The system handles this load without any problems.

The daily index size ranges from about ten to hundreds of gigabytes.

Some of you might focus on the system’s flaws or that some issues could have been circumvented if we had done something differently. This is, of course, true. But at the end of the day, we don’t program for the sake of programming. Our goal was achieved on a tight schedule and the system is so flexible that anything that ceases to be useful in the future can be easily improved or changed.

Marko Kevac, programmer in C/C++ Division

Shake Detector for Android with RxJava


It all began when I had the task of undoing a user action in the app when the device was shaken. The main problem was how to know that a shake had occurred. After a couple of minutes of searching, it became clear that one should subscribe to accelerometer events and then somehow try to detect shakes. Of course, there were some ready-made solutions for that. They were all quite similar, but none of them suited me so I wrote my own implementation. This was a class that subscribed to sensor events and changed its state with every event. After that, my colleagues and I fine-tuned the solution to avoid false positives, but as a result it began to look like something from a “Mad Max” movie. I promised that I would rewrite this mess when I had free time.

Recently I was reading articles about RxJava and remembered that task. Hmm, I thought, RxJava looks like a perfect tool for such a problem. Without thinking twice, I wrote a solution using RxJava. I was impressed by the result - the whole logic was only 8 (eight) lines of code! I decided to share my experience with other developers, and that’s how this article was born.

I hope that this simple example will help you decide whether to use RxJava in your projects. I will first explain how to setup the Android project with RxJava and then go through the development of a sample application step-by-step, explaining all the operators used. I am writing from the perspective that people reading this will have some experience with Android development itself, so the focus will be on using reactive programming.

The source code of the finished application is available on GitHub.

Let’s start!

Project setup

Adding RxJava dependency

To use RxJava, we should add these lines to the build.gradle:

dependencies {
    ...
    compile 'io.reactivex:rxjava:1.1.3'
    compile 'io.reactivex:rxandroid:1.1.0'
}

N.B: rxAndroid provides a Scheduler, which is bound to the UI thread.

Adding Lambdas support

RxJava is best when backed up with Lambdas. Without Lambdas, there is a lot of boilerplate code. There are two ways of adding Lambda support at the moment: using the Jack compiler from Android N Developer Preview or using the Retrolambda library. In both cases we should check that JDK 8 is installed first. I used Retrolambda in this example.

Android N Developer Preview

To use the Jack compiler from Android N Developer Preview, we can follow these instructions.

Add these lines to build.gradle:

android {
    ...
    defaultConfig {
        ...
        jackOptions {
            enabled true
        }
    }
    compileOptions {
        sourceCompatibility JavaVersion.VERSION_1_8
        targetCompatibility JavaVersion.VERSION_1_8
    }
}

Retrolambda

To add the Retrolambda library to the project there are instructions by Evan Tatarka at https://github.com/evant/gradle-retrolambda

buildscript {
    ...
    dependencies {
        classpath 'me.tatarka:gradle-retrolambda:3.2.5'
    }
}

apply plugin: 'com.android.application'
apply plugin: 'me.tatarka.retrolambda'

android {
    compileOptions {
        sourceCompatibility JavaVersion.VERSION_1_8
        targetCompatibility JavaVersion.VERSION_1_8
    }
}

N.B: Please note that in the original instructions Maven Central repository is recommended. You probably already have the JCenter repo in your project since it is used by default when a project is created by Android Studio. JCenter already contains all the required dependencies, so we should not add Maven Central.

Observable

So now we have all the tools, we can start development.

When you use RxJava, it all starts with getting an Observable. Let’s create a factory class that will create an Observable subscribed to sensor events, with the help of the Observable.create method:

public class SensorEventObservableFactory {
    public static Observable<SensorEvent> createSensorEventObservable(@NonNull Sensor sensor,
                                                                      @NonNull SensorManager sensorManager) {
        return Observable.create(subscriber -> {
            MainThreadSubscription.verifyMainThread();

            SensorEventListener listener = new SensorEventListener() {
                @Override
                public void onSensorChanged(SensorEvent event) {
                    if (subscriber.isUnsubscribed()) {
                        return;
                    }
                    subscriber.onNext(event);
                }

                @Override
                public void onAccuracyChanged(Sensor sensor, int accuracy) {
                    // NO-OP
                }
            };

            sensorManager.registerListener(listener, sensor, SensorManager.SENSOR_DELAY_GAME);

            // unregister listener in main thread when being unsubscribed
            subscriber.add(new MainThreadSubscription() {
                @Override
                protected void onUnsubscribe() {
                    sensorManager.unregisterListener(listener);
                }
            });
        });
    }
}

Now we have a tool to transform events emitted by any sensor into an Observable. But which sensor fits our task best? In the screenshot below, the first plot is showing values from the gravity sensor TYPE_GRAVITY, the second plot - TYPE_ACCELEROMETER, the third plot - TYPE_LINEAR_ACCELERATION.
As you can see, the device was rotated smoothly and then shaken.

We are interested in events emitted by the sensor with type Sensor.TYPE_LINEAR_ACCELERATION. They contain acceleration values with Earth gravity already subtracted.

@NonNull
private static Observable<SensorEvent> createAccelerationObservable(@NonNull Context context) {
    SensorManager mSensorManager = (SensorManager) context.getSystemService(Context.SENSOR_SERVICE);
    List<Sensor> sensorList = mSensorManager.getSensorList(Sensor.TYPE_LINEAR_ACCELERATION);
    if (sensorList == null || sensorList.isEmpty()) {
        throw new IllegalStateException("Device has no linear acceleration sensor");
    }
    return SensorEventObservableFactory.createSensorEventObservable(sensorList.get(0), mSensorManager);
}

Reactive magic

Now that we have an Observable with acceleration events, we can use all the power of RxJava operators.

Let’s check what “raw” values look like:

createAccelerationObservable(context)
        .subscribe(event -> Log.d(TAG, formatTime(event) + " " + Arrays.toString(event.values)));

This will produce output:

29.398 [0.0016835928, 0.014868498, 0.0038280487]
29.418 [-0.026405454, -0.017675579, 0.024353027]
29.438 [-0.032944083, -0.0029007196, 0.011956215]
29.458 [0.03226435, 0.022876084, 0.032211304]
29.478 [-0.0011371374, 0.022291958, -0.054023743]

As you can see, we have an event emitted by the sensor every 20ms. This frequency corresponds to the SensorManager.SENSOR_DELAY_GAME value passed as a samplingPeriodUs parameter when SensorEventListener was registered.

As a payload, we have acceleration values for all three axes but we’ll only use the X-axis projection values. They correspond to the gesture we want to detect. Some solutions use values from all three axes, so they trigger when the device is put on the table, for example (there is a significant acceleration for the Z axis when the device meets the table surface).

Let’s create a data class with only the necessary fields:

private static class XEvent {
    public final long timestamp;
    public final float x;

    private XEvent(long timestamp, float x) {
        this.timestamp = timestamp;
        this.x = x;
    }
}

Convert SensorEvent into XEvent and filter events with an acceleration absolute value exceeding some threshold:

createAccelerationObservable(context)
        .map(sensorEvent -> new XEvent(sensorEvent.timestamp, sensorEvent.values[0]))
        .filter(xEvent -> Math.abs(xEvent.x) > THRESHOLD)
        .subscribe(xEvent -> Log.d(TAG, formatMsg(xEvent)));

Now, to see some messages in the log we need to shake the device for the first time.

It’s really funny to see someone debugging the Shake Detection - they are constantly shaking their phone. You can only imagine what comes to my mind.

55.347 19.030302
55.367 13.084376
55.388 -15.775546
55.408 -14.443999

We only have events with significant acceleration values for the X axis in the log.

Now the most interesting part begins. We need to track the moments when acceleration changes to the opposite direction. Let’s try to understand when this happens. Imagine that a hand with a phone is being accelerated to the left; the acceleration projection on the X axis has a negative sign. Then the hand begins to slow its motion and stops, the acceleration projection on the X axis has a positive sign. It means that one shake corresponds to one sign change of acceleration projection. Let’s form a so-called “sliding window”: actually it’s just a buffer that contains two values, the current one and a previous one:

createAccelerationObservable(context)
        .map(sensorEvent -> new XEvent(sensorEvent.timestamp, sensorEvent.values[0]))
        .filter(xEvent -> Math.abs(xEvent.x) > THRESHOLD)
        .buffer(2, 1)
        .subscribe(buf -> Log.d(TAG, getLogMsg(buf)));

And here’s our log:

[43.977 -15.497713; 44.017 21.000145]
[44.017 21.000145; 44.037 19.947767]
[44.037 19.947767; 44.057 19.836182]
[44.057 19.836182; 44.077 20.659754]
[44.077 20.659754; 44.098 -16.811298]
[44.098 -16.811298; 44.118 -15.6345]

Excellent - as we can see, each event is now grouped with the previous one. We can easily filter pairs of events with different signs.

createAccelerationObservable(context)
        .map(sensorEvent -> new XEvent(sensorEvent.timestamp, sensorEvent.values[0]))
        .filter(xEvent -> Math.abs(xEvent.x) > THRESHOLD)
        .buffer(2, 1)
        .filter(buf -> buf.get(0).x * buf.get(1).x < 0)
        .subscribe(buf -> Log.d(TAG, getLogMsg(buf)));


[53.888 -16.762777; 53.928 20.83315]
[53.988 19.87952; 54.028 -16.735554]
[54.089 -16.46596; 54.109 21.682497]
[54.169 20.355597; 54.209 -16.634022]
[54.269 -16.122211; 54.309 21.806463]

Now every event corresponds to one shake. Only 4 operators are used and we can already detect rapid moves! But false triggering is still possible. Say the user was not shaking their device intentionally, but just took it in the other hand. There is a simple solution to avoid that: ask the user to shake the device several times within a short time period. Let’s introduce the parameters SHAKES_COUNT (the number of shakes) and SHAKES_PERIOD (the amount of time all shakes are to be made in). I have found that the optimal values for these parameters are 3 shakes in 1 second. With other values, either some false triggering is possible or the user has to shake the device too hard.

So we want to detect the case when 3 shakes have been made within 1 second. We no longer need the acceleration values; only the timestamp of each event matters. Let's transform our buffered XEvents into the timestamp of the last event in each buffer:

.map(buf -> buf.get(1).timestamp / 1000000000f)

The timestamp values in SensorEvent are in nanoseconds (really, really precise!), so I divide the value by 10^9 to get seconds. Now let's apply the familiar sliding-window trick again, this time with different parameters:

.buffer(SHAKES_COUNT, 1)

In other words, for each event we'll get a list containing that event along with the two previous ones. Finally, we'll keep only the lists that fit into 1 second:

.filter(buf -> buf.get(SHAKES_COUNT - 1) - buf.get(0) < SHAKES_PERIOD)

If an event has passed the last filter, we know the user has shaken their device 3 times within 1 second. But let's assume our dear user is over-enthusiastic and continues to shake the device diligently. We would then receive an event on every subsequent shake, yet we want to trigger only once per gesture. A simple solution is to ignore events for SHAKES_PERIOD after a gesture has been detected:

.throttleFirst(SHAKES_PERIOD, TimeUnit.SECONDS)

It’s done! This Observable can now be used in our app. Here is the final code snippet:

public class ShakeDetector {

    public static final int THRESHOLD = 13;
    public static final int SHAKES_COUNT = 3;
    public static final int SHAKES_PERIOD = 1;

    @NonNull
    public static Observable<?> create(@NonNull Context context) {
        return createAccelerationObservable(context)
                .map(sensorEvent -> new XEvent(sensorEvent.timestamp, sensorEvent.values[0]))
                .filter(xEvent -> Math.abs(xEvent.x) > THRESHOLD)
                .buffer(2, 1)
                .filter(buf -> buf.get(0).x * buf.get(1).x < 0)
                .map(buf -> buf.get(1).timestamp / 1000000000f)
                .buffer(SHAKES_COUNT, 1)
                .filter(buf -> buf.get(SHAKES_COUNT - 1) - buf.get(0) < SHAKES_PERIOD)
                .throttleFirst(SHAKES_PERIOD, TimeUnit.SECONDS);
    }

    @NonNull
    private static Observable<SensorEvent> createAccelerationObservable(@NonNull Context context) {
        SensorManager mSensorManager = (SensorManager) context.getSystemService(Context.SENSOR_SERVICE);
        List<Sensor> sensorList = mSensorManager.getSensorList(Sensor.TYPE_LINEAR_ACCELERATION);

        if (sensorList == null || sensorList.isEmpty()) {
            throw new IllegalStateException("Device has no linear acceleration sensor");
        }

        return SensorEventObservableFactory.createSensorEventObservable(sensorList.get(0), mSensorManager);
    }

    private static class XEvent {
        public final long timestamp;
        public final float x;

        private XEvent(long timestamp, float x) {
            this.timestamp = timestamp;
            this.x = x;
        }
    }
}

Usage

In my example I play a sound when a shake gesture is detected. Let's add the fields we need in the Activity class (mShakeSubscription will hold the subscription returned by subscribe(), so that we can unsubscribe later):

private Observable<?> mShakeObservable;
private Subscription mShakeSubscription;

Initialise it in the onCreate method:

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    mShakeObservable = ShakeDetector.create(this);
}

Subscribe in the onResume method:

@Override
protected void onResume() {
    super.onResume();
    mShakeSubscription = mShakeObservable.subscribe((object) -> Utils.beep());
}

And don’t forget to unsubscribe in onPause:

@Override
protected void onPause() {
    super.onPause();
    mShakeSubscription.unsubscribe();
}

That’s it!

Conclusion

As you can see, in just a few lines of code we created a solution that detects a shake gesture. It is compact and easy to read and understand. Compare it with conventional solutions, e.g. seismic by Jake Wharton. RxJava is a great tool, and when properly applied it can produce great results. I hope this article gives you the impetus to learn RxJava and use reactive principles in your projects.

May stackoverflow.com be with you!

Arkady Gamza, Android developer.

Segregating Android Phones between Docker Containers


We wanted to move our Android device tests to a Linux host: it’s cheaper hardware, and we find that our Mac Mini build machines tend to fumble Android USB connections, making phones mysteriously vanish in the middle of a test run. We mostly use Docker containers to manage our Linux servers, and we decided to try to build an Android test container that could test with real phones, cloned once for each model/group of phones, so it would fit into the existing server scheme.

A quick sidebar: one of the benefits of running on Linux over running on Mac was that because it’s a more open system, it showed us one of the causes of the phones’ mysterious disappearance during the tests: disconnections lasting a fraction of a second. This allowed us to patch our test layer, adding a retry in the right place which has resolved pretty much all of our remaining problems in that regard. I will be encouraging my colleague to write that up shortly.

Docker

Docker is a system that combines a means of building and distributing software configurations with an operating-system framework that keeps each ‘container’ of software isolated from the rest of the computer: separate file system, separate process space, etc. Container processes share the same operating-system instance, but the operating system is much stricter than usual about who can talk to what, so the overall effect is similar to a set of virtual machines.

Clarifying diagrams from Docker’s website:

A VM system runs other OSs on top of the host OS.

A Docker system runs containers on top of one OS.

Segregating adb/adbd

We wanted each container to control its own set of phones. The most natural way of doing this was to assign each group of USB sockets to a different container - devices plugged into the computer’s front panel appear in the directory /dev/bus/usb/001, so we allow container 1 to see that directory; devices plugged into the back panel appear in /dev/bus/usb/002, so container 2 is allowed to see that directory, and we ordered an expansion card for more connections.

So far, so good, but Android’s ADB command talks to the phones through a daemon on the default port 5037 which is machine-wide, so the first container to run adb would start the adb daemon (adbd) and cause all the other containers to connect to that daemon and see the first container’s phones. This could have been solved with docker networking (each docker container gets its own IP, and hence its own set of ports), but it suited us to use a different mechanism: each container was configured with a different value of the environment variable ANDROID_ADB_SERVER_PORT. We allocated a port to each container so each container starts its own adb daemon, which can only see that container’s own phones.
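
In practice this means that any tooling which shells out to adb inside a container simply needs ANDROID_ADB_SERVER_PORT set in its environment. A minimal sketch in Java (the class name and the port number 5038 are purely illustrative, not the values we actually use):

import java.io.IOException;

// Hypothetical sketch: run "adb devices" against a container-specific adb daemon
// by setting ANDROID_ADB_SERVER_PORT for the child process.
public class AdbOnCustomPort {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("adb", "devices");
        pb.environment().put("ANDROID_ADB_SERVER_PORT", "5038"); // example port
        pb.inheritIO(); // forward adb's output to this process's console
        Process adb = pb.start();
        adb.waitFor(); // the first invocation also starts the daemon on that port
    }
}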

While developing this, we found that we needed to be careful not to run ‘adb’ at the host-machine level without setting ANDROID_ADB_SERVER_PORT, because a host-level adbd that could see all the USB ports would ‘steal’ phones from the Docker containers: phones can only talk to one ‘adbd’ at a time.

If we were only using emulators, separate adbd processes would suffice. However, we use real devices, so…

Updating containers with hot-plugged USB devices

The second problem - and the main reason for writing this article - was that when a phone was rebooted as part of our normal build process, it vanished from the container’s file system (and hence from adb’s list of phones) and never came back!

On the host machine, you can see phones being added and removed by keeping an eye on the files in /dev/bus/usb: the system creates and deletes device nodes to match the phones:

while sleep 3; do
    find /dev/bus/usb > /tmp/a
    diff /tmp/a /tmp/b
    mv /tmp/a /tmp/b
  done

Unfortunately, not only do these creations and deletions not happen within the Docker containers, but even if you set things up to create and delete those nodes, the nodes you create don’t actually talk to the phones!

The sledge-hammer we used to resolve this issue was putting our containers in --privileged mode, and letting them see the whole /dev/bus/usb directory as the host machine sees it.

Now we needed a different mechanism to segregate the phones by bus. I downloaded the Android source, and trivially patched platform/system/core/adb/usb_linux.cpp

  std::string bus_name = base + "/" + de->d_name;
+ const char* filter = getenv("ADB_DEV_BUS_USB");
+ if (filter && *filter && strcmp(filter, bus_name.c_str())) continue;
  std::unique_ptr<DIR, int(*)(DIR*)> dev_dir(opendir(bus_name.c_str()), closedir);
  if (!dev_dir) continue;

Each container was given a different ADB_DEV_BUS_USB value to denote the bus it should pay attention to.

Aside: although the patch was trivial, building adb required some trial and error, because most of the build documentation assumes you want to build everything. My final recipe was this (in a case-sensitive filesystem - my work laptop is a Mac):

cd src/android-src
source build/envsetup.sh
lunch 6
vi system/core/adb/usb_linux.cpp
JAVA_NOT_REQUIRED=true make adb
out/host/linux-x86/bin/adb

Multiplexing USB ports

So far so good, but when we installed our USB expansion card we found that there was only one USB bus on it, taking our machine to three buses, whereas we had five groups of devices we wanted to segregate.

Having been inside ADB’s source code already, I decided simply to add another environment variable: ADB_VID_PID_FILTER takes a list of vid:pid pairs, and makes adb ignore any device that doesn’t match.

The patch is below. There may be a slight race condition, when multiple adbd processes listening to the same USB bus try to scan the phones, but in practice this hasn’t proven to be a problem.

diff --git a/adb/usb_linux.cpp b/adb/usb_linux.cpp
index 500898a..92e15ca 100644
--- a/adb/usb_linux.cpp
+++ b/adb/usb_linux.cpp
@@ -115,6 +115,71 @@ static inline bool contains_non_digit(const char* name) {
     return false;
 }
 
+static int iterate_numbers(const char* list, int* rejects) {
+    const char* p = list;
+    char* end;
+    int count = 0;
+    while (true) {
+        long value = strtol(p, &end, 16);
+        //printf("%d, %p ... %p (%c) = %ld (...%s)\n", count, p, end, *end, value, p);
+        if (p == end) return count;
+        p = end + 1;
+        count++;
+        if (rejects) rejects[count] = value;
+        if (!*end || !*p) return count;
+    }
+}
+
+int* compute_reject_filter() {
+    char* filter = getenv("ADB_VID_PID_FILTER");
+    if (!filter || !*filter) {
+        filter = getenv("HOME");
+        if (filter) {
+            const char* suffix = "/.android/vidpid.filter";
+            filter = (char*) malloc(strlen(filter) + strlen(suffix) + 1);
+            *filter = 0;
+            strcat(filter, getenv("HOME"));
+            strcat(filter, suffix);
+        }
+    }
+    if (!filter || !*filter) {
+        return (int*) calloc(sizeof(int), 1);
+    }
+    if (*filter == '.' || *filter == '/') {
+        FILE* f = fopen(filter, "r");
+        if (!f) {
+            if (getenv("ADB_VID_PID_FILTER")) {
+                // Only report failure for non-default value
+                printf("Unable to open file '%s'\n", filter);
+            }
+            return (int*) calloc(sizeof(int), 1);
+        }
+        fseek(f, 0, SEEK_END);
+        long fsize = ftell(f);
+        fseek(f, 0, SEEK_SET); // same as rewind(f);
+        filter = (char*) malloc(fsize + 1); // Yes, it's a leak.
+        fsize = fread(filter, 1, fsize, f);
+        fclose(f);
+        filter[fsize] = 0;
+    }
+    int count = iterate_numbers(filter, 0);
+    if (count % 2) printf("WARNING: ADB_VID_PID_FILTER contained %d items\n", count);
+    int* rejects = (int*) malloc((count + 1) * sizeof(int));
+    *rejects = count;
+    iterate_numbers(filter, rejects);
+    return rejects;
+}
+
+static int* rejects = 0;
+static bool reject_this_device(int vid, int pid) {
+    if (!*rejects) return false;
+    for (int len = *rejects; len > 0; len -= 2) {
+        //printf("%4x:%4x vs %4x:%4x\n", vid, pid, rejects[len - 1], rejects[len]);
+        if (vid == rejects[len - 1] && pid == rejects[len]) return false;
+    }
+    return true;
+}
+
 static void find_usb_device(const std::string& base,
         void (*register_device_callback)
                 (const char*, const char*, unsigned char, unsigned char, int, int, unsigned))
@@ -127,6 +192,8 @@ static void find_usb_device(const std::string& base,
         if (contains_non_digit(de->d_name)) continue;
 
         std::string bus_name = base + "/" + de->d_name;
+        const char* filter = getenv("ADB_DEV_BUS_USB");
+        if (filter && *filter && strcmp(filter, bus_name.c_str())) continue;
         std::unique_ptr<DIR, int(*)(DIR*)> dev_dir(opendir(bus_name.c_str()), closedir);
         if (!dev_dir) continue;
@@ -176,6 +243,12 @@ static void find_usb_device(const std::string& base,
             pid = device->idProduct;
             DBGX("[ %s is V:%04x P:%04x ]\n", dev_name.c_str(), vid, pid);
 
+            if (reject_this_device(vid, pid)) {
+                D("usb_config_vid_pid_reject");
+                unix_close(fd);
+                continue;
+            }
+
             // should have config descriptor next
             config = (struct usb_config_descriptor*) bufptr;
             bufptr += USB_DT_CONFIG_SIZE;
@@ -574,6 +647,7 @@ static void register_device(const char* dev_name, const char* dev_path,
 static void device_poll_thread(void*) {
     adb_thread_setname("device poll");
     D("Created device thread");
+    rejects = compute_reject_filter();
     while (true) {
         // TODO: Use inotify.
         find_usb_device("/dev/bus/usb", register_device);

I hope all this saves you some time, if you’re engaged in a similar project. Feel free to ask for clarifications in the comments below.

Windows Phone - an experimental platform


In the last three years our team has tripled in numbers, and is now newly focused on becoming the company’s experimental platform. In this article I will tell you the story of our team: how our Windows Phone (WP) team became an experimental platform, the problems that we faced and how we solved them.

We have created really useful solutions that have helped us implement 104 new features (including A/B/C/D tests) in the last 6 months, with only 6 people in the team.

Warning! Knowledge and solutions from this article might affect your workflow.

Reborn

Most companies don’t support the Windows Phone platform because Windows Phone’s market share is much smaller than iOS and Android.

However, Windows Phone is the third-largest mobile operating system on the market, and our company has supported it from the beginning, although it received far less attention than iOS and Android.

We had an outdated application based on Silverlight and for a long time we had no dedicated team for this project.

In 2014 Timur, a great Windows Phone developer, joined the company to support the app and my role was to test it.

We discovered that the code needed a lot of work which meant changing the whole application structure. A lot of time was spent supporting it and so we decided to write a completely new application.

Reborn: part 1 “Hot or Not”

Our new project was called Hot or Not, an app similar to Badoo but much simpler. Timur and I made the entire Hot or Not application in just three months, using a single Git branch that was eventually merged into the master branch, with TeamCity as our build machine. Every change the developer pushed to the branch was automatically built into a new version of the application, so we had a full history of builds.

Reborn: part 2 - “Return of the Jedi”

After a couple of Hot or Not releases, we started to build a brand-new Badoo application.

Having previously done some work for Hot or Not, we decided to make two applications with a shared core in the same GIT repository (in the same Visual Studio project). We had the opportunity to develop two mobile applications at the same time. However, in such a scheme, any mistakes that were made on Hot or Not automatically transferred to Badoo and vice versa.

This doubled my workload and responsibility because the functionality was different and the bugs behaved differently. We were still supporting Windows Phone 7 and 8, and because of Visual Studio specifics it was like having two completely different applications. Have you ever tested four mobile applications at the same time? I can tell you it’s not easy!

It was finally time for us to improve our own workflow! First of all, we looked at our much larger mobile client teams and at the Git flow used by the iOS and Android teams, which is quite a heavyweight system.


Git flow

In the beginning you have two branches: Master and Dev.

1st step - Feature Branch Testing. The developer implements a new feature in a branch created from the Dev branch. When it’s done, he resolves the task and passes it to QA. If the tester finds a bug, the task is reopened and the developer fixes it. When the task passes QA, the feature branch is merged back into the Dev branch.

2nd step - Integration Testing. Once 5 to 10 feature branches have been merged into the Dev branch, an integration branch is cut from it. During this stage we look for bugs caused by features interacting with each other; any bugs found are treated as high priority and fixed. When all the tasks have been tested and regression testing is finished, the branch is merged into the Dev branch, and then into the Master branch.

3rd step - Release Branch. Imagine you have five integrations in your sprint: this means that 50 tasks will go live in the next release, and you will need time for release testing.

The release branch is cut from the Master branch. This step is quite similar to integration testing, except that it covers all the tasks in the release. When each of the 50 tasks has passed QA and release testing is done, you release a new version of the application and merge the release branch into Dev and then into Master.

However, this Git workflow only works well for big teams, and there were just two people in ours. We started to rethink it and identified two main requirements that our ideal flow should meet:

  • QAs must not block developers and vice versa. We shouldn’t block each other or have to switch to different tasks because this creates a situation where only half the team is working.
  • Each task must be tested several times (ideally by different people). Two stages of testing were enough in our case.

We concluded that the best flow for us would be a typical Git flow with one of the testing steps removed. We simply dropped feature branch testing and named the result the Windows Phone flow.

With such a great workflow, we were able to develop a brand new Badoo application in just three months.

Before moving to the second part of my story I would like to take you through some of our experiments.

Experiments

The best platform for conducting experiments is the web, and at Badoo we run a lot of experiments on our desktop website. Why are we able to do that? Because our web team can develop and deliver to users really fast. We do proper continuous integration for our web platform, which means we deploy new builds worldwide twice a day, with each rollout taking under a minute.

But as mobile platforms grew and gained a lot of users, our company’s focus shifted and mobile clients became our major platform. This change created a problem: experiments we’d previously validated on the website sometimes didn’t work on mobile, so our product managers decided that one of our mobile client teams should become an experimental platform.

This left one question: Who from the WP, iOS or Android team should form this platform?

Small number of users

3% of our mobile client users are on Windows Phone, but that still represents hundreds of thousands of users, which means we have enough to create test groups and conduct experiments. Our experiments aren’t only the usual A/B tests; we also run A/B/C/D tests, where we have four groups including the control group.

It also means that if we roll out something that doesn’t work, it doesn’t affect the activity of the majority of our users.

Short review time

Unlike iOS, where Apple’s review can take up to two days, the Microsoft Store takes around 30 minutes to review an app, and according to Microsoft it takes only 16 hours for it to become available for download worldwide. In my experience, it took just two hours for it to become available in all the main markets.

No big device fragmentation

WP doesn’t have much device or OS fragmentation compared to Android because of Microsoft’s strict restrictions. Windows Phone doesn’t allow custom versions of their OS, and the type of hardware that can be used in mobile devices running their OS has to comply with their minimum specs. This is why we have much fewer device- and OS-specific bugs.

The Windows Phone platform was made after iOS and Android platforms were launched. Therefore, one big difference in the Windows Phone OS is their “Metro” design, which is all about tiles. This means that the design is really simple.

Things that were rounded on other platforms are square on WP and there are almost no complex animations, which makes development much easier and faster. On the other hand, it is still a mobile OS with a touch screen, push notifications and other standard mobile OS stuff.

Revolution

In 2015 a new team member joined us to help speed up the process of implementing even more new features. We then became an experimental mobile platform. In just one year we went from being an abandoned platform to being the most innovative platform in our company!

Now we implement and test new ideas quickly. If everything goes well and all the metrics go up, this will become a new feature and will be implemented on bigger mobile platforms such as iOS and Android. However, if the new idea fails during testing, our product managers will try to change it and if it doesn’t work, it will simply be removed. In just six months we’ve implemented 104 new features (including A/B/C/D tests).

In the year and a half after we started, we encountered a lot of interesting problems. First of all, we had to implement all the experiments really quickly, which meant we had to optimise some of our workflow processes.

Stats

If you conduct an experiment, you want results. For us, the only way to understand the impact of a new feature is to collect statistics about it and analyse the results of its performance.

How do we start?

We have a dedicated Business Intelligence team that has developed our own statistics tool called HotPanel. Thanks to HotPanel we have a fully configurable statistics-tracking tool that meets the Product team’s requirements, and we can extend it to fit our needs.

We also use AppAnnie to see our users’ reviews from the app stores and monitor complaints so we can quickly find and fix problems.

After we’ve released the app or feature, one of the most important metrics we measure is the crash rate. If you don’t know how it is going, your update may fail, user activity may decrease, and you will not understand whether the cause was crashes or something else. Thanks to HockeyApp we can collect crash logs from real users.

Go go go

Now that we’ve finally understood the impact of our experiments and know how the features perform (or not), our next step improves the speed with which we implement new features. To speed up this process you can of course add more members to the team, however it is more effective to improve your current processes.

When you test a client-server application you find bugs, but it’s often unclear whether the problem lies in the server code or the client code. During the investigation you will also need to consult the server logs.

To simplify this task, we created a special, very convenient interface for viewing server logs. You can filter the logs by user ID or device ID, and view them in real time or browse the log history.

You may have problems preparing test environments or test accounts.

Example:

You have to test an email that is going to be sent a week after registration. Your real test will be:

  1. Register a new account
  2. Wait for a week
  3. Receive this email
  4. Check it

But this is not a productive use of QA resources.

The first thing that comes to mind to solve this problem is to go to the developer and kindly ask him to change something in the database so the email arrives sooner. But we all know that disturbing developers is not always the best idea. So, to make everybody happy, we built a new tool that we called QAAPI, an interface with a lot of API methods.

In the email testing example, QA should just use one QAAPI method to get the email in a second. Our developer Dmitry made QAAPI, but this is a topic for another article.

Usually the server is developed faster than the mobile clients, but because of our rapid development and testing pace, the mobile client is sometimes ready even before the server team has started working on the task. To be able to develop and test new functionality, the client somehow needs to get the correct server responses to its new requests. For this we created the Server Mock utility. The idea is simple: developers or QA can create a special server environment where they mock an exact server response to an exact client request.

Also, it helps to test situations when the server is broken, as we can mock any server response.

These things are possible thanks to a very good client-server protocol developed by each of the dev teams before client or server development has started.

Sometimes it will be faster to ask client-side developers to simplify your testing by creating some “extra” abilities inside the application, such as:

  • Being able to see the client log, sending it via email.
  • Opening any screen of the application, which may be complicated because the server controls it, etc…

For such “tweaks” we’ve got a dedicated section in our application, which is called the Debug_menu. You can also clear the photo cache, overall cache or even crash an application and much more.

Sometimes people from other teams need test versions of our application for various reasons, like:

  • A product manager wants to check how an unreleased feature works.
  • Designers want to do visual QA.
  • A translator wants to install the application to see how the translated text looks.

To make installing test applications easier, each mobile client team has their own internal app store. This allows us to install test builds in just one click.

More people

Because developing and supporting test cases takes a lot of time, we don’t have them.

In early 2016 our WP QA team grew by three additional people, which presented a problem: I had to share my testing notes. The idea was to share all the notes we had previously stored privately in our own notebooks, which helped us during testing.

We tried lots of software, like Google Docs, but the main problem was that it’s hard to find files in folders among all the other docs you may have. We also tried Evernote, but it had a serious disadvantage: you have to share each note you create individually.

As we are the Windows Phone team, we decided to try out Microsoft OneNote. This suits our needs, and it even allows us to deep link to the exact note in native applications! So we started using a shared notebook and created really good testing documentation. We also created a special regression checklist which contains links to our brilliant shared notes. If we had a new joiner today, we would just need to give him this document and it would almost be enough for him to start testing applications properly.

Today we have a weekly release cycle. Our one and only rule is that the release branch should be created and tested (excluding regression) one day before release; this allows us to publish our Windows Phone application during working hours.

Conclusion

Being a team of 6 people is an advantage: we’re more flexible, free to improvise, and able to make decisions quickly, which is not the case in a big team. The size of your team doesn’t matter: if you know how to get the best out of each team member and improve your processes, you can reach optimum results.

I hope that by sharing these experiences with you, I can help you and your team solve similar challenges. I’m very curious to hear what you think about our experiment, so please let me know!

Vyacheslav Loktik, QA Lead Windows Phone
