We often have opportunities for students to work solely on their thesis project (part of their MSc curriculum), on internal projects, or on a combination of both.
To give you an idea of the topics we offer, you can read more about the topics of the previous academic year here:
Pointlogic uses probabilistic programming extensively to perform various types of analytical inference; currently that inference is performed in Stan. Probabilistic programming is a tool for statistical modeling, closely linked to Bayesian statistics: it seeks to describe probabilistic models and then perform inference in those models.
One of our current technical challenges is that our models are computationally expensive and we do not have satisfactory options for reducing the run time. The goal of this project is to evaluate the various algorithms and technical frameworks available for performing probabilistic programming.
We have a few potential avenues that we would like to explore but are open to other ideas as well:
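As a point of reference for what "inference" means here, the sketch below implements random-walk Metropolis, one of the simplest MCMC algorithms of the family underlying tools like Stan, on a toy one-parameter model. The model, data and tuning constants are all illustrative, not taken from our production models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 observations from a normal with unknown mean (known sd = 1).
data = rng.normal(loc=2.0, scale=1.0, size=50)

def log_posterior(mu):
    # Flat prior on mu; Gaussian likelihood with sd = 1.
    return -0.5 * np.sum((data - mu) ** 2)

# Random-walk Metropolis: propose a jump, accept with probability
# min(1, posterior ratio). Every sample costs one posterior evaluation,
# which is where the run time of large models goes.
samples = []
mu = 0.0
lp = log_posterior(mu)
for _ in range(5000):
    proposal = mu + rng.normal(scale=0.5)
    lp_prop = log_posterior(proposal)
    if np.log(rng.uniform()) < lp_prop - lp:
        mu, lp = proposal, lp_prop
    samples.append(mu)

posterior_mean = np.mean(samples[1000:])  # discard burn-in
```

Real frameworks (Stan's NUTS, variational inference, etc.) are far more sophisticated, but they share this structure of repeated, expensive posterior evaluations, which is exactly what the project aims to speed up.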
The goal of this project is to review a current component of modelling with survey data at Pointlogic. Our respondent-level brand health models run almost entirely on survey data. The problem is that, when modelling with survey data, we have no true measured information on when an individual has been exposed to advertising (except for cases where digital ads are tagged). We have a methodology in place to help here, Contact Estimation, which combines a number of pieces of information, such as a respondent's typical media consumption behavior and where and when an advertiser advertised. This information is combined to produce an estimate of the expected number of contacts an individual will have had, broken down by media channel.
We have many potential areas for exploration here:
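To make the idea concrete, here is a deliberately simplified sketch of a contact estimate. The channel names, consumption probabilities and spot counts are invented for illustration; the actual Contact Estimation methodology combines far richer inputs.

```python
# Hypothetical inputs for one respondent:
# - consumption[c]: probability the respondent is exposed to a given
#   spot on channel c, derived from their typical media behavior
# - spots[c]: number of ad spots the advertiser ran on channel c
channels = ["tv", "radio", "online"]
consumption = {"tv": 0.30, "radio": 0.10, "online": 0.55}
spots = {"tv": 40, "radio": 25, "online": 60}

# Expected contacts per channel: spots times exposure probability per spot.
expected_contacts = {c: spots[c] * consumption[c] for c in channels}
total_contacts = sum(expected_contacts.values())
```

The project would examine how such per-channel estimates are built and where machine learning or richer data could improve them.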
In many projects, Pointlogic uses dimension reduction to reduce the number of features used in a model. The most common technique used is principal component analysis (PCA). In this project, we will use machine-learning methods (e.g. auto-encoders) to do dimension reduction and compare their performance with commonly used methods.
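As a baseline for such a comparison, the sketch below performs PCA via the SVD on synthetic low-rank data and measures the variance explained; an auto-encoder would be evaluated against the same reconstruction criterion. All data and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic survey-like data: 200 respondents, 10 features, lying close
# to a 3-dimensional subspace plus a little noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))
X = X - X.mean(axis=0)  # center before PCA

# PCA via SVD: keep the top k components.
k = 3
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = X @ Vt[:k].T       # k-dimensional representation
X_approx = X_reduced @ Vt[:k]  # reconstruction from k components

# Fraction of variance captured by the first k components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
```

An auto-encoder with a k-unit bottleneck would replace the linear maps above with learned non-linear encoders and decoders, and the comparison is then reconstruction error (and downstream model performance) at equal k.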
An important type of project within Pointlogic is 'data fusion'. As explained below, dimension reduction is an important step in this process, and we would like to examine the added value machine learning can bring to our methodology.
Data fusion is a method of integrating data sets using statistical analytics and modeling in order to create a single data set that incorporates the attributes from both underlying sources. The 'integrated' dataset allows us to perform analyses that require variables from both sides. For media measurement (panel) integration, the following example of a 1-1 fusion is illustrative:
To create Data Set 3, we identify like variables within both Data Set 1 and Data Set 2; these are called linking variables. Using statistical analysis to understand how each of these variables correlates with both television viewing and radio listening behavior, we establish importance weights for each variable, including which variables must match exactly between the two data sets (always age/gender and various demographic traits).
Once the linking variables and their importance weights are established, we can define a distance function between respondents. The next step, linking donors to recipients while minimizing total distance, is known in Operations Research as a generalized assignment problem, and it is well studied. Standard algorithms can solve the assignment problem very efficiently. The combined dataset can be used to estimate the overlap between TV stations and radio stations, or the joint reach of a cross-media campaign.
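For the 1-1 fusion illustrated above, the matching step reduces to a linear assignment problem, which can be sketched with SciPy's solver. The linking variables, importance weights and panel sizes below are invented for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)

# Hypothetical standardized linking variables for 5 donor and
# 5 recipient respondents; a 1-1 fusion pairs each donor with one recipient.
donors = rng.normal(size=(5, 4))
recipients = rng.normal(size=(5, 4))

# Illustrative importance weights for the 4 linking variables.
weights = np.array([2.0, 1.0, 1.0, 0.5])

# Distance matrix: weighted Euclidean distance between every
# donor/recipient pair.
diff = donors[:, None, :] - recipients[None, :, :]
dist = np.sqrt((weights * diff ** 2).sum(axis=2))

# Solve the assignment problem: a 1-1 matching minimizing total distance.
row_ind, col_ind = linear_sum_assignment(dist)
total_distance = dist[row_ind, col_ind].sum()
```

Exact-match constraints (e.g. age/gender) can be imposed by setting the corresponding distances to a prohibitively large value before solving.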
In many countries, Nielsen is responsible for providing TV ratings data: information on how many people watched specific shows on television. These TV ratings are considered a 'currency', as advertisers buying spots for TV advertising pay based on the number of people watching during that time. Therefore, having robust, reliable ratings is crucial for Nielsen and the TV buying market. TV ratings are measured via a panel study; the size of this panel varies by country.
The aim of this project is to identify individuals who demonstrate unnatural viewing behavior, using a set of sophisticated indicators yet to be defined. Specifically, the task is to define a scientific methodology and tools for detecting suspicious individuals and, based on this set of objective indicators, to conclude which individuals show proven false behavior and should be excluded from the panel.
Considerations to take into account: instructions given to individuals by third parties to watch specific content may vary over time, may differ per channel, and may be quite sophisticated so as not to be revealed easily.
The aim is to explore the capabilities of machine learning and various statistical methods. The available information consists of viewing data for each individual by channel, at the level of individual viewing statements, together with demographic characteristics on a daily basis. The analysis should be performed over the whole lifetime of the panel at regular intervals, so that individuals showing false behavior are detected in a timely manner as soon as they appear. The moment at which a household is approached and contacted by third parties with instructions to watch something specific is not known; it could be at the start of the household's inclusion in the panel or on any day during its presence in the panel.
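As a simple starting point, the sketch below flags panellists whose mean viewing time is an extreme outlier and whose day-to-day variation is implausibly low, two of many possible indicators. The data, the injected "suspicious" panellist and the thresholds are all synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical daily viewing minutes for 100 panellists over 60 days.
viewing = rng.normal(loc=120, scale=30, size=(100, 60)).clip(min=0)

# Inject one suspicious panellist: implausibly constant, heavy viewing.
viewing[7] = 480.0

# Indicator 1: z-score of each panellist's mean viewing against the panel.
means = viewing.mean(axis=1)
z = (means - means.mean()) / means.std()

# Indicator 2: within-person day-to-day variation (instructed viewing
# is often unusually regular).
stds = viewing.std(axis=1)

# Flag panellists who are extreme on both indicators.
suspicious = np.where((np.abs(z) > 3) & (stds < 1.0))[0]
```

A real methodology would combine many such indicators over time (per channel, per daypart, against demographic peers) and control the false-positive rate before excluding anyone from the panel.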
The goal of this project is to apply particle swarm optimization to a large optimization problem with tens of thousands of real-valued decision variables, linear restrictions, and a non-linear objective function that is very computationally demanding to evaluate. The application is planning an advertising campaign: the decision variables correspond to budgets allocated to a large set of media, and the objective function represents the ROI of the campaign.
The main reason why we are interested in particle swarm optimization is that it can take advantage of a parallel computing environment. Working on this project, you will be implementing and tuning a number of variations of the algorithm, and evaluating these on a benchmark set of optimization problems.
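A minimal version of the algorithm, applied to the sphere benchmark function rather than our ROI objective, might look like the sketch below; the parameter choices (inertia weight, acceleration coefficients, swarm size) are standard textbook defaults, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(4)

def objective(x):
    # Benchmark: sphere function, minimum 0 at the origin.
    return (x ** 2).sum(axis=-1)

# Standard PSO with inertia weight.
n_particles, dim, iters = 30, 5, 200
w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive and social coefficients

pos = rng.uniform(-5, 5, size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()                 # each particle's best position so far
pbest_val = objective(pos)
gbest = pbest[pbest_val.argmin()].copy()  # swarm-wide best position

for _ in range(iters):
    r1, r2 = rng.uniform(size=(2, n_particles, dim))
    # Velocity update: pull toward personal best and global best.
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    vals = objective(pos)
    improved = vals < pbest_val
    pbest[improved] = pos[improved]
    pbest_val[improved] = vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

best_value = pbest_val.min()
```

Note that all particles' objective values can be evaluated independently each iteration, which is what makes the algorithm attractive for a parallel computing environment.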
Surveys about media consumption behavior are the core of many of our products. The biggest and most time-consuming part for respondents is the set of questions related to the reach and frequency of specific media channels. Due to the large number of channels, we are forced to ask respondents about only a subset of them. Experiments show that we are able to collect much more, and more accurate, information when asking people to assess group behavior rather than individual behavior.
The goal of this project is to combine two interviewing techniques, which will allow respondents to describe their media behavior in a non-traditional way.
By doing so we can assess respondents' meta-knowledge about the population distribution rather than their own behavior, and correct for individual errors. This project includes combining these two treatments in a two-by-two experimental design and comparing the results with data collected by our traditional questionnaire design.
*Note: this project requires primary data collection and funding; these can be considered only after a complete research proposal has been submitted.
If you are interested in a thesis internship, feel free to apply. We would like to find out if you are up for this challenge, so please include your CV, a motivation letter, and a list of grades.