Examples
JASPAR is a widely used transcription factor motif database from which motif count
matrices can be downloaded for a large variety of organisms and transcription factors. Numerous other motif
databases exist as well (TRANSFAC, CIS-BP, MEME, HOMER, WORMBASE, etc.), most of which use a similar format
for their motifs. Typically, a motif file consists of four rows or columns, one per base, with each entry in a given row or column corresponding
to a position within the motif. Sometimes there is a comment line starting with >. The row or column order is always A, C, G, T.
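As a minimal sketch of the layout described above, the following plain-Python helper reads a four-row count matrix into a dictionary keyed by base. The function name parse_pfm and the toy motif are hypothetical and are not part of tfm_utils; the sketch assumes rows in A, C, G, T order and an optional comment line starting with >.

```python
def parse_pfm(text):
    """Parse a four-row count matrix into {base: [counts]}.

    Hypothetical helper, not part of tfm_utils. Assumes rows are
    ordered A, C, G, T and that an optional header starts with '>'.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and the optional comment line
        rows.append([float(tok) for tok in line.split()])
    if len(rows) != 4:
        raise ValueError(f"expected 4 rows, found {len(rows)}")
    if len({len(row) for row in rows}) != 1:
        raise ValueError("rows cover different numbers of positions")
    return dict(zip("ACGT", rows))


# Toy three-position motif, made up for illustration only.
example = """>toy motif
3 0 1
0 4 1
1 0 1
0 0 1
"""
counts = parse_pfm(example)
print(counts["A"])  # [3.0, 0.0, 1.0]
```

Checking the row count and row lengths up front catches files that store the motif in the column orientation or that are truncated, before the values are handed to any scoring function.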
In the following example, the motif file contains four rows, one per base, each holding the counts for that base at the 22 positions of the motif.
>>> path = "MA0007.1.pfm"
>>> ! cat MA0007.1.pfm
>MA0007.1 Ar
9.00 9.00 11.00 16.00 0.00 12.00 21.00 0.00 15.00 4.00 5.00 6.00 3.00 0.00 4.00 11.00 1.00 3.00 6.00 6.00 10.00 5.00
7.00 2.00 3.00 1.00 0.00 6.00 2.00 24.00 0.00 9.00 11.00 9.00 5.00 0.00 0.00 5.00 22.00 16.00 7.00 5.00 11.00 11.00
2.00 3.00 3.00 7.00 22.00 1.00 0.00 0.00 8.00 2.00 2.00 9.00 1.00 24.00 0.00 1.00 1.00 1.00 5.00 9.00 0.00 6.00
6.00 10.00 7.00 0.00 2.00 5.00 1.00 0.00 1.00 9.00 6.00 0.00 15.00 0.00 20.00 7.00 0.00 4.00 6.00 4.00 3.00 2.00
>>> from tfm_utils import tfmp
>>> m = tfmp.create_matrix("MA0045.pfm")
>>> tfmp.score2pval(m, 8.7737)
9.992625564336777e-06
>>> tfmp.pval2score(m, 0.00001)
8.773708000000001
This can also be done using pandas, as follows:
>>> import pandas as pd
>>> import tfm_utils
>>> df = pd.read_csv("tests/M08490_1.94d.txt", sep = "\t", index_col=0)
>>> df.head()
            A         C         G         T
Pos
1    0.215492  0.220404  0.340647  0.223457
2    0.534211  0.101312  0.330926  0.033551
3    0.000000  0.000191  0.000286  0.999523
4    0.014867  0.000000  0.756531  0.228601
5    0.999333  0.000000  0.000000  0.000667
>>> matrix = tfm_utils.df_to_matrix(df)
>>> matrix
<tfm_utils.pytfmpval.Matrix; proxy of <Swig Object of type 'Matrix *' at 0x7fc2f8bfa330> >
>>> tfm_utils.score2pval(matrix, 7.14)
3.516674041748047e-06
The DataFrame can also be passed to the functions directly, in either orientation (bases as columns, as above, or bases as rows, as here):
>>> df.head()
Pos         1         2         3         4         5         6        7    8         9   10        11        12
A    0.215492  0.534211  0.000000  0.014867  0.999333  0.000000  0.03061  0.0  0.226232  1.0  0.035516  0.221931
C    0.220404  0.101312  0.000191  0.000000  0.000000  0.968450  0.00000  0.0  0.758146  0.0  0.328779  0.327673
G    0.340647  0.330926  0.000286  0.756531  0.000000  0.000508  0.96939  0.0  0.000000  0.0  0.101733  0.227845
T    0.223457  0.033551  0.999523  0.228601  0.000667  0.031042  0.00000  1.0  0.015622  0.0  0.533973  0.222551
>>> tfm_utils.score2pval(df, 7.14)
5.960464477539063e-08
If you are more used to R, the same result can be obtained by building a string for the matrix, concatenating its rows (or columns), and passing it to the read_matrix() function.
This method is usually easier, as it lets the user parse the motif file as needed to ensure proper input. It is also better suited to high-throughput use.
>>> from tfm_utils import tfmp
>>> mat = (" 3 7 9 3 11 11 11 3 4 3 8 8 9 9 11 2"
... " 5 0 1 6 0 0 0 3 1 4 5 1 0 5 0 7"
... " 4 3 1 4 3 2 2 2 8 6 1 4 2 0 3 0"
... " 2 4 3 1 0 1 1 6 1 1 0 1 3 0 0 5"
... )
>>> m = tfmp.read_matrix(mat)
>>> tfmp.pval2score(m, 0.00001)
8.773708000000001
>>> tfmp.score2pval(m, 8.7737)
9.992625564336777e-06
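For high-throughput use, the matrix string is usually built from already parsed rows rather than typed by hand. The sketch below shows one way to do that; rows_to_matrix_string is a hypothetical helper, not part of tfm_utils, and it assumes the rows are given in A, C, G, T order as in the example above. The call to read_matrix() is left as a comment so the block runs without the library installed.

```python
def rows_to_matrix_string(rows):
    """Join four rows of counts (A, C, G, T) into one space-separated string.

    Hypothetical helper, not part of tfm_utils.
    """
    if len(rows) != 4:
        raise ValueError("expected one row per base: A, C, G, T")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must cover the same number of positions")
    # Rows are concatenated in order, matching the string built by hand above.
    return " ".join(str(value) for row in rows for value in row)


# The same 16-position count matrix used in the example above.
rows = [
    [3, 7, 9, 3, 11, 11, 11, 3, 4, 3, 8, 8, 9, 9, 11, 2],  # A
    [5, 0, 1, 6, 0, 0, 0, 3, 1, 4, 5, 1, 0, 5, 0, 7],      # C
    [4, 3, 1, 4, 3, 2, 2, 2, 8, 6, 1, 4, 2, 0, 3, 0],      # G
    [2, 4, 3, 1, 0, 1, 1, 6, 1, 1, 0, 1, 3, 0, 0, 5],      # T
]
mat = rows_to_matrix_string(rows)
print(len(mat.split()))  # 64 values: 4 bases x 16 positions

# With tfm_utils installed, the string can then be used as shown earlier:
# from tfm_utils import tfmp
# m = tfmp.read_matrix(mat)
```

Validating the shape before joining keeps a malformed motif from silently producing a string with the wrong number of values.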