Contex - Contextual string manipulation
=======================================

Abstract
---------

This package provides ``contex.rules``, an interface which enables a very declarative form of string
manipulation, where you can manipulate a string "in one go" in sophisticated ways.

This library also provides two related abstractions, ``StringContext`` and ``MatchContext``, which
can be used for a more stateful manipulation of strings. I recommend using ``contex.rules`` as I
think that makes for more readable code. Nevertheless, those abstractions are well
documented and might usefully serve as building blocks. Indeed, ``contex.rules`` is implemented on
top of them.

The problem with our interfaces for string manipulation
-------------------------------------------------------

My motivation for creating this package was a task in which I had to
change strings such as ``'1_Photo032-2008.jpg'`` into ``'1_Photo031-2008.jpg'``. All the numbers could vary
between filenames, and it seemed like I always had to do something inelegant to accomplish this. Maybe
it was to match the various parts and stitch them back together:

.. code-block:: python

    >>> import re
    >>> match = re.fullmatch(r'(\d+)_Photo(\d+)-(\d+)\.jpg', '1_Photo032-2008.jpg')
    >>> '{}_Photo{}-{}.jpg'.format(match.group(1), '{:0>3}'.format(int(match.group(2))-1), match.group(3))
    '1_Photo031-2008.jpg'

Or using ``re.sub`` with a lookahead (a non-consuming regex group) to match the correct area of the string:

.. code-block:: python

    >>> re.sub(r'(\d+)(?=-\d+\.jpg)', lambda m: '{:0>3}'.format(int(m.group(1))-1), '1_Photo032-2008.jpg')
    '1_Photo031-2008.jpg'

Shouldn't this be simpler? Describing that string with a regular expression is simple enough, and I'm
only changing one little part of the string, so why do I have to fiddle around with indices, and why do
I have to sacrifice readability? Most importantly, why do I have to experience this aesthetic pain deep
in my heart?

First attempt: stateful manipulation
------------------------------------

My first idea was that our abstractions aren't fit for this sort of problem. Strings are flat: they
have no sense of context, and if you pull out a substring it takes special effort to stitch it
back in. The solution? Just keep track of the ``before`` and the ``after``:

.. code-block:: python

    >>> import contex
    >>> view = contex.match('1_Photo032-2008.jpg', r'\d+_Photo(?P<number>\d+)-\d+\.jpg')
    >>> view
    <MatchContext object; tup=('', '1_Photo032-2008.jpg', '')>
    >>> view.group('number')
    <MatchContext object; tup=('1_Photo', '032', '-2008.jpg')>
    >>> result = view.group('number').replace(lambda n: '{:0>3}'.format(int(n)-1))
    >>> result
    <MatchContext object; tup=('1_Photo', '031', '-2008.jpg')>
    >>> str(result)
    '1_Photo031-2008.jpg'

This way I can move the "focus point" of the string around with methods such as ``.group``, manipulate that space,
and convert it back to a ``str`` when I'm done. I can even manipulate more than one area of the string:

.. code-block:: python

    >>> view = contex.match('1_Photo032-2008.jpg', r'\d+_Photo(?P<number>\d+)-(?P<year>\d+)\.jpg')
    >>> view.group('number').replace('').group('year').replace(lambda y: y[-2:])
    <MatchContext object; tup=('1_Photo-', '08', '.jpg')>

``MatchContext`` keeps track of where the matched regular expression groups are: even though I removed the
content of the "number" group, ``MatchContext`` knows where to find and replace the "year" group. It can also
deal with nested regex groups, 0-length matches, and so on.

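
The before/focus/after idea can also be sketched with nothing but the standard ``re`` module. The
``focus_on`` helper below is hypothetical and only illustrates the bookkeeping involved; it is not
how ``MatchContext`` is implemented:

.. code-block:: python

    import re

    def focus_on(string, pattern, group):
        """Hypothetical helper: split ``string`` around one matched group."""
        match = re.fullmatch(pattern, string)
        if match is None:
            raise ValueError('pattern does not match the whole string')
        start, end = match.span(group)
        return string[:start], string[start:end], string[end:]

    before, focus, after = focus_on(
        '1_Photo032-2008.jpg', r'\d+_Photo(?P<number>\d+)-\d+\.jpg', 'number')
    # Only the focus changes; before and after are stitched back unchanged.
    result = before + '{:0>3}'.format(int(focus) - 1) + after
    print((before, focus, after))  # ('1_Photo', '032', '-2008.jpg')
    print(result)                  # 1_Photo031-2008.jpg

The tuple has the same shape as the ``tup`` shown in the ``MatchContext`` repr above.
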
.. note::

    Previously (v2.0.1 and earlier) I allowed arbitrary slicing on ``MatchContext`` objects to select the focus
    point, in addition to the ``.group`` method. This was a mistake. When you're dealing with 0-length slices and
    adjacent regex groups that matched 0-length strings, serious problems of semantics arise. I found out
    that the expected semantics are inextricably linked to which regex group you previously selected with ``.group``,
    and therefore had to disallow slicing for ``MatchContext`` objects.

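
A small sketch of the kind of ambiguity involved, using only the standard ``re`` module. It is an
illustration and not taken from the library's code; the pattern and group names are made up for the
example:

.. code-block:: python

    import re

    # Two adjacent groups that can both match the empty string.
    match = re.match(r'(?P<digits>\d*)(?P<letters>[a-z]*)', '-')
    digits_span = match.span('digits')
    letters_span = match.span('letters')
    print(digits_span, letters_span)  # (0, 0) (0, 0)

    # Both groups occupy the same 0-length slice, so the slice alone
    # cannot say which group an insertion at that point belongs to.
    same_slice = digits_span == letters_span

Selecting by group name keeps that information, whereas a bare slice such as ``[0:0]`` does not.
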
Removing the state: Vive la Revolution
--------------------------------------

The ``MatchContext`` abstraction certainly is an improvement for these particular types of problems, but
it has one downside: it adds an additional layer of state to ordinary strings.
The programmer must remember which part of the string is in "focus", or, in other words, which state the
string is in.

So my next challenge was to eliminate the state. What I found out was that the state is needed or useful
only in rare cases, and this led me to believe that the fundamental problem isn't really the abstractions we
use for representing strings, but rather the interfaces we have for manipulating them. Thus, pardon the pun,
enter ``contex.rules``:

.. code-block:: python

    >>> contex.rules(r'\d+_Photo(?P<number>\d+)-(?P<year>\d+)\.jpg', {
    ...     'number': lambda n: '{:0>3}'.format(int(n) - 1),
    ...     'year': lambda y: y[-2:]
    ... }).apply('1_Photo032-2008.jpg')
    '1_Photo031-08.jpg'

Or maybe I want to change the layout of the filename completely:

.. code-block:: python

    >>> contex.rules(r'(\d+)_Photo(?P<number>\d+)-(?P<year>\d+)\.jpg', {
    ...     'number': lambda n: int(n) - 1,
    ...     'year': lambda y: y[-2:]
    ... }).expand('1_Photo032-2008.jpg', 'Photo_{1}_{number:0>3}-{year}.jpeg')
    'Photo_1_031-08.jpeg'

The string manipulation is done in one go. The programmer doesn't need to remember where the focus point is
right now, or specify which order to do the replacements in. This is a much more *declarative* interface: you
tell it what the string looks like and what changes you want made, and it figures out the rest. You don't need to
stitch the pieces back together, and because of that you can write more readable regular expressions as well.

Nested regex groups are also allowed: the nested one will be replaced first (which makes a difference if
the replacement for the outer group is a callable).

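
To make the "one go" idea concrete, here is a minimal sketch of a rules-style helper built on plain
``re``. The name ``apply_rules`` is hypothetical, and this is not how ``contex.rules`` is implemented:
it assumes the groups do not overlap or nest, and it ignores the 0-length corner cases mentioned earlier.

.. code-block:: python

    import re

    def apply_rules(pattern, rules, string):
        """Hypothetical helper: rewrite several matched groups in one pass."""
        match = re.fullmatch(pattern, string)
        if match is None:
            raise ValueError('pattern does not match the whole string')
        # Collect (start, end, replacement) for every group that took part.
        edits = []
        for name, func in rules.items():
            start, end = match.span(name)
            if start == -1:  # the group did not participate in the match
                continue
            edits.append((start, end, str(func(match.group(name)))))
        # Apply from right to left so the remaining spans keep their offsets.
        for start, end, text in sorted(edits, reverse=True):
            string = string[:start] + text + string[end:]
        return string

    result = apply_rules(
        r'\d+_Photo(?P<number>\d+)-(?P<year>\d+)\.jpg',
        {'number': lambda n: '{:0>3}'.format(int(n) - 1),
         'year': lambda y: y[-2:]},
        '1_Photo032-2008.jpg',
    )
    print(result)  # 1_Photo031-08.jpg

Working from right to left is what lets the caller list the rules in any order: a replacement that
changes the length of one group never shifts the span of a group to its left.
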
More advanced example
^^^^^^^^^^^^^^^^^^^^^

Here's an example using ``re.search`` (as opposed to ``re.fullmatch``, which is the default):

.. code-block:: python

    >>> contex.rules(r'(?P<millennium>\d)\d{3}', {
    ...     'millennium': lambda s: int(s)+1,
    ...     0: lambda y: '<span class="year">{}</span>'.format(y)
    ... }, method=re.search).apply('Current year: 2015')
    'Current year: <span class="year">3015</span>'

Notice that the ``'millennium'`` group is replaced before the ``0`` group.

``contex.rules`` is explained in more detail in its very long docstring.

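
The inner-before-outer ordering can be spelled out by hand with plain ``re``. The function below is a
hypothetical illustration of that ordering for this one pattern, not the library's code:

.. code-block:: python

    import re

    def bump_millennium(text):
        """Hypothetical helper: replace the inner group, then wrap group 0."""
        match = re.search(r'(?P<millennium>\d)\d{3}', text)
        if match is None:
            return text
        year = match.group(0)
        # Inner group first; its offsets are taken relative to group 0.
        start = match.start('millennium') - match.start(0)
        end = match.end('millennium') - match.start(0)
        year = year[:start] + str(int(match.group('millennium')) + 1) + year[end:]
        # Outer group second: it sees the already modified text.
        wrapped = '<span class="year">{}</span>'.format(year)
        return text[:match.start(0)] + wrapped + text[match.end(0):]

    output = bump_millennium('Current year: 2015')
    print(output)  # Current year: <span class="year">3015</span>

Had the outer replacement run first, the span would have wrapped ``2015`` and the inner group's
position would have had to be found again inside the new markup.
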
Doubtful stability
------------------

In order to retrieve certain information about the regular expressions, and thereby resolve ambiguities
related to 0-length matches and so on, I found it necessary to use ``sre_parse.parse`` to parse them. This is
an "internal support module" or something like that, and the stability of this library becomes doubtful as a result.
My judgement was that it would take a lot of time and effort to create my own parser for Python regular expressions,
and that I could easily introduce bugs in that parser too.

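
For comparison, the public interface of a compiled pattern exposes only a little metadata about its
groups. The snippet below merely illustrates that public surface; which extra details ``contex`` reads
from ``sre_parse`` is not shown here.

.. code-block:: python

    import re

    compiled = re.compile(r'(?P<millennium>\d)\d{3}')
    group_count = compiled.groups            # number of capturing groups
    group_names = dict(compiled.groupindex)  # group name -> group number
    print(group_count, group_names)  # 1 {'millennium': 1}

Neither attribute says how the groups are nested or whether a group can match the empty string.
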
Conclusion
----------

I hope that the examples of ``contex.rules`` I have given are sufficiently intuitive so that any programmer can look
at them and infer pretty accurately what they do, because the whole point of this endeavor is to increase readability.
Furthermore, I'd be interested to see if other people can take this idea ``^\w{7}``

Using Contex
------------

The ``contex`` package contains 5 functions:

- ``rules(regex, rule_dict, method=re.fullmatch, flags=0)`` for declarative string manipulation.
- ``T(string)`` for converting a string into a ``StringContext`` object.
- ``search(string, pattern, flags=0)`` and
- ``match(string, pattern, flags=0)`` for regex searches (with the same semantic difference as in the ``re`` module).
  They both return a ``MatchContext`` object.
- ``find(string, substring, right_side=False)`` for finding a substring; it returns a ``StringContext`` object.

``contex`` also contains the ``StringContext`` and ``MatchContext`` classes.

Installing
----------

``contex`` should work in both Python 2.7 and 3.

Install with ``$ pip install contex``. If you want to install for Python 3 you might want to replace ``pip`` with ``pip3``, depending on how your system is configured.

Developing
----------

Contex is documented and tested. Run ``$ nosetests`` or
``$ python3 setup.py test`` to run the tests. The code is hosted at https://notabug.org/Uglemat/Contex

License
-------

The library is licensed under the GNU General Public License 3 or later.

This README file is public domain.